shavite is faster; echo doesn't really change, due to the register overload
These changes allow custom launch bounds without other code changes and improve
portability across different devices.
Also set a minimum throughput of 1024 for these algos (shared mem required size)
But use a define in AES to enable or disable the initial device memcpy
I already tried using direct device constants everywhere,
and it's not faster for big arrays (the difference is small)
also change launch bounds to reduce spills (72 regs)
to check on windows too, could improve the perf... or not
Added to most algos: the checkhash function scans a big range
and can find multiple nonces at once if the difficulty is low.
Stop ignoring them, submit the second one if found...
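
A minimal host-side sketch of the idea, in plain C rather than a CUDA kernel. `toy_hash`, `scan_range` and the 32-bit target comparison are hypothetical stand-ins, not the real checkhash code; the point is only that one scan of a range can return more than one valid nonce:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a real hash function: a simple 32-bit mixer. */
static uint32_t toy_hash(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x7feb352dU;
    x ^= x >> 15;
    x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

/*
 * Scan `count` nonces starting at `first` and store, in scan order, up to
 * `max_found` nonces whose hash is <= target. Returns the number stored.
 */
size_t scan_range(uint32_t first, uint32_t count, uint32_t target,
                  uint32_t *found, size_t max_found)
{
    size_t n = 0;

    assert(found != NULL || max_found == 0);
    for (uint32_t i = 0; i < count && n < max_found; i++) {
        uint32_t nonce = first + i;
        if (toy_hash(nonce) <= target)
            found[n++] = nonce;
    }
    return n;
}
```

With `max_found` set to 2, the caller can submit `found[0]` and, when the return value is 2, `found[1]` as well. A lower difficulty means a larger target, so more nonces in the range qualify.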
Clean the draft code for rc=2 implemented for blake and pentablake
btw... fix the reduced displayed hashrate when a nonce is found...
Signed-off-by: Tanguy Pruvot <tanguy.pruvot@gmail.com>
Tested on x11, which sometimes finds 3 nonces in one call;
currently they are ignored because only the biggest is kept...
This commit doesn't fix that, but will allow improving the share rate later...
heavy: reduce default intensity by 256 threads to allow -i 20
cuda: put static thread init bools outside the code (made once)
api: fix nvml header to build without
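
One possible reading of the "static thread init bools" item, sketched in plain C: the per-thread init flags live at file scope, so the costly setup runs once per thread and later scans skip it. `MAX_GPUS`, `init_done`, `costly_init` and `scan_once` are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_GPUS 16  /* hypothetical upper bound on mining threads */

/* File-scope flags: one per thread, zero-initialised at program start. */
static bool init_done[MAX_GPUS];
static int init_calls;  /* counts how often the costly setup actually ran */

/* Hypothetical placeholder for per-device setup (allocations, uploads). */
static void costly_init(int thr_id)
{
    (void)thr_id;
    init_calls++;
}

/*
 * Called for every scan; performs the setup only on the first call for a
 * given thread. Returns the total number of setups done so far.
 */
int scan_once(int thr_id)
{
    assert(thr_id >= 0 && thr_id < MAX_GPUS);
    if (!init_done[thr_id]) {
        costly_init(thr_id);
        init_done[thr_id] = true;
    }
    return init_calls;
}
```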
Sample with -i 18.5
Adding 131072 threads to intensity 18, 393216 cuda threads
And with -i 19.5
Adding 262144 threads to intensity 19, 786432 cuda threads
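
The two samples are consistent with the fractional part adding a proportional share of 2^n threads on top of the integer intensity n. A sketch of that arithmetic in C, assuming the added threads are rounded down to a multiple of 256 (an assumption, not confirmed by the log above); `threads_for_intensity` is a hypothetical name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a possibly fractional intensity into a CUDA thread count:
 * 2^whole threads, plus frac * 2^whole extra threads rounded down to a
 * multiple of 256 (the rounding rule is an assumption).
 */
uint32_t threads_for_intensity(double intensity)
{
    assert(intensity >= 8.0 && intensity < 32.0);

    uint32_t whole = (uint32_t)intensity;          /* integer part */
    double frac = intensity - (double)whole;       /* fractional part */
    uint32_t threads = 1u << whole;
    /* number of 256-thread blocks to add, truncated toward zero */
    uint32_t blocks = (uint32_t)(frac * (double)(1u << (whole - 8)));

    return threads + blocks * 256u;
}
```

For -i 18.5 this gives 262144 + 131072 = 393216 threads, and for -i 19.5 it gives 524288 + 262144 = 786432, matching the samples.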
Small echo rewrite. +10KHASH on the 650 (compute 3.0)
tpruvot: add Linux Makefile - Force to 80 registers (else -30KH/s)
Note : the hashrate seems more constant with this change
Was maybe my fault, but the benchmark mode was
always recomputing from nonce 0.
Also fix blake if -d 1 is used (one thread but second gpu)
stats: do not use thread id as key, prefer gpu id...
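
Why the key matters, as a small C sketch: with -d 1 there is a single thread (id 0) running on the second GPU (id 1), so stats keyed by thread id would be filed under the wrong device. `device_map`, `khs_by_gpu` and the three functions are hypothetical names for illustration only:

```c
#include <assert.h>

#define MAX_GPUS 16  /* hypothetical upper bound */

static int device_map[MAX_GPUS];     /* thread id -> gpu id */
static double khs_by_gpu[MAX_GPUS];  /* last hashrate, keyed by gpu id */

/* Record which GPU a mining thread drives. */
void stats_set_device(int thr_id, int gpu_id)
{
    assert(thr_id >= 0 && thr_id < MAX_GPUS);
    assert(gpu_id >= 0 && gpu_id < MAX_GPUS);
    device_map[thr_id] = gpu_id;
}

/* Store a hashrate sample under the GPU id, not the thread id. */
void stats_record(int thr_id, double khs)
{
    assert(thr_id >= 0 && thr_id < MAX_GPUS);
    khs_by_gpu[device_map[thr_id]] = khs;
}

/* Read back the last sample for a GPU. */
double stats_get(int gpu_id)
{
    assert(gpu_id >= 0 && gpu_id < MAX_GPUS);
    return khs_by_gpu[gpu_id];
}
```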
The previous echo commit only increased Linux performance and reduced
Windows performance compared to 1.4.9; this one seems to at least match
1.4.9 on Windows, and gives the same on Linux...
Shavite optimisation seems ok on both (now uses 64 registers)
The launch_bounds will force the number of registers, so remove the specific
Makefile rules on Linux...
manual "cherry pick" with fixed line endings and some adaptations
Original Commit :
Removed shared memory and reduced calculations with precomputation (ECHO hash).
750ti + 20KHASH(x11)
tpruvot notes:
Real change is more like 10 KH/s on stock clocks (but real)
launch bounds disabled, no perf increase with 64 registers
echo : 40.056ms -> 39.241ms
cube : 14.490ms -> 13.511ms
The cube hash change looks useless (__device__ code is generally inlined)
but reality proves that the CUDA documentation is wrong...
tpruvot: fixed DOS line endings in echo,
and used my style for cuda function attributes
Based on klaus commits, will slightly increase the speed of most algos
PS: the main increase is due to the register count tuning in the Makefile
and for skein512 on Linux, it's the ROTL64
but almost no changes on X11 : 2648MH/s vs 2630 before
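
For reference, what ROTL64 computes, as a portable C sketch. This is not the tuned device version the note refers to (that code is not shown here); `rotl64` is a hypothetical name and the masking is just one safe way to write the rotation:

```c
#include <stdint.h>

/*
 * Rotate a 64-bit value left by n bits. The count is masked to 0..63 so
 * that n == 0 and n >= 64 never produce an out-of-range shift.
 */
uint64_t rotl64(uint64_t x, unsigned n)
{
    n &= 63u;
    return (x << n) | (x >> ((64u - n) & 63u));
}
```

Skein-style mixing performs many such rotations per block, which is presumably why the way this one operation is written has a visible effect on the skein512 hashrate.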