fix some algo weird hashrates (like blake)
and reset device between algos, for better accuracy
but this reset doesnt seems enough to bench all algos correctly...
to test on linux, could be a driver issue...
heavy: fix first alloc and indent with tabs...
import and keep my code for older archs, like skein 64
reduce the gap between our versions...
+150kH x11 GTX 960 / +30kH 750Ti
+900kH quark GTX 960 / +230kH 750Ti
Quark and S3 are now a bit faster (+1 %)
x11 get +0.6 % (+20kH/s on a 750ti, +30kH on a 960)
80 bytes implementation to do/test ... (skein/skein2)
but keep my previous version for older devices...
reduce "false" warnings, and ignore unrelated/small ones <= 1 MB
On windows the gpu memory can be allocated by other processes
+ some cleanup in algos... (free/gpulog)
shavite is faster, echo doesn't really change due to the reg. overload
This changes allow custom lauchbounds without other code changes and improve
the portability against different devices.
also set a minimum throughput to 1024 for these algos (shared mem req. size)
But use a define in AES to use or not device initial memcpy
I already tried to use everywhere direct device constants
and its not faster for big arrays (difference is small)
also change launch bounds to reduce spills (72 regs)
to check on windows too, could improve the perf... or not
Added to most algos, checkhash function scans a big range
and can find multiple nonces at once if the difficulty is low.
Stop ignoring them, submit second one if found...
Clean the draft code for rc=2 implemented for blake and pentablake
btw... fix the reduced displayed hashrate when a nonce is found...
Signed-off-by: Tanguy Pruvot <tanguy.pruvot@gmail.com>
Tested on x11 which find sometimes 3 nonces in one call,
actually they are ignored because only the biggest was kept...
This commit doesnt fix that, but will allow to enhance shares rate later...
heavy: reduce by 256 threads default intensity to all -i 20
cuda: put static thread init bools outside the code (made once)
api: fix nvml header to build without
Sample with -i 18.5
Adding 131072 threads to intensity 18, 393216 cuda threads
And with -i 19.5
Adding 262144 threads to intensity 19, 786432 cuda threads