shavite is faster, echo doesn't really change due to register pressure
These changes allow custom launch bounds without other code changes and
improve portability across different devices.
also set a minimum throughput of 1024 for these algos (shared mem req. size)
But add a define in AES to enable or disable the initial device memcpy
I already tried using direct device constants everywhere
and it's not faster for big arrays (the difference is small)
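
A minimal sketch of the two table-initialisation options compared above,
assuming a compile-time switch. The names AES_INIT_MEMCPY, d_aes_t0,
h_aes_t0 and aes_table_init are hypothetical, and the four values are
placeholders standing in for a full 256-entry table; this is not the
actual code of the commit.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical switch: 1 = copy the table from the host at init time,
// 0 = rely on a statically initialised device constant.
#ifndef AES_INIT_MEMCPY
#define AES_INIT_MEMCPY 1
#endif

// Placeholder values; a real lookup table would have 256 entries.
#define AES_TABLE_VALUES { 0xA56363C6u, 0x847C7CF8u, 0x997777EEu, 0x8D7B7BF6u }

#if AES_INIT_MEMCPY
static const uint32_t h_aes_t0[4] = AES_TABLE_VALUES;  // host copy
__constant__ uint32_t d_aes_t0[4];                     // filled at init
#else
__constant__ uint32_t d_aes_t0[4] = AES_TABLE_VALUES;  // direct constant
#endif

// Call once per device before launching kernels that read d_aes_t0.
cudaError_t aes_table_init(void)
{
#if AES_INIT_MEMCPY
    return cudaMemcpyToSymbol(d_aes_t0, h_aes_t0, sizeof(h_aes_t0), 0,
                              cudaMemcpyHostToDevice);
#else
    return cudaSuccess;  // nothing to copy, the table is already in place
#endif
}
```

Building with -DAES_INIT_MEMCPY=0 selects the direct-constant variant, so
both paths can be timed without touching the kernels.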
also change launch bounds to reduce spills (72 regs)
to be checked on Windows too, could improve the perf... or not
The previous echo commit only increased Linux performance and reduced
Windows perf compared to 1.4.9; this one seems to give at least
the 1.4.9 speed on Windows, and the same on Linux...
Shavite optimisation seems ok on both (now uses 64 registers)
the launch_bounds will constrain the number of registers, so remove the
specific Makefile rules on Linux...
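
As an illustration of how launch bounds can stand in for a per-file
register limit, here is a hedged sketch. DEMO_TPB, DEMO_MIN_BLOCKS,
demo_hash_kernel and demo_hash_launch are hypothetical names, the kernel
body is a stand-in, and the values are not those of this commit. On a
device with 65536 registers per multiprocessor, asking for 128 x 8
resident threads leaves the compiler at most 64 registers per thread.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical tunables; override with -DDEMO_TPB=... at build time.
#ifndef DEMO_TPB
#define DEMO_TPB 128          // max threads per block used at launch
#endif
#ifndef DEMO_MIN_BLOCKS
#define DEMO_MIN_BLOCKS 8     // desired resident blocks per multiprocessor
#endif

// The compiler limits register use so that DEMO_MIN_BLOCKS blocks of
// DEMO_TPB threads can be resident at once, spilling if it has to.
__global__ __launch_bounds__(DEMO_TPB, DEMO_MIN_BLOCKS)
void demo_hash_kernel(uint32_t threads, uint32_t *out)
{
    uint32_t thread = blockDim.x * blockIdx.x + threadIdx.x;
    if (thread < threads)
        out[thread] = thread * 2654435761u;  // stand-in for hash rounds
}

// Host wrapper: the block size must not exceed DEMO_TPB.
cudaError_t demo_hash_launch(uint32_t threads, uint32_t *d_out)
{
    dim3 block(DEMO_TPB);
    dim3 grid((threads + DEMO_TPB - 1) / DEMO_TPB);
    demo_hash_kernel<<<grid, block>>>(threads, d_out);
    return cudaGetLastError();
}
```

Because the limit lives next to the kernel, it follows the source file to
any build system, which is the portability gain over a Makefile rule.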
manual "cherry pick" with fixed line endings and some adaptations
echo : 40.056ms -> 39.241ms
cube : 14.490ms -> 13.511ms
cube hash change looks useless (__device__ code is generally inlined)
but reality proves that the CUDA documentation is wrong...
tpruvot: fixed DOS line endings in echo,
and used my style for CUDA function attributes