import and keep my code for older archs, like skein 64
reduce the gap between our versions...
+150kH x11 GTX 960 / +30kH 750Ti
+900kH quark GTX 960 / +230kH 750Ti
heavy: reduce by 256 threads default intensity to all -i 20
cuda: put static thread init bools outside the code (made once)
api: fix nvml header to build without