But use a define in AES to enable or disable the initial device memcpy
I already tried using direct device constants everywhere
and it's not faster for big arrays (the difference is small)
Also change the launch bounds to reduce spills (72 regs)
To be checked on Windows too, it could improve the perf... or not
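
A minimal sketch of the two changes described above, not the actual AES code from this commit. The macro name AES_USE_SHARED_COPY, the kernel, the placeholder table and the launch-bounds values are all assumptions made for illustration. It shows a compile-time define that selects between copying a constant table into shared memory at kernel start and reading the device constant directly, plus a __launch_bounds__ qualifier, which is the knob that influences the per-thread register budget and therefore spills.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical switch (name is an assumption): 1 = copy the table into
// shared memory at kernel start, 0 = read the __constant__ table directly.
#ifndef AES_USE_SHARED_COPY
#define AES_USE_SHARED_COPY 1
#endif

#define TPB 256  // threads per block, illustrative value only

// Placeholder 256-entry table standing in for an AES lookup table.
__constant__ uint32_t d_table[256];

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) lets the
// compiler budget registers per thread; the values here are assumptions.
__global__ void __launch_bounds__(TPB, 2)
lookup_kernel(const uint8_t *in, uint32_t *out, int n)
{
#if AES_USE_SHARED_COPY
    // Initial device-side copy: each thread loads part of the table.
    __shared__ uint32_t s_table[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        s_table[i] = d_table[i];
    __syncthreads();
    const uint32_t *table = s_table;
#else
    // Direct use of the device constant, no initial copy.
    const uint32_t *table = d_table;
#endif
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = table[in[idx]];
}

// Host helper: runs the kernel once and returns the number of mismatches
// against a CPU lookup, or -1 on a CUDA error.
int run_lookup_demo()
{
    const int n = 1024;
    uint32_t h_table[256];
    uint8_t  h_in[n];
    uint32_t h_out[n];
    for (int i = 0; i < 256; ++i) h_table[i] = 0x01010101u * i;
    for (int i = 0; i < n; ++i)   h_in[i] = (uint8_t)(i * 7);

    uint8_t  *d_in  = nullptr;
    uint32_t *d_out = nullptr;
    if (cudaMemcpyToSymbol(d_table, h_table, sizeof(h_table)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_in, n) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(uint32_t)) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in, n, cudaMemcpyHostToDevice);

    lookup_kernel<<<(n + TPB - 1) / TPB, TPB>>>(d_in, d_out, n);
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(uint32_t),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_table[h_in[i]]) ++mismatches;
    return mismatches;
}
```

Building with -DAES_USE_SHARED_COPY=0 selects the direct-constant path, so both variants can be timed from the same source. Compiling with nvcc -Xptxas -v prints the register count and spill sizes per kernel, which is one way to check the effect of a launch-bounds change on each platform.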