5% improvement by the vshl asm swap functions, mixed shl+add inst.,
Add also xchg(x, y) func and XCHG(x, y) define in cuda_helper for later use...
other jh changes are mainly for the beauty of the code...
Signed-off-by: Tanguy Pruvot <tanguy.pruvot@gmail.com>