the main improvement is to reduce asm calls to read global mem
but, a few more regs are used (68 mini vs 64 on SM 5.2)
so reduce the forced launch bounds to allow 80 or 128 regs per thread
Note: cuda 6.5 seems not able to store with v4.u32... (7.5 is fine)
st.global.v4.u32 [%rd2], {%r3783, %r3824, %r3823, %r3822};
st.global.v2.u32 [%rd2+16], {%r3821, %r3820};
st.global.u32 [%rd2+24], %r3819;
st.global.u32 [%rd2+28], %r3818;
st.global.u32 [%rd2+44], %r3814;
st.global.u32 [%rd2+40], %r3815;
...
todo, check alexis variant.. but wanted to keep this code before in git...
5% improvement by the vshl asm swap functions, mixed shl+add inst.,
Add also xchg(x, y) func and XCHG(x, y) define in cuda_helper for later use...
other jh changes are mainly for the beauty of the code...
Signed-off-by: Tanguy Pruvot <tanguy.pruvot@gmail.com>