Tanguy Pruvot b9da6c67f5 improve jh512 with vectors (nist5,quark,sib,x11+,zr5)
The main improvement is to reduce the number of asm calls used to read global memory.
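For illustration only, here is a minimal sketch of how vector types can replace per-word reads of global memory. It is not the jh512 code from this commit: the kernel name `xor_hash_vec`, the `key` parameter, the host helper `vec_roundtrip_ok` and the 64-byte-per-thread layout are all assumptions made for the example.

```cuda
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (not the jh512 code): each thread owns one 64-byte hash.
__global__ void xor_hash_vec(uint32_t *hashes, uint32_t threads, uint32_t key)
{
    const uint32_t thread = blockDim.x * blockIdx.x + threadIdx.x;
    if (thread >= threads) return;

    // A 64-byte stride on a cudaMalloc'd buffer keeps every uint4 access 16-byte aligned.
    uint4 *h = reinterpret_cast<uint4 *>(hashes + 16u * thread);

    uint4 v[4];
    #pragma unroll
    for (int i = 0; i < 4; i++)
        v[i] = h[i];            // one 128-bit load instead of four 32-bit loads

    #pragma unroll
    for (int i = 0; i < 4; i++) {
        v[i].x ^= key; v[i].y ^= key; v[i].z ^= key; v[i].w ^= key;
        h[i] = v[i];            // one 128-bit store
    }
}

// Host-side check: run the kernel and compare against a scalar reference.
bool vec_roundtrip_ok()
{
    const uint32_t threads = 1024, key = 0x5a5a5a5au;
    std::vector<uint32_t> host(16u * threads);
    for (uint32_t i = 0; i < host.size(); i++) host[i] = i * 2654435761u;

    uint32_t *dev = nullptr;
    const size_t bytes = host.size() * sizeof(uint32_t);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    xor_hash_vec<<<threads / 256, 256>>>(dev, threads, key);

    std::vector<uint32_t> out(host.size());
    cudaError_t err = cudaMemcpy(out.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    for (uint32_t i = 0; i < host.size(); i++)
        if (out[i] != (host[i] ^ key)) return false;
    return true;
}
```

Whether the compiler actually emits v4 loads and stores depends on the toolkit version and on it being able to prove alignment, so the generated PTX is worth checking.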

However, a few more registers are used (68 minimum vs 64 on SM 5.2),
so relax the forced launch bounds to allow 80 or 128 registers per thread.
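As a rough sketch of how launch bounds map to a register budget: it assumes 256 threads per block and the 65536-register file of an SM 5.2 multiprocessor, neither of which is stated in this commit. The kernel `demo_kernel`, the `TPB` macro and the helper `regs_within_bound` are hypothetical names.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

#define TPB 256  // assumed threads per block, not taken from the commit

// Register cap per thread ~= 65536 / (TPB * minBlocksPerSM), rounded down
// to the allocation granularity:
//   minBlocks = 4 -> 65536 / 1024 = 64 regs
//   minBlocks = 3 -> 65536 /  768 = 85 -> 80 regs
//   minBlocks = 2 -> 65536 /  512 = 128 regs
__global__ void __launch_bounds__(TPB, 2)
demo_kernel(uint32_t *out, uint32_t threads)
{
    const uint32_t thread = blockDim.x * blockIdx.x + threadIdx.x;
    if (thread < threads) out[thread] = thread * 2654435761u;
}

// Ask the runtime how many registers the compiled kernel really uses.
bool regs_within_bound()
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, demo_kernel) != cudaSuccess) return false;
    return attr.numRegs <= 128;
}
```

Under these assumptions, lowering the second argument from 4 to 3 or 2 is what would give the kernel room for 68 registers without spilling.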

Note: CUDA 6.5 does not seem able to store with v4.u32... (7.5 is fine)
        st.global.v4.u32        [%rd2], {%r3783, %r3824, %r3823, %r3822};
        st.global.v2.u32        [%rd2+16], {%r3821, %r3820};
        st.global.u32   [%rd2+24], %r3819;
        st.global.u32   [%rd2+28], %r3818;
        st.global.u32   [%rd2+44], %r3814;
        st.global.u32   [%rd2+40], %r3815;
        ...

TODO: check the alexis variant, but I wanted to keep this code in git first.
2017-01-30 07:27:01 +01:00