Look like a stream synch problem, not related to cuda 7 headers or cudart
The threadfence() added doesnt changes performances, and could also
be related to the random cpu validation errors... so keep it for all.
Note: the 80-bytes variant used in skein2 doesn't seems affected.