Mostly for compatibility tests; SM 2.1 support is very limited
SM 3.0 code should run on SM 3.5 (only a few cards use this arch)
As I can't test SM 3.5, it's best to let users run their own tests...
Slightly reduces the 750 Ti speed but greatly improves the 9xx speed.
Keep compat for SM 3.0/3.5 in a second file.
Note: With this code and CUDA 7.5, the speed gain is the reverse...
May be "reverted" soon
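
Rough sketch of how the split between the main file and the SM 3.0/3.5 compat file could be selected on the host side. This is only an illustration: the enum, the function names and the exact thresholds are hypothetical and not taken from the actual code. The major/minor values are assumed to come from cudaDeviceProp (fields major and minor).

```cpp
// Hypothetical sketch: names and thresholds are illustrative assumptions,
// not the project's real dispatch code.
enum class KernelPath {
    Legacy,   // SM 2.x: very limited support, mostly for compatibility tests
    Compat,   // SM 3.0/3.5: code kept in the second (compat) file
    Maxwell   // SM 5.0+ (750 Ti, 9xx): the main implementation
};

// Encode compute capability as one integer, e.g. 3.5 -> 350, 5.2 -> 520.
constexpr int sm_version(int cc_major, int cc_minor) {
    return cc_major * 100 + cc_minor * 10;
}

// Pick the implementation for a device, given its compute capability
// (assumed to be read from cudaDeviceProp::major and cudaDeviceProp::minor).
constexpr KernelPath select_kernel_path(int cc_major, int cc_minor) {
    return sm_version(cc_major, cc_minor) >= 500 ? KernelPath::Maxwell
         : sm_version(cc_major, cc_minor) >= 300 ? KernelPath::Compat
         : KernelPath::Legacy;
}
```

With this layout a 750 Ti (SM 5.0) and a 9xx card (SM 5.2) both take the main path, while SM 3.0 and SM 3.5 cards share the compat file, so SM 3.5 users can still run their own tests.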