Some cards have 2 GPUs on board...
The core problem was the CUDA hefty threads-per-block value being set too high, but it took me several hours to find that... btw... +25% in heavy 12500 with 256 threads per block, vs 128 & 512, if max reg count is set to 80...
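For context on why block size and the register cap interact, here is a rough sketch of the register-limited occupancy arithmetic. The per-SM limits (65536 registers, 2048 resident threads) and the names SmLimits and resident_threads are assumptions for illustration, not values taken from this code base; real limits depend on the card, and the estimate ignores register allocation granularity, shared memory and per-SM block limits.

```cpp
#include <algorithm>

// Hypothetical per-SM limits; real values depend on the GPU's compute capability.
struct SmLimits {
    int registers;    // 32-bit registers available per SM
    int max_threads;  // maximum resident threads per SM
};

// Rough upper bound on resident threads per SM for a given block size and
// per-thread register cap. Whole blocks only: a block that does not fit
// entirely is not scheduled at all.
int resident_threads(const SmLimits& sm, int threads_per_block, int regs_per_thread) {
    int blocks_by_regs = sm.registers / (threads_per_block * regs_per_thread);
    int blocks_by_threads = sm.max_threads / threads_per_block;
    return std::min(blocks_by_regs, blocks_by_threads) * threads_per_block;
}
```

Under these assumed limits and 80 registers per thread, 128 and 256 threads per block both allow 768 resident threads, while 512 allows only one block (512 threads). That is only one possible factor and does not by itself account for the difference measured between 128 and 256.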
also add cudaDeviceReset() on Ctrl+C for nvprof
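
nvprof needs the device to be reset (or the process to exit cleanly) before it flushes its profile data, hence the reset on Ctrl+C. A minimal sketch of one way to wire that up follows; it is not necessarily how this commit does it. The names g_stop_requested, handle_sigint, install_sigint_handler and shutdown_device are hypothetical, and the sketch sets a flag in the handler and resets the device afterwards instead of calling into the CUDA runtime from inside the signal handler.

```cpp
#include <csignal>

#ifdef __CUDACC__
#include <cuda_runtime.h>
#endif

// Set by the signal handler, polled by the work loop (hypothetical name).
static volatile std::sig_atomic_t g_stop_requested = 0;

// Keep the handler minimal: only record that Ctrl+C was pressed.
extern "C" void handle_sigint(int) { g_stop_requested = 1; }

void install_sigint_handler() { std::signal(SIGINT, handle_sigint); }

// Call once the work loop has seen g_stop_requested and exited.
void shutdown_device() {
#ifdef __CUDACC__
    cudaDeviceReset();  // lets nvprof flush its buffered profile data
#endif
    // Without nvcc this is a no-op, so the sketch still builds on a plain host compiler.
}
```

The work loop would check g_stop_requested between kernel launches, break out, and then call shutdown_device() before returning from main.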