2152fd102d
both merged and unmerged implementations are broken with CUDA 6.5. No perf changes...