All algos can be freed properly now, except scrypt (reset) and lyra2 (leak)
Yes, it's a big commit, I was waiting for 1.6 to do that... Sorry for your possible merge issues ;)
The core problem was the CUDA hefty threads per block being set too high, but it took me several hours to find that... Btw, +25% in heavy 12500 with 256 threads per block, vs 128 & 512, if max reg count is set to 80...