Are dSpark, dflash, MTP, QAT, and similar techniques going to speed up inference enough to make model spillover to disk more tolerable?
We’re seeing a lot of inference performance boosts lately from things like dSpark, dflash, MTP, etc. As I understand it, spillover to disk has always been the inflection point where a model goes from a barely acceptable 4 to 5 tokens per second to a completely unusable 0.5 tokens per second. Has this changed now?
Do these new speed boosters raise inference speed enough that spillover to disk isn’t as bad a performance hit as it was before? Are people now seeing barely acceptable performance with dSpark + disk spillover, or is the improvement too small to matter in this scenario? I don’t expect it to make enough of a difference to make spillover truly viable, but I’m curious what people who have tried these improvements with spillover are actually finding.
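For context, here is the rough back-of-envelope model behind my question. It is only a sketch under several assumptions that may not hold: decoding is purely bandwidth bound, every forward pass reads all the weights once (a dense model), and MTP-style or speculative methods simply yield some average number of accepted tokens per pass, with draft overhead ignored. The function name and every number are hypothetical placeholders, not measurements.

```python
def tokens_per_second(resident_gb, spilled_gb, ram_gbps, disk_gbps,
                      tokens_per_pass=1.0):
    """Estimate decode speed for a bandwidth-bound dense model.

    resident_gb     -- weights held in RAM/VRAM (GB), read at ram_gbps
    spilled_gb      -- weights read from disk on every pass (GB), at disk_gbps
    tokens_per_pass -- assumed average tokens accepted per forward pass
                       (1.0 = plain decoding; >1 for MTP/speculative methods)
    """
    if ram_gbps <= 0 or disk_gbps <= 0:
        raise ValueError("bandwidths must be positive")
    # Time for one forward pass = time to stream every weight once.
    seconds_per_pass = resident_gb / ram_gbps + spilled_gb / disk_gbps
    return tokens_per_pass / seconds_per_pass


# Hypothetical 40 GB model, 200 GB/s memory, 5 GB/s NVMe reads.
print(tokens_per_second(40, 0, 200, 5))        # everything in memory
print(tokens_per_second(30, 10, 200, 5))       # 10 GB spilled to disk
print(tokens_per_second(30, 10, 200, 5, 3.0))  # spilled, ~3 tokens per pass
```

With these made-up numbers, the fully resident case comes out around 5 tokens per second, the spilled case around 0.5, and the spilled case with about 3 accepted tokens per pass around 1.4. If the model is anywhere near right, the disk read still dominates each pass, which is why I’m asking whether real-world results look any better.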