Will dSpark, dflash, MTP, QAT, and similar tech increase inference speed enough to make model spillover to disk more tolerable?

We’re seeing a lot of inference performance boosts lately from things like dSpark, dflash, MTP, etc. Spillover to disk has always been the inflection point where a model goes from a maybe barely acceptable 4 to 5 tokens per second to a completely unusable 0.5 tokens per second. Has this changed now?

Do these new speed boosters push inference speed to the point where spillover to disk isn’t as bad a performance hit as it was before? Are people now seeing barely acceptable performance using dSpark + disk spillover? Or is the improvement too small to matter in this scenario? I have no illusions that it would make enough of a difference to make spillover truly viable, but I’m wondering what people who have tried these new improvements with spillover are finding.
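To put rough numbers on the question, here is a back-of-envelope sketch in Python. It is not a benchmark. It assumes that a speculative or multi-token method shows up as some average number of tokens accepted per full pass over the weights, and that this multiplier carries over unchanged once weights are streamed from disk, which may not hold in practice. The function names and the `overhead_fraction` parameter are made up for illustration. QAT is not modeled, since it changes how much of the model spills rather than how many tokens each pass yields.

```python
def required_speedup(baseline_tps: float, target_tps: float) -> float:
    """Multiplier needed to lift baseline_tps up to target_tps."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return target_tps / baseline_tps


def effective_tps(baseline_tps: float,
                  tokens_per_pass: float,
                  overhead_fraction: float = 0.0) -> float:
    """Estimated tokens/sec when each full weight pass yields several tokens.

    baseline_tps      -- tokens/sec with one token per pass (e.g. 0.5 after spillover)
    tokens_per_pass   -- assumed average tokens accepted per pass (hypothetical input)
    overhead_fraction -- assumed extra time per pass for drafting/verification,
                         as a fraction of the baseline pass time
    """
    if overhead_fraction < 0:
        raise ValueError("overhead_fraction must be non-negative")
    return baseline_tps * tokens_per_pass / (1.0 + overhead_fraction)


if __name__ == "__main__":
    spilled = 0.5     # tokens/sec after spillover, from the post
    acceptable = 4.0  # low end of "barely acceptable", from the post
    print("speedup needed:", required_speedup(spilled, acceptable))
    for k in (2.0, 3.0, 4.0):  # assumed acceptance rates, not measurements
        print(k, "tokens/pass ->", effective_tps(spilled, k), "tok/s")
```

Under these assumptions, getting from 0.5 back up to 4 tokens per second takes a sustained 8x multiplier, so the question is really whether anyone sees anything close to that with spillover in play.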
