DeepSeek-V4-Flash in MXFP4 is too slow on CPU

I have an old Xeon rig with 512 GB of 4-channel DDR4-2133 memory and an E5-2699 v4 processor. The GPU is a GTX 1060 with 6 GB of VRAM, so I run in CPU-only mode. I can run GLM 5.2 with 40B active parameters in Q4_K_XL at 1.8 t/s, but as you can imagine that is too slow. So I wanted to try the new Bartowski quantization of DeepSeek-V4-Flash, which has 13B active parameters, in MXFP4. Unfortunately, the most I can get is 3.2 t/s of token generation (tg), which is very disappointing. Judging by the GLM 5.2 speed I was expecting more than 5 t/s, but what I get looks as if I had only about 20 GB/s of memory bandwidth.
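For anyone who wants to check the arithmetic behind that expectation, here is a rough back-of-envelope sketch. It assumes token generation is purely memory-bandwidth bound and that each token reads every active weight exactly once. The bits-per-weight figures are assumptions: about 4.5 for Q4_K_XL (the real mix varies per tensor) and 4.25 for MXFP4 (4-bit values plus one 8-bit scale per 32-element block). KV cache reads and any tensors kept at higher precision are ignored, so treat the results as approximate.

```python
# Back-of-envelope estimate: treats token generation as memory-bandwidth bound.
# All bits-per-weight values below are assumptions, not measured numbers.

def peak_bandwidth_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (64-bit bus per channel)."""
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9

def gb_per_token(active_params_billion, bits_per_weight):
    """Approximate GB of weights read to generate one token."""
    return active_params_billion * bits_per_weight / 8

def effective_bandwidth_gbs(active_params_billion, bits_per_weight, tokens_per_s):
    """Bandwidth implied by an observed generation speed."""
    return gb_per_token(active_params_billion, bits_per_weight) * tokens_per_s

# 4-channel DDR4-2133, as in the post.
peak = peak_bandwidth_gbs(channels=4, mega_transfers_per_s=2133)

# GLM 5.2: 40B active, Q4_K_XL assumed ~4.5 bits/weight, 1.8 t/s observed.
glm_bw = effective_bandwidth_gbs(40, 4.5, 1.8)

# DeepSeek-V4-Flash: 13B active, MXFP4 assumed 4.25 bits/weight, 3.2 t/s observed.
ds_bw = effective_bandwidth_gbs(13, 4.25, 3.2)

# Speed DeepSeek-V4-Flash would reach at the bandwidth GLM 5.2 achieves.
ds_expected_tps = glm_bw / gb_per_token(13, 4.25)

print(f"theoretical peak:        {peak:.1f} GB/s")
print(f"GLM 5.2 effective:       {glm_bw:.1f} GB/s")
print(f"DeepSeek effective:      {ds_bw:.1f} GB/s")
print(f"DeepSeek expected speed: {ds_expected_tps:.1f} t/s")
```

Under these assumptions GLM 5.2 comes out at roughly 40 GB/s effective out of a theoretical peak of about 68 GB/s, while DeepSeek-V4-Flash reaches only about 22 GB/s. At GLM's effective bandwidth it would land a little under 6 t/s, which is where the "more than 5 t/s" expectation comes from.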

Am I right to blame the MXFP4 format for this miserable performance, and if so, where can I download Q4 quants of the model?

Update: I tried the antirez quant, since it supposedly uses a mixture of Q4/Q8, and the speed is the same or even a bit worse. My conclusion is that either llama is still very inefficient with the DeepSeek-V4 architecture, or the structure of the DeepSeek-V4 layers creates additional CPU bottlenecks.
