
DeepSeek-V4-Flash (MXFP4): compute buffer scales ~3x just from KV cache quant type (f16 vs q8_0) — anyone else seeing this? Llama.cpp


Bartowski's `DeepSeek-V4-Flash-MXFP4` GGUF, llama.cpp build 9851 (`0eca4d490`), `deepseek4` arch.

Ran the same `n_ctx = 10240`, same `n_ubatch = n_batch = 8192`, flash attention on. The only difference is `-ctk` / `-ctv`:

| Cache type | Total KV cache (CUDA0) | CUDA0 compute buffer |
|---|---|---|
| f16 (default, no `-ctk` set) | ~425 MiB | 12,964 MiB |
| q8_0 (`-ctk q8_0 -ctv q8_0`) | ~226 MiB | 3,973 MiB |

So switching the KV cache quant type only saves ~200 MB of actual cache (expected, since DSV4's compressed CSA/HCA/lightning-indexer caches are tiny either way), but it shaves ~9 GB off the compute buffer, a 3.26x difference, with literally nothing else changed.
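For anyone who wants to check the arithmetic or plug in their own log numbers, here is a minimal Python sketch. The figures are the ones reported above; the dictionary layout and variable names are just illustrative.

```python
# Buffer sizes in MiB, as reported for the two runs above.
runs = {
    "f16": {"kv_mib": 425, "compute_mib": 12964},
    "q8_0": {"kv_mib": 226, "compute_mib": 3973},
}

# Absolute savings from switching the cache type to q8_0.
kv_saved_mib = runs["f16"]["kv_mib"] - runs["q8_0"]["kv_mib"]
compute_saved_mib = runs["f16"]["compute_mib"] - runs["q8_0"]["compute_mib"]

# Relative size of the f16 compute buffer versus the q8_0 one.
compute_ratio = runs["f16"]["compute_mib"] / runs["q8_0"]["compute_mib"]

print(f"KV cache saved:       {kv_saved_mib} MiB")
print(f"Compute buffer saved: {compute_saved_mib} MiB ({compute_saved_mib / 1024:.2f} GiB)")
print(f"Compute buffer ratio: {compute_ratio:.2f}x")
```

Swap in the sizes from your own load log to see whether your ratio lands near the same value.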

This is what was actually causing my OOM at higher context (35.9GB compute buffer requested at ctx=32000 with f16 cache, on a 32GB card). Once I forced q8_0 cache, it loads fine.

Does forcing `-ctk q8_0 -ctv q8_0` cut your compute buffer by a similar ~3x?
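If you want to reproduce the A/B test, a small Python helper like the one below can generate two command lines that differ only in the cache type flags. This is a sketch under assumptions: the post does not say which binary was used, so `llama-server` is a guess; `model.gguf` is a placeholder path; and `-fa on` follows recent llama.cpp builds (older builds take a bare `-fa`). It only prints the commands and does not launch anything.

```python
import shlex

def build_cmd(model_path, cache_type=None, n_ctx=10240, n_batch=8192):
    """Build an argv list for one run; cache_type=None keeps the f16 default."""
    argv = [
        "llama-server",          # assumption: the post does not name the binary
        "-m", model_path,
        "-c", str(n_ctx),        # context size
        "-b", str(n_batch),      # logical batch size
        "-ub", str(n_batch),     # physical (micro) batch size, same as -b here
        "-fa", "on",             # flash attention; older builds use a bare -fa
    ]
    if cache_type is not None:
        # Quantize both the K and V caches to the same type.
        argv += ["-ctk", cache_type, "-ctv", cache_type]
    return argv

MODEL = "model.gguf"  # hypothetical placeholder path

baseline = build_cmd(MODEL)           # f16 cache (default)
quantized = build_cmd(MODEL, "q8_0")  # q8_0 cache

print(shlex.join(baseline))
print(shlex.join(quantized))
```

Run each printed command and compare the compute buffer size each one reports at load time.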
