I merged fixes for the quantized KV cache into my DeepSeek V4 branch.

Check it out: https://github.com/fairydreaming/llama.cpp/tree/dsv4

The fixes are PRs #25247, #25303 (mine) and #25202 (from am17an). I omitted some padding changes from the last one that I think are not necessary, so if it crashes for you, let me know.

You can now fit the antirez IQ2XXS model with 1M context on a single RTX PRO 6000 (q8_0 KV cache):

$ ./bin/llama-batched-bench -m ~/projects/ds4/gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf -b 2048 -ub 2048 -npl 1 -npp 2048,4096,8192,16384,32768,65536,131072,262144,524288,1048064 -ntg 128 -fa 1 --no-repack --cache-type-k q8_0 --cache-type-v q8_0

llama_batched_bench: n_kv_max = 1048576, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 32, n_threads_batch = 32

|      PP |  TG |  B |    N_KV |   T_PP s | S_PP t/s | T_TG s | S_TG t/s |      T s |   S t/s |
|---------|-----|----|---------|----------|----------|--------|----------|----------|---------|
|    2048 | 128 |  1 |    2176 |    1.144 |  1790.42 |  2.273 |    56.31 |    3.417 |  636.81 |
|    4096 | 128 |  1 |    4224 |    2.223 |  1842.66 |  2.253 |    56.81 |    4.476 |  943.66 |
|    8192 | 128 |  1 |    8320 |    4.600 |  1780.84 |  2.271 |    56.36 |    6.871 | 1210.88 |
|   16384 | 128 |  1 |   16512 |    9.817 |  1668.91 |  2.303 |    55.57 |   12.121 | 1362.30 |
|   32768 | 128 |  1 |   32896 |   21.909 |  1495.63 |  2.458 |    52.08 |   24.367 | 1350.03 |
|   65536 | 128 |  1 |   65664 |   53.104 |  1234.10 |  2.614 |    48.97 |   55.718 | 1178.50 |
|  131072 | 128 |  1 |  131200 |  141.960 |   923.30 |  2.942 |    43.50 |  144.902 |  905.44 |
|  262144 | 128 |  1 |  262272 |  421.537 |   621.88 |  3.602 |    35.54 |  425.139 |  616.91 |
|  524288 | 128 |  1 |  524416 | 1406.481 |   372.77 |  5.217 |    24.54 | 1411.698 |  371.48 |
| 1048064 | 128 |  1 | 1048192 | 5202.285 |   201.46 |  8.365 |    15.30 | 5210.650 |  201.16 |
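To make the slowdown with context length easier to read, here is a small sketch that takes the S_PP and S_TG columns from the benchmark output above and expresses each row relative to the shortest-context row. The variable and function names are my own, and the numbers are simply copied from the table, so treat it as a convenience calculation rather than part of the benchmark tool.

```python
# (N_KV, S_PP t/s, S_TG t/s) copied from the llama-batched-bench output above.
rows = [
    (2176, 1790.42, 56.31),
    (4224, 1842.66, 56.81),
    (8320, 1780.84, 56.36),
    (16512, 1668.91, 55.57),
    (32896, 1495.63, 52.08),
    (65664, 1234.10, 48.97),
    (131200, 923.30, 43.50),
    (262272, 621.88, 35.54),
    (524416, 372.77, 24.54),
    (1048192, 201.46, 15.30),
]


def slowdowns(rows):
    """Return (n_kv, pp_slowdown, tg_slowdown) relative to the first row."""
    _, base_pp, base_tg = rows[0]
    return [(n_kv, base_pp / s_pp, base_tg / s_tg) for n_kv, s_pp, s_tg in rows]


if __name__ == "__main__":
    for n_kv, pp_x, tg_x in slowdowns(rows):
        print(f"N_KV={n_kv:>8d}  PP slowdown={pp_x:5.2f}x  TG slowdown={tg_x:5.2f}x")
```

On these numbers, prompt processing at about 1M tokens is roughly 8.9x slower than at 2K, while token generation is roughly 3.7x slower.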

Also some perplexity values for different KV cache types:

f16:

$ ./bin/llama-perplexity -m ~/ggufs/DeepSeek-V4-Flash.gguf -f ../../perplexity/wikitext-2-raw/wiki.test.raw -c 8192 -b 8192 -ub 8192 -cmoe -fit off -fa 1
0.00.474.417 W llama_model_loader: tensor overrides to CPU are used with mmap enabled - consider using --no-mmap for better performance
0.10.392.053 I
0.10.392.174 I system_info: n_threads = 32 (n_threads_batch = 32) / 64 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.10.392.189 I perplexity: tokenizing the input ..
0.10.924.462 I perplexity: tokenization took 532.264 ms
0.10.924.610 I perplexity: calculating perplexity over 35 chunks, n_ctx=8192, batch_size=8192, n_seq=1
0.22.458.574 I perplexity: 11.53 seconds per pass - ETA 6.72 minutes
[1]2.8897,[2]2.7710,[3]3.1873,[4]3.6052,[5]3.4648,[6]3.5705,[7]3.7952,[8]3.6431,[9]3.5904,[10]3.5542,[11]3.5701,[12]3.6851,[13]3.7128,[14]3.6751,[15]3.7551,[16]3.7644,[17]3.7564,[18]3.8208,[19]3.8337,[20]3.8398,[21]3.8507,[22]3.8847,[23]3.9882,[24]4.0528,[25]3.9720,[26]3.9313,[27]3.9123,[28]3.9423,[29]3.9668,[30]3.9640,[31]3.9817,[32]3.9912,[33]3.9735,[34]4.0053,[35]4.0242,
6.22.639.632 I Final estimate: PPL = 4.0242 +/- 0.02400

Q8_0:

$ ./bin/llama-perplexity -m ~/ggufs/DeepSeek-V4-Flash.gguf -f ../../perplexity/wikitext-2-raw/wiki.test.raw -c 8192 -b 8192 -ub 8192 -cmoe -fit off -fa 1 --cache-type-k q8_0 --cache-type-v q8_0
0.00.485.802 W llama_model_loader: tensor overrides to CPU are used with mmap enabled - consider using --no-mmap for better performance
0.10.435.253 I
0.10.435.377 I system_info: n_threads = 32 (n_threads_batch = 32) / 64 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.10.435.393 I perplexity: tokenizing the input ..
0.10.961.804 I perplexity: tokenization took 526.402 ms
0.10.961.950 I perplexity: calculating perplexity over 35 chunks, n_ctx=8192, batch_size=8192, n_seq=1
0.22.521.970 I perplexity: 11.56 seconds per pass - ETA 6.73 minutes
[1]2.8842,[2]2.7793,[3]3.1950,[4]3.6124,[5]3.4653,[6]3.5701,[7]3.8000,[8]3.6448,[9]3.5878,[10]3.5534,[11]3.5690,[12]3.6869,[13]3.7161,[14]3.6800,[15]3.7580,[16]3.7656,[17]3.7574,[18]3.8241,[19]3.8383,[20]3.8468,[21]3.8580,[22]3.8934,[23]3.9956,[24]4.0581,[25]3.9765,[26]3.9371,[27]3.9186,[28]3.9494,[29]3.9749,[30]3.9716,[31]3.9896,[32]3.9993,[33]3.9832,[34]4.0122,[35]4.0304,
6.26.279.848 I Final estimate: PPL = 4.0304 +/- 0.02407

Q4_0:

$ ./bin/llama-perplexity -m ~/ggufs/DeepSeek-V4-Flash.gguf -f ../../perplexity/wikitext-2-raw/wiki.test.raw -c 8192 -b 8192 -ub 8192 -cmoe -fit off -fa 1 --cache-type-k q4_0 --cache-type-v q4_0
0.00.435.984 W llama_model_loader: tensor overrides to CPU are used with mmap enabled - consider using --no-mmap for better performance
0.10.360.658 I
0.10.360.777 I system_info: n_threads = 32 (n_threads_batch = 32) / 64 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.10.360.794 I perplexity: tokenizing the input ..
0.10.886.143 I perplexity: tokenization took 525.34 ms
0.10.886.291 I perplexity: calculating perplexity over 35 chunks, n_ctx=8192, batch_size=8192, n_seq=1
0.22.520.679 I perplexity: 11.63 seconds per pass - ETA 6.78 minutes
[1]3.0059,[2]2.8369,[3]3.2596,[4]3.6650,[5]3.5126,[6]3.6189,[7]3.8468,[8]3.6861,[9]3.6260,[10]3.5867,[11]3.5995,[12]3.7178,[13]3.7424,[14]3.7061,[15]3.7874,[16]3.7935,[17]3.7830,[18]3.8481,[19]3.8604,[20]3.8667,[21]3.8754,[22]3.9084,[23]4.0125,[24]4.0766,[25]3.9975,[26]3.9580,[27]3.9393,[28]3.9692,[29]3.9949,[30]3.9923,[31]4.0101,[32]4.0198,[33]4.0038,[34]4.0337,[35]4.0512,
6.28.034.177 I Final estimate: PPL = 4.0512 +/- 0.02420
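
For a quick side-by-side of the three final estimates, a minimal sketch like the following computes the absolute and relative perplexity increase of each quantized cache type over the f16 run. The values are copied from the "Final estimate" lines above and the helper name is my own. Comparing the delta with the reported +/- value is only a rough heuristic, since all three runs use the same text and their errors are correlated.

```python
# Final perplexity estimates and reported uncertainties from the three runs above.
results = {
    "f16": (4.0242, 0.02400),
    "q8_0": (4.0304, 0.02407),
    "q4_0": (4.0512, 0.02420),
}


def relative_increase(ppl, baseline):
    """Perplexity increase over the baseline, in percent."""
    return (ppl - baseline) / baseline * 100.0


if __name__ == "__main__":
    base_ppl, _ = results["f16"]
    for name, (ppl, err) in results.items():
        delta = ppl - base_ppl
        pct = relative_increase(ppl, base_ppl)
        # Rough check only: is the delta smaller than this run's reported +/- value?
        within = abs(delta) <= err
        print(f"{name:5s} PPL={ppl:.4f} delta={delta:+.4f} ({pct:+.2f}%) within +/-: {within}")
```

On these numbers, q8_0 comes out about 0.15% above f16 and q4_0 about 0.67% above.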
