Tip: use this llama.cpp PR to improve prompt processing (PP) speed on Intel ARC

https://github.com/ggml-org/llama.cpp/pull/25222

Another win for Intel ARC users (all 4 of us). The community keeps improving llama.cpp for Intel ARC. This time, the hero behind that pull request (with the help of Claude) improved prompt processing speed by roughly 1.9x. For comparison, I have a B580 and a 116k-context conversation. It used to take 510 seconds to process everything from scratch, at 245 t/s; now it takes 262 seconds, at 462 t/s.

Model: Qwen3.6 35B A3B Q5_K_XL

./llama-server --host 0.0.0.0 --port 8080 --model /models/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf --jinja --threads 8 --ctx-size 262144 --cache-ram 0 --parallel 1 --temperature 0.0 --top-p 0.2 --top-k 20 --no-mmap --spec-type draft-mtp --spec-draft-n-max 3 --batch-size 2700 --ubatch-size 2700 --n-gpu-layers 99 --n-cpu-moe 99

The only catch is that it only applies to an F16 KV cache for now, but the contributor said he will work on other quants later.
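For anyone who wants to sanity-check the numbers, here is a minimal Python sketch that works out the speedup from the figures quoted above. The variable and function names are my own, and it assumes the reported t/s held steady for the whole run, which may not be exactly true.

```python
# Figures reported in the post (B580, Qwen3.6 35B A3B Q5_K_XL, F16 KV cache).
before_seconds, before_tps = 510, 245
after_seconds, after_tps = 262, 462


def implied_tokens(seconds: float, tokens_per_second: float) -> float:
    """Tokens processed if the reported rate held for the whole run (an assumption)."""
    return seconds * tokens_per_second


# Speedup measured two ways: by wall-clock time and by throughput.
time_speedup = before_seconds / after_seconds
rate_speedup = after_tps / before_tps

print(f"wall-clock speedup: {time_speedup:.2f}x")
print(f"throughput speedup: {rate_speedup:.2f}x")
print(f"implied prompt size before: {implied_tokens(before_seconds, before_tps):,.0f} tokens")
print(f"implied prompt size after:  {implied_tokens(after_seconds, after_tps):,.0f} tokens")
```

Both ways of measuring land a little under 2x. The implied prompt sizes come out somewhat above 116k tokens, so the quoted figures are best read as rounded.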

You see, Intel's hardware is very capable of doing great things, and each contribution by the community and Intel brings us closer to achieving the full speed of the hardware.
