Source: reddit.com

Qwen 3.6 27B - vLLM Performance Benchmark Results (BF16, FP8, NVFP4)


Sharing some testing of Qwen 3.6 27B using vLLM across the popular quants on my development system. I used llama benchy to generate the results, then fed them into an LLM to format the tables for readability.

While NVFP4 is blazing fast, I have had looping issues in Copilot that I don't get with BF16, and its responses in agent mode generally seem less thorough than those from the higher-precision quants. Based on these results, FP8 seems to be the right choice. I'm sure some of the parameters can be tuned further for better performance, but all of these were plenty fast enough for coding purposes.

I used to use llama.cpp, but have found that vLLM is faster in practice (due to paged attention) and more stable (llama.cpp would frequently give me random errors, requiring me to reset the prompt or restart the service).

If you have any comments or suggestions for improvement, let me know.

Test System:

Motherboard: ASUS ProArt Z890

CPU: Intel 270K plus

RAM: 96 GB DDR5 (6000 MHz)

GPU: RTX 6000 Pro Blackwell 96GB (Max-Q, ECC enabled)

Software:

OS: Ubuntu 26.04 LTS (x86_64)

Python version: 3.12.13

vLLM Version: 0.24.0

NVIDIA-SMI 595.71.05

CUDA Version: 13.2

Models:

Qwen 3.6 27B - BF16 and FP8 (HF Qwen)

Qwen 3.6 27B - NVFP4 (HF Nvidia)

* Replaced the Jinja chat templates shipped with the models with the fixed chat template
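
As a rough back-of-the-envelope check on why all three variants fit on a single 96 GB card, the weight footprint can be estimated from the bits stored per weight. This is only a sketch under stated assumptions: exactly 27 billion parameters, every weight stored in the named format, and about 4.5 effective bits per weight for NVFP4 (4-bit values plus one 8-bit scale per block of 16). It ignores KV cache, activations, and any layers a checkpoint keeps in higher precision. `weight_gb` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Estimate weight memory in GB (1 GB = 1e9 bytes).
# $1 = parameter count in billions, $2 = effective bits per weight.
weight_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Assumption: 27 billion parameters, all stored in the named format.
echo "BF16:  $(weight_gb 27 16) GB"
echo "FP8:   $(weight_gb 27 8) GB"
# Assumption: 4 bits per value plus an 8-bit scale shared by 16 values.
echo "NVFP4: $(weight_gb 27 4.5) GB"
```

Under these assumptions the weights alone come to roughly 54 GB, 27 GB, and 15 GB, so the lower-precision variants leave considerably more of the 96 GB for KV cache at long context lengths.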

vLLM Parameters:

GPU_COUNT="1"

MAX_LEN="262144"

export VLLM_USE_DEEP_GEMM=0

export FLASHINFER_MAX_NUM_TOKENS=8192
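
The post does not show the launch command itself, so the following is only a sketch of how these variables might feed a `vllm serve` call. The mapping of `GPU_COUNT` to `--tensor-parallel-size` and of `MAX_LEN` to `--max-model-len` is an assumption, `MODEL_ID` is a placeholder rather than the actual repository name, and `build_vllm_cmd` is a hypothetical helper. The script prints the command instead of running it, so the mapping can be checked first.

```shell
#!/usr/bin/env bash
# Placeholder: the post does not give the exact model repository name.
MODEL_ID="your-org/your-qwen-checkpoint"

# Values copied from the post.
GPU_COUNT="1"
MAX_LEN="262144"
export VLLM_USE_DEEP_GEMM=0
export FLASHINFER_MAX_NUM_TOKENS=8192

# Assumed mapping: GPU_COUNT -> --tensor-parallel-size,
#                  MAX_LEN   -> --max-model-len.
build_vllm_cmd() {
  printf 'vllm serve %s --tensor-parallel-size %s --max-model-len %s\n' \
    "$MODEL_ID" "$GPU_COUNT" "$MAX_LEN"
}

# Dry run: print the command rather than starting the server.
build_vllm_cmd
```

To actually start the server, the printed line could be pasted into the same shell session, so that the exported environment variables are still in effect.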
