[Benchmark] Kimi K2.7 Code Q3 on Mac Studio M3 Ultra + RTX PRO 6000 over llama.cpp RPC: prefill improves, no changes in token generation/decode


I came across this interesting article: https://blog.exolabs.net/nvidia-dgx-spark/. I don't have a DGX Spark, but it made me curious: would this kind of architecture speed up my LLM setup?

A Mac can host large models, but its prefill speed sucks, so I tested the idea on my setup with Kimi K2.7.

Short answer: it helps prefill, but it does not meaningfully help decode on this setup. RPC is still mostly a capacity tool unless the network/interconnect and split mode are much better.

Setup

- Host: Mac Studio M3 Ultra, 512GB unified memory, Metal

- Worker: Linux box with NVIDIA RTX PRO 6000 Blackwell Workstation Edition, 96GB VRAM, CUDA

- Network: direct Ethernet between Mac and Linux box, but only 1GbE in practice

- Measured RPC transfer rate: about 112-113 MiB/s

- Model: unsloth/Kimi-K2.7-Code-GGUF, UD-Q3_K_XL

- Model size on disk: about 432GB across 11 GGUF shards

- Runtime: llama.cpp server version 9827 (4c6e0ff3a), Unsloth build
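To put the measured RPC transfer rate in context, here is a rough back-of-the-envelope sketch. It assumes "1GbE" means a raw line rate of 10^9 bit/s and ignores Ethernet/TCP framing overhead. The 90 GiB figure is a hypothetical payload picked for illustration (something that would fit in 96GB of VRAM), not a number from the test.

```python
MIB = 1024 ** 2  # bytes per MiB


def line_rate_mib_s(link_bits_per_s: float) -> float:
    """Raw line rate of a link in MiB/s, with no protocol overhead."""
    return link_bits_per_s / 8 / MIB


def transfer_seconds(size_gib: float, rate_mib_s: float) -> float:
    """Seconds needed to move size_gib GiB at a sustained rate in MiB/s."""
    return size_gib * 1024 / rate_mib_s


gbe = line_rate_mib_s(1e9)  # about 119.2 MiB/s for 1GbE

# Measured range from the setup list above.
for measured in (112.0, 113.0):
    print(f"{measured:.0f} MiB/s is {measured / gbe:.1%} of the 1GbE line rate")

# Hypothetical: time to push 90 GiB of tensors to the worker at ~112.5 MiB/s.
minutes = transfer_seconds(90, 112.5) / 60
print(f"90 GiB at 112.5 MiB/s takes about {minutes:.1f} minutes")
```

In other words, the link is already running close to what 1GbE can deliver, so a faster transfer rate would need a faster interconnect rather than tuning.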

Controlled test

Same synthetic prompt for both runs:

- Prompt tokens: 7120

- Generated tokens: 64

- temperature: 0

- ignore_eos: true

- Prompt cache disabled

Results:

- Prefill gain: about 14.8%

- Decode gain: about 4.2%

- Total request time improvement: about 12.3%
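For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of how the two runs could be driven and compared. It assumes the llama.cpp server's `/completion` endpoint and the `timings` object it returns (`prompt_per_second`, `predicted_per_second`, `prompt_ms`, `predicted_ms`). The helper names, the base URL, and the definition of "gain" as a relative throughput increase are my assumptions, not necessarily how the numbers above were computed.

```python
import json
import urllib.request


def build_payload(prompt: str, n_predict: int = 64) -> dict:
    """Request body mirroring the test settings listed above."""
    return {
        "prompt": prompt,
        "n_predict": n_predict,   # generated tokens
        "temperature": 0,
        "ignore_eos": True,
        "cache_prompt": False,    # prompt cache disabled
    }


def run_once(base_url: str, prompt: str) -> dict:
    """POST one completion request and return the server's timings dict."""
    req = urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["timings"]


def gain_pct(baseline: float, candidate: float) -> float:
    """Relative increase of candidate over baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


def compare(base_t: dict, rpc_t: dict) -> dict:
    """Throughput gains plus reduction in total request time, in percent."""
    base_total = base_t["prompt_ms"] + base_t["predicted_ms"]
    rpc_total = rpc_t["prompt_ms"] + rpc_t["predicted_ms"]
    return {
        "prefill_gain": gain_pct(base_t["prompt_per_second"],
                                 rpc_t["prompt_per_second"]),
        "decode_gain": gain_pct(base_t["predicted_per_second"],
                                rpc_t["predicted_per_second"]),
        "total_time_improvement": (1.0 - rpc_total / base_total) * 100.0,
    }


# Placeholder timings for illustration only, NOT the measured results.
base = {"prompt_per_second": 100.0, "predicted_per_second": 10.0,
        "prompt_ms": 10000.0, "predicted_ms": 2000.0}
rpc = {"prompt_per_second": 125.0, "predicted_per_second": 10.0,
       "prompt_ms": 8000.0, "predicted_ms": 2000.0}
print(compare(base, rpc))
```

In a real run, `run_once` would be called once against the Mac-only server and once against the server started with the RPC worker, using the same prompt both times, and the two returned dicts passed to `compare`.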

Split trend


Original post: reddit.com
