Gemma4 with audio input: 16.8 tok/s on MacBook M2 Max 64GB

Here's the setup I decided on for embedding gemma4-12b into a Tauri2 desktop app:

- Native Rust FFI into llama.cpp via llama-cpp-2 (Metal enabled)

- Model: gemma-4-12b-it-Q5_K_S, quantized by Unsloth (Q5_K, Small)

- Audio input is a 607 KB 16-bit mono 16 kHz PCM WAV.

- Prompt path: Gemma chat template + llama.cpp mtmd (multimodal) audio marker, with the prompt "Transcribe this audio exactly."

- Benchmark input: 503 multimodal tokens/positions, including 486 audio tokens.
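For a sense of scale, the clip length can be estimated from the file size alone. This is a rough sketch, assuming "607 KB" means about 607,000 bytes and that the file has a canonical 44-byte WAV header (neither is stated above); `pcm_duration_secs` is a hypothetical helper, not something from the repo.

```rust
/// Estimate the duration of an uncompressed PCM WAV from its size on disk.
/// `header_bytes` is an assumption: 44 is the canonical minimal WAV header.
fn pcm_duration_secs(
    file_bytes: u64,
    header_bytes: u64,
    sample_rate: u32,
    channels: u16,
    bits_per_sample: u16,
) -> f64 {
    // Bytes of audio data produced per second of sound.
    let bytes_per_sec =
        sample_rate as f64 * channels as f64 * (bits_per_sample as f64 / 8.0);
    file_bytes.saturating_sub(header_bytes) as f64 / bytes_per_sec
}

fn main() {
    // Assumed: 607 KB ~= 607_000 bytes; 16-bit mono at 16 kHz as listed above.
    let secs = pcm_duration_secs(607_000, 44, 16_000, 1, 16);
    let audio_tokens = 486.0; // from the benchmark input above
    println!("clip length: about {:.1} s", secs);
    println!("audio tokens per second of audio: about {:.1}", audio_tokens / secs);
}
```

Under those assumptions the clip is roughly 19 seconds long, so the 486 audio tokens work out to somewhere around 25 tokens per second of audio.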

I'm not very well versed in model benchmarking, but it appears to give 16.8 tok/s end to end on the first inference, with the model already loaded. That total breaks down as about 2 s for audio encoding/prefill plus 3.7 s for decode; decode alone runs at 26 tok/s.
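The figures above are consistent with each other, which can be checked with a little arithmetic. This is a sketch only: the generated-token count is not reported above, so it is inferred here from the decode rate and decode time, and the function names are hypothetical.

```rust
/// Tokens generated during decode, inferred from rate and duration.
fn generated_tokens(decode_rate: f64, decode_secs: f64) -> f64 {
    decode_rate * decode_secs
}

/// End-to-end rate: generated tokens divided by prefill plus decode time.
fn end_to_end_rate(decode_rate: f64, decode_secs: f64, prefill_secs: f64) -> f64 {
    generated_tokens(decode_rate, decode_secs) / (prefill_secs + decode_secs)
}

fn main() {
    let prefill_secs = 2.0; // audio encoding + prompt prefill
    let decode_secs = 3.7;
    let decode_rate = 26.0; // tok/s during decode only
    let prompt_tokens = 503.0; // multimodal tokens/positions

    println!(
        "generated tokens (inferred): about {:.0}",
        generated_tokens(decode_rate, decode_secs)
    );
    println!(
        "end-to-end rate: about {:.1} tok/s",
        end_to_end_rate(decode_rate, decode_secs, prefill_secs)
    );
    println!(
        "prefill rate: about {:.0} positions/s",
        prompt_tokens / prefill_secs
    );
}
```

So roughly 96 tokens were generated, and about a third of the wall-clock time went to audio encoding and prefill rather than decode. The 16.8 tok/s headline number therefore depends heavily on output length: a longer transcript would amortize the fixed prefill cost and approach the 26 tok/s decode rate.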

Does that seem like a reasonable level of performance? Any suggestions on how to speed up the inference?

Other approaches I considered:

- mlx-swift-lm, but I need audio support, and it doesn't seem to have it (filed issue #393)

- llama-server in a sidecar, but managing its lifecycle seemed harder

- crabnebula-dev/tauri-plugin-llm, but I don't think it supports gemma4 (filed issue #22)

The inference and benchmark test code is available in this prototyping-only repo: https://github.com/tleyden/tauri2-local-llm
