Gemma4 with audio input: 16.8 tok/s on MacBook M2 Max 64GB
Here's the setup I decided on for embedding gemma4-12b into a Tauri2 desktop app:
- Native Rust FFI into llama.cpp via llama-cpp-2 (Metal enabled)
- Model: gemma-4-12b-it-Q5_K_S (quantized by Unsloth, Q5_K - Small)
- Audio input is a 607 KB 16-bit mono 16 kHz PCM WAV.
- Prompt path: Gemma chat template + llama.cpp mtmd (multimodal) audio marker, with the prompt "Transcribe this audio exactly."
- Benchmark input: 503 multimodal tokens/positions, including 486 audio tokens.
I'm not very well versed in model benchmarking, but it appears to give 16.8 tok/s end to end on the first inference, with the model already loaded. The total time breaks down as 2 s for audio/prefill plus 3.7 s for decode; decode alone runs at 26 tok/s.
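For anyone checking the numbers, here is a rough sketch of how the figures above relate to each other. The `RunTiming` struct and its methods are hypothetical names for illustration, not code from the repo, and the generated-token count is inferred from 26 tok/s × 3.7 s rather than measured, so the results are only approximate.

```rust
/// Timing of one inference run, split into prefill and decode phases.
/// Hypothetical helper for illustration; not taken from the repo.
struct RunTiming {
    prefill_secs: f64,
    decode_secs: f64,
    prompt_tokens: f64,
    generated_tokens: f64,
}

impl RunTiming {
    /// Prompt (text + audio) tokens processed per second during prefill.
    fn prefill_tok_per_sec(&self) -> f64 {
        self.prompt_tokens / self.prefill_secs
    }

    /// Generated tokens per second, counting decode time only.
    fn decode_tok_per_sec(&self) -> f64 {
        self.generated_tokens / self.decode_secs
    }

    /// Generated tokens per second over the whole request (prefill + decode).
    fn end_to_end_tok_per_sec(&self) -> f64 {
        self.generated_tokens / (self.prefill_secs + self.decode_secs)
    }
}

fn main() {
    let run = RunTiming {
        prefill_secs: 2.0,
        decode_secs: 3.7,
        prompt_tokens: 503.0,
        // Assumption: inferred from the reported 26 tok/s decode rate, not measured.
        generated_tokens: 26.0 * 3.7,
    };

    println!("prefill:    {:.1} tok/s", run.prefill_tok_per_sec());
    println!("decode:     {:.1} tok/s", run.decode_tok_per_sec());
    println!("end-to-end: {:.1} tok/s", run.end_to_end_tok_per_sec());
}
```

Under these assumptions, the end-to-end figure is lower than the decode figure only because the roughly 2 s of audio/prefill is counted in the denominator, so for a short transcription the prefill time has a large effect on the headline number.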
Does that seem like a reasonable level of performance? Any suggestions on how to speed up the inference?
Other approaches I considered:
- mlx-swift-lm, but I need audio support, and it doesn't seem to be supported (filed issue #393)
- llama-server in a sidecar, but managing its lifecycle seemed harder
- crabnebula-dev/tauri-plugin-llm, but I don't think it supports gemma4 (filed issue #22)
The inference and benchmark test code is available in this prototyping-only repo: https://github.com/tleyden/tauri2-local-llm