14× faster embeddings: how we rebuilt the ONNX path in Manticore


When we shipped Auto Embeddings (the feature that turns any text column into a vector automatically, with no separate model service to run), the most common piece of feedback was about speed. The previous path went through SentenceTransformers on top of Candle, Hugging Face's pure-Rust ML inference runtime, and it left a lot of CPU on the floor: most workloads sat at roughly 5–11 docs/sec no matter how we fed them, and concurrent calls serialised on a single model session.

So we spent a few weeks rebuilding how Manticore runs ONNX models. The new ONNX Runtime backend shipped in Manticore Search 27.1.5. ONNX (Open Neural Network Exchange) is the portable model format in which most popular open-source embedding models (MiniLM, BGE, E5, and friends) are already published. The result is a backend that's ~14× faster on average than the previous SentenceTransformers/Candle path on the same hardware (an ordinary, inexpensive server with 16 cores / 32 threads), with the same model and the same weights, averaged over the full threads × batch workload grid, and that advantage holds whether you run 1 client thread or 32. The old path stayed in the 5–11 docs/sec range across the entire grid; the new one lives in the 70–230 docs/sec band.

This post is the engineering log: what we tried, what surprised us, what we threw away, and what the final design looks like.

TL;DR

- ~14× faster on average than the previous SentenceTransformers/Candle path, averaged across the full threads × batch workload grid (1 / 2 / 4 / 8 / 16 / 32 threads × batch sizes 1…128) on the same box (16 cores / 32 threads), same model, same weights.

- Released in Manticore Search 27.1.5, the new ONNX path is now the default fast path for any Hugging Face model that ships an .onnx file.

- On all-MiniLM-L12-v2, the old Candle path sat at 5–11 docs/sec across every configuration we tried. The new ONNX path lands in the 70–230 docs/sec range, and the same ~14× margin holds whether you run 1 client thread or 32.

- Single-insert latency on our test box: ~14 ms with a single client, ~56 ms under 8-way concurrent load — both well below the 200+ ms Candle was hitting.

- Want maximum bulk ingest throughput? Use a high batch size (32–128) on a single client thread. The new backend parallelises inside the call, so client-side fan-out just piles coordination overhead on top — peak on our box was 233 docs/sec at 1 thread + batch=64.

- The two changes that mattered most: turning intra_op_spinning off, and giving up on batching documents inside the worker.

- No user-facing API changes. A table that already points at an ONNX-capable MODEL_NAME picks up the new path automatically. Switching an existing table to a different model isn't a one-liner, because Manticore doesn't allow altering MODEL_NAME on a FLOAT_VECTOR field in place, but you don't have to recreate the whole table either: you can add a new column with the new model alongside, rebuild its embeddings, and drop the old one.
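The bulk-ingest advice above (one client thread, batch size 32–128) can be sketched on the client side as follows. This is a minimal illustration, not Manticore's tooling: the table name `products`, the column name `title`, and the helper names are all hypothetical, and the backslash-style string escaping is an assumption about the SQL dialect. The code only builds multi-row INSERT statements; sending them over a connection is left to whatever client library you use.

```python
from typing import Iterable, Iterator, List


def chunked(docs: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size documents."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    batch: List[str] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


def sql_quote(text: str) -> str:
    """Quote a string literal, escaping backslashes and single quotes.

    Assumption: the server accepts backslash escapes inside '...' literals.
    """
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def bulk_insert_statements(
    docs: Iterable[str],
    table: str = "products",   # hypothetical table name
    column: str = "title",     # hypothetical text column feeding the embedding
    batch_size: int = 64,      # the batch size that peaked on the post's test box
) -> Iterator[str]:
    """Yield one multi-row INSERT statement per batch of documents."""
    for batch in chunked(docs, batch_size):
        values = ", ".join(f"({sql_quote(doc)})" for doc in batch)
        yield f"INSERT INTO {table} ({column}) VALUES {values}"


if __name__ == "__main__":
    documents = [f"document number {i}" for i in range(130)]
    for statement in bulk_insert_statements(documents):
        # A real loader would execute each statement sequentially on ONE
        # connection, since the backend parallelises inside the call.
        print(statement[:60], "...")
```

Executing the statements one after another on a single connection matches the post's observation that client-side fan-out only adds coordination overhead. In production code, prefer your driver's parameter binding over hand-built string literals where it is available.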

Why this matters

With auto-embeddings, the database itself runs the model on every INSERT. That means embedding speed is INSERT speed: your ingest throughput is whatever the embedding step can sustain.
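To make the latency-to-throughput link concrete, here is a small back-of-the-envelope sketch. It assumes a closed loop in which every client issues its next INSERT as soon as the previous one returns, so throughput is simply concurrency divided by latency. The function name is hypothetical; the latency figures are the ones quoted in the TL;DR.

```python
def throughput_docs_per_sec(
    clients: int, latency_ms: float, rows_per_insert: int = 1
) -> float:
    """Estimate sustained docs/sec for a closed-loop workload.

    Assumption: each client sends its next INSERT immediately after the
    previous one completes, so throughput = clients * rows / latency.
    """
    if clients < 1 or rows_per_insert < 1:
        raise ValueError("clients and rows_per_insert must be >= 1")
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return clients * rows_per_insert * 1000.0 / latency_ms


if __name__ == "__main__":
    # Latency figures quoted in the post for single-row INSERTs.
    single_client = throughput_docs_per_sec(clients=1, latency_ms=14.0)
    eight_clients = throughput_docs_per_sec(clients=8, latency_ms=56.0)
    print(f"1 client  @ 14 ms: {single_client:.1f} docs/sec")
    print(f"8 clients @ 56 ms: {eight_clients:.1f} docs/sec")
```

Under that assumption, ~14 ms per single-row INSERT works out to about 71 docs/sec, in line with the 72 docs/sec reported for one thread, and 8 clients at ~56 ms works out to about 143 docs/sec, inside the 130–230 docs/sec band. The same arithmetic shows why the old path was stuck: at 200+ ms per insert, a single client cannot exceed 5 docs/sec.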

The old SentenceTransformers/Candle path left performance on the table. Concurrency hit lock contention, batched calls plateaued because of padding overhead, and between calls the runtime parked threads in ways that prevented the next call from picking up where the previous one left off. The headline symptom was simple: top would show the box well under full load no matter what you threw at it. The whole sweep (single-row INSERTs, 128-row bulk INSERTs, one client thread, thirty-two client threads) sat at 5–11 docs/sec, because nothing about how you fed it could buy you more CPU.

The new ONNX path raises the floor by an order of magnitude and gives users meaningful performance tuning options. A single-thread, single-row INSERT now lands at 72 docs/sec, already ~7× the old Candle ceiling. Add concurrency or batch size and it climbs into the 130–230 docs/sec range, with the top of the grid at 233 docs/sec on a single client thread at --batch-size=64. Averaged across the whole threads × batch


Original article: manticoresearch.com
