Perplexity AI’s research team reimplemented their Unigram tokenizer from scratch in Rust and open-sourced the code in pplx-garden, their inference technology repository.
At production input lengths, the new encoder cuts p50 latency by roughly 5x versus the Hugging Face tokenizers crate, ~2x versus SentencePiece (C++), and ~1.5x versus IREE’s tokenizer (C), with zero steady-state heap allocations. In production, it reduced CPU utilization in Perplexity’s inference stack by 5-6x and shaved double-digit milliseconds off reranker latency.
Why Tokenization Became a Bottleneck
LLM inference cost is typically framed around GPU work: KV caches, attention kernels, expert routing. But smaller models, such as embedding models, classifiers, and rerankers, tell a different story. These models are two to three orders of magnitude smaller than frontier transformers.
A reranker scoring hundreds of candidate documents per request is a clear example. With a small model, GPU compute often finishes in single-digit milliseconds. Every input still passes through CPU-side tokenization first. When batch sizes are large, tokenization becomes a meaningful fraction of total request latency.
Perplexity’s work targets XLM-RoBERTa, a model with a 250K-token Unigram vocabulary trained with SentencePiece. Fine-tuned RoBERTa-family encoders are a common production choice for ranking, retrieval, and similarity tasks.
What is Unigram Tokenization?
Unigram tokenization was introduced by Kudo in 2018 and is implemented in SentencePiece. It frames segmentation as a most-probable-path problem. Each vocabulary token has a learned log-probability. The tokenizer picks the segmentation whose token scores sum to the highest value.
The algorithm used to find that best path is the Viterbi algorithm, a dynamic programming technique from 1967. Byte positions form graph layers and vocabulary tokens are edges spanning a contiguous byte range. The DP recurrence iterates over byte positions and updates the best-scoring path at each position.
The outer loop runs in linear time relative to input length. The inner loop walks a vocabulary trie (a prefix tree structure) at each byte position. On a 16K-token input, this inner walk executes hundreds of thousands of trie transitions. It is the hot path.
What was Slow in the Hugging Face Implementation
The Hugging Face tokenizers crate is the default Rust tokenizer most teams reach for. Perplexity used it as the benchmark reference. At 514 tokens (512 + BOS/EOS injection), the reference implementation had three costly patterns:
BottleneckMechanismMeasured impactAllocation per matchString::from_utf8 + AHashMap lookup per trie match7,295 allocations at 514 tokens; 299,171 at 16KPointer chase per byteAHashMap at every trie node; 4 dependent loads per byte stepDependent-load latency dominates the hot pathL2 thrashing on long inputsDP table and output buffers freshly allocated each callL2 miss rate climbs from 8% at 128 tokens to 50% at 16K
Per-token allocation is constant: roughly 2 KB and ~18 allocations per token, regardless of input size. The latency problem becomes severe at longer inputs when cumulative allocations overflow the per-core L2 cache.
Establishing a Baseline Before Changing the Trie
Before switching the trie structure, Perplexity first isolated how much cost came from unnecessary work alone. They made a zero-allocation port of the reference: same HashMap trie, but with a caller-owned scratch struct reused across calls and token IDs stored directly in trie nodes (removing the per-match string allocation and secondary hash-map lookup).
This baseline already cut p50 latency to 155 µs at 514 tokens, down from 326 µs in the reference. Instructions retired dropped 2.4x. The remaining cost was the HashMap pointer chase itself, which the next step addressed.
The Three Optimizations
Optimization 1: Double-Array Trie
The Hugging Face trie stores children in a HashMap at every node. Each byte step requires a hash computation, two pointer dereferences, and a heap access. Perplexity replaced this with a double-array trie, the same structure used by SentencePiece and IREE, originally introduced by Aoe in 1989.
A double-array trie encodes the entire trie in two flat integer arrays, base and check. A child lookup is: next = base[node] + byte, then verify check[next] == node. That is two array reads, one integer add, and one comparison, with no hashing and no pointer chasing. For XLM-RoBERTa’s 250K vocab, the whole trie fits in ~9 MB of contiguous memory. The hot working set per encode is on the order of 100 KB, which fits in L2 cache.
Unlike SentencePiece and IREE, which are general-purpose libraries with lattice bookkeeping and multi-stage pipelines, Perplexity inlined the trie directly in the Viterbi loop and dropped that overhead entirely.
Result at 514 tokens: p50 dropped from 155 µs (zero-allocation baseline) to 68 µs. Wall-clock fell 4.8x from the original reference.
Optimization 2: Bitmap and Inline Packing
The double-array trie still requires two dependent array loads per byte step: first the parent’s base offset, then the check array to confirm the transition is valid. Perplexity replaced the check array with a per-node bitmap (four 64-bit words, 32 bytes) that records which of the 256 possible bytes have valid child transitions.
A bitmap lookup compiles to a single bit test against one 64-bit word. The check array is used only during trie construction and dropped from the runtime layout entirely.
They also packed all four per-node fields (bitmap, base, token ID, and score) into a single 64-byte cache line, matching CPU cache line width exactly. One trie step now loads a single cache line covering the bitmap for the next-byte check, the base offset for the child slot, and the token ID and score at terminal nodes.
Trade-off: trie size grows from ~9 MB to ~50 MB (780K nodes x 64 bytes). The hot working set per encode remains ~100 KB.
Result at 514 tokens: Additional 4.5% wall-clock reduction. L2 accesses dropped from 4.6K to 1.8K per encode.
Optimization 3: Huge Pages for the Trie
At 50 MB, the trie spans roughly 12,000 virtual pages on a default Linux system using 4 KB pages. The first-level data TLB on Intel Sapphire Rapids holds 96 entries. Each Viterbi step touches a different trie node, so TLB misses accumulate. Over a 512-token encode, Perplexity estimated roughly 9,000 cycles spent in page-table walks, about 3% of per-encode budget.
Perplexity backed the trie with 2 MB huge pages via mmap with the MAP_HUGETLB flag. The same 50 MB now spans 25 pages, well within the TLB. This requires vm.nr_hugepages configured at boot. In production, 10,561 huge pages are reserved; the trie uses 24.
Result: 3-12% wall-clock reduction depending on input length. The largest gain is at 4,098 tokens (-12.0%), where page-table traffic was actively competing with trie data for L2 bandwidth. Beyond 4K tokens the gain shrinks because L3 misses dominate.
Final Benchmark Results
All measurements are single-threaded, pinned to one core on an Intel Xeon Platinum 8488C, with 10,000 iterations after 1,000 warmup rounds. At 514 tokens:
Enginep50 LatencyInstructionsAllocationsHugging Face (tokenizers crate)349 µs3.60M7,295SentencePiece (C++)128 µs1.83M1,559IREE tokenizer (C)112 µs2.28M1Perplexity (final, all 3 optimizations)~63 µs1.04M0
Across the full optimization sequence, instructions per encode fell from 3.66M to 1.04M, a 3.5x reduction. Wall-clock matches that ratio at short inputs and widens at long inputs where the reference’s per-token allocations overflow L2 and L3.
One additional finding: off-the-shelf Rust wrapper crates around SentencePiece and IREE add 1.6-1.9x latency overhead compared to the native C/C++ binaries. The sentencepiece crate allocates a fresh list of token pieces on each call. The overhead is measurable but amortizes at long inputs.
The final Perplexity encoder produces token-exact output against the reference. In production, it uses rayon to parallelize across cores.
Marktechpost’s Visual Explainer
Key Takeaways
Perplexity rebuilt their Unigram tokenizer targeting XLM-RoBERTa’s 250K-token SentencePiece vocabulary
The new encoder achieves zero steady-state heap allocations and ~63 µs p50 at 514 tokens
Three optimizations: double-array trie, bitmap + 64-byte cache-line packing, and 2 MB huge pages for the trie
Intermediate result: a zero-allocation HashMap port alone cut p50 from 326 µs to 155 µs before the trie was changed
Production impact: 5-6x CPU utilization reduction and double-digit ms reduction in reranker latency
Check out the Repo and Technical details. Also, feel free to follow us on Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us
The post Perplexity AI Open-Sources Unigram Tokenizer That Achieves 5x Lower p50 Latency Than Hugging Face tokenizers Crate appeared first on MarkTechPost.