Model-ID Stem Analysis: quality / speed / size rankings and combinability
A companion to the Model-ID glossary. This report analyzes the
architectural and quantization stems (Q4_K_M, Q8_0, NVFP4, MXFP8, BF16, FP8,
MLX, QAT, MTP, MoE active-params, distill, YaRN, LASER, MatFormer, etc.) and ranks them on:
- Quality - generation quality, instruction following, reasoning/math/code fidelity.
- Speed - TTFT (time to first token = prefill latency) and TPS (tokens/sec = decode throughput). These two move differently, which is why some rankings split.
- Size - on disk and in RAM/VRAM.
Each is ranked across Mac (Apple Silicon), Linux (NVIDIA), and WSL2 (NVIDIA).
Plus a combination matrix: which stems stack and which are mutually exclusive (e.g. can I have QAT with Q4_K_M? NVFP4 with QAT and MLX on Ollama?).
Figures below are drawn from published 2026 benchmarks (linked inline and in the Bibliography). Absolute throughput and the exact ordering depend on the model, context length, batch size, GPU generation, and runtime version; where the order flips, the governing factor is called out.
Contents
- Model-ID Stem Analysis: quality / speed / size rankings and combinability
- Contents
- 1. The one idea that makes this tractable: three orthogonal axes
- 2. Why TTFT and TPS rank differently
- 3. Platform reality: what can even run where
- 4. Axis A - numeric encodings ranked
- 5. Axis B - runtime / container: MLX vs GGUF
- 6. Axis C - build-time techniques ranked
- 7. Per-stem reference cards
- 8. Combination matrix: what stacks and what is mutually exclusive
- 9. Worked answers to the example questions
- 10. Where the ranking depends on other factors
- Verification pass (v2)
- Bibliography
1. The one idea that makes this tractable: three orthogonal axes
Every stem in a model tag belongs to exactly one of three independent axes. You pick one value from Axis A and one from Axis B, and stack any number from Axis C:
| Axis | What it is | Pick how many | Stems |
|---|---|---|---|
| A. Numeric encoding | how the weights are stored (bits + scheme) | exactly one per tensor group | F16/FP16, BF16, FP8, INT8/INT4, Q8_0, Q6_K, Q5_K_M/Q5_K_S/Q5_0/Q5_1, Q4_K_M/Q4_K_S/Q4_0/Q4_1, Q3_K_L/Q3_K_M/Q3_K_S, Q2_K, NVFP4, MXFP4, MXFP8 |
| B. Runtime / container | what executes the weights | exactly one at run time | MLX (Apple), GGUF/llama.cpp (cross-platform), vendor GPU stacks (TensorRT/vLLM) |
| C. Build-time technique | how the model was trained/derived | any number | QAT, MTP, MoE/A22B, distill, DPO, LASER, YaRN/gradient, MatFormer (E2B/E4B), abliteration/uncensored |
The single most common confusion - “can I combine Q4_K_M and NVFP4?” - dissolves here: both
are Axis A, so no, you choose one. But QAT (Axis C) + Q4_K_M (Axis A) + GGUF (Axis B)?
Yes - that is exactly what a Gemma QAT GGUF is.
(Google Gemma 4 QAT, Ollama goes MLX)
2. Why TTFT and TPS rank differently
- Prefill (reading your prompt, producing the first token) is compute-bound - big matrix-matrix multiplies, high arithmetic intensity. It sets TTFT.
- Decode (each subsequent token) is memory-bandwidth-bound - matrix-vector ops that repeatedly stream the weights and a growing KV cache. It sets TPS.
Consequence: quantization mainly buys TPS, not TTFT, because it shrinks bytes moved per token during the bandwidth-bound decode phase; it does little for the compute-bound prefill. A format can therefore win on TPS and lose on TTFT (this is exactly the MLX-vs-llama.cpp story below). (Towards Data Science - prefill compute-bound, decode memory-bound, Redis - Prefill vs Decode, Why LLM inference is memory-bound)
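The bandwidth-bound argument can be turned into a rough ceiling: if every decoded token has to stream all the weights once, TPS cannot exceed memory bandwidth divided by weight bytes. The sketch below assumes a dense model at batch size 1 and ignores KV-cache traffic; the function name, the 8B parameter count, and the 1000 GB/s bandwidth are illustrative assumptions, not figures from this report.

```python
def decode_tps_ceiling(params: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode tokens/sec when each token streams all weights once."""
    weight_bytes = params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes


# Hypothetical 8B dense model on a device with 1000 GB/s of memory bandwidth.
# The bits-per-weight values are the approximate ones used in this report.
for name, bpw in [("Q4_K_M", 4.5), ("Q8_0", 8.5), ("F16", 16.0)]:
    ceiling = decode_tps_ceiling(8e9, bpw, 1000)
    print(f"{name}: at most {ceiling:.0f} tok/s")
```

Measured throughput lands below this ceiling; what the sketch illustrates is the ratio between formats. Prefill sees no such gain because the weights are reused across all prompt tokens rather than re-streamed per token.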
Rules of thumb used throughout:
- Disk size (bytes) ≈ bits-per-weight × parameters / 8. RAM/VRAM ≈ disk + KV cache + runtime overhead.
- Lower bits ⇒ smaller + faster decode (higher TPS), at some quality cost.
- TTFT is governed far more by runtime, batching, Flash-Attention, and GPU compute than by the weight format.
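The size rule of thumb can be sketched as a small memory budget. The KV-cache term uses the standard count of two tensors (keys and values) per layer; the model shape (32 layers, 8 KV heads, head dimension 128), the FP16 cache, and the 1 GB overhead are assumptions chosen for illustration, not values from this report.

```python
def weights_gb(params: float, bits_per_weight: float) -> float:
    """Weight size in GB: bits-per-weight x parameters / 8."""
    return params * bits_per_weight / 8 / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache in GB: keys and values for every layer and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem / 1e9


def resident_gb(params: float, bits_per_weight: float, kv_gb: float,
                overhead_gb: float = 1.0) -> float:
    """RAM/VRAM estimate: weights + KV cache + an assumed fixed runtime overhead."""
    return weights_gb(params, bits_per_weight) + kv_gb + overhead_gb


# Hypothetical 7B model at ~4.5 bits per weight with an 8K-token FP16 cache.
kv = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=8192)
print(f"weights  {weights_gb(7e9, 4.5):.2f} GB")
print(f"kv cache {kv:.2f} GB")
print(f"resident {resident_gb(7e9, 4.5, kv):.2f} GB")
```

The KV term grows linearly with context length, which is why the long-context rows later in the report behave differently from the short-context ones.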
3. Platform reality: what can even run where
Before ranking, note that some Axis-A/Axis-B choices simply don’t exist on some platforms. Native = dedicated hardware path (fast); emulated = upcast/dequantize in software (works, not fast).
| Stem | Mac (Apple Silicon) | Linux (NVIDIA) | WSL2 (NVIDIA) | Pure CPU |
|---|---|---|---|---|
| GGUF Q2_K…Q8_0, F16 | Native (Metal) | Native (CUDA) | Native (CUDA) | Yes |
| BF16 | Yes (Metal/CPU) | Native | Native | Yes (slow) |
| FP8 | Native on M5/A19 GPU Neural Accelerators; emulated on M1-M4 | Native on Hopper+ | Native on Hopper+ | No |
| NVFP4 | Ollama 0.19 MLX backend; on M5 the GPU Neural Accelerators accelerate the matmul (native FP8/INT4 paths); no dedicated NVFP4 tensor unit; M1-M4 storage/bandwidth only | Native only on Blackwell; emulated on Hopper/Ampere | Native only on Blackwell | No |
| MXFP4 / MXFP8 | M5 GPU Neural Accelerators (native FP8/INT4); M1-M4 emulated | Native on Blackwell-class; else emulated | Native on Blackwell-class | No |
| MLX (runtime) | Apple-silicon only | N/A | N/A | N/A |
| GGUF runtime (llama.cpp/Ollama) | Yes | Yes | Yes | Yes |
Key platform facts:
- MLX is Mac-only. There is no MLX on Linux or WSL. (ml-explore/mlx)
- NVFP4/MXFP4 are Blackwell-native on NVIDIA. On Hopper/Ampere they are emulated; FP8 needs Hopper+. NVFP4 is merged in llama.cpp (GGML_TYPE_NVFP4 = 40, kernels landed late Mar-Apr 2026; Blackwell tensor-core dispatch in PR #22196); non-Blackwell NVIDIA cards get the memory savings only. MXFP4 (OCP variant) lives in ik_llama.cpp. (NVIDIA Introducing NVFP4, Spheron FP4 on Blackwell, llama.cpp PR #19769 (NVFP4), InsiderLLM FP4 in llama.cpp)
- On Apple Silicon, low precision is real on M5, emulated before it. Ollama 0.19 (preview, released 2026-03-30) rebuilds the Mac stack on Apple’s MLX framework and uses NVFP4 for 4-bit, while keeping llama.cpp for Linux/Windows. M5 / M5 Pro / M5 Max (and A19) GPUs add GPU Neural Accelerators - matrix units that natively run FP8 and INT4 - so on M5 the 4-bit path is a genuine compute win (Ollama measured ~2x: e.g. M5 Max + Qwen3.5-35B-A3B NVFP4, prefill 1,154→1,810 tok/s, decode 58→112 tok/s), not just a bandwidth win. There is still no Apple equivalent of NVIDIA’s dedicated NVFP4 tensor unit, and on M1-M4 (no matrix accelerators) the FP4/FP8 benefit is storage/bandwidth only. (Ollama blog - now powered by MLX, MacRumors - Ollama faster on Macs, Apple ML Research - LLMs with MLX on M5, tzakharko - A19/M5 Neural Accelerators benchmark)
- Linux vs WSL2: with NVIDIA GPU passthrough, WSL2 CUDA compute is near-native; the practical gaps are slightly higher model-load/disk-I/O latency and occasional driver/VRAM-reporting quirks. Compute-bound TTFT and decode TPS are within a few percent of bare Linux; treat them as equivalent for ranking, with WSL2 a touch behind on cold-start/load and on very large models near the VRAM limit.
- Pure CPU (any OS): only GGUF int/float formats; lower bits help most because CPU memory bandwidth is the binding constraint. (Markaicode CPU benchmark)
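The platform facts above can be encoded as a small lookup, which is handy when a script needs to decide whether a low-precision format will give a compute speedup or only save memory. This is a sketch: the hardware keys, the level names, and the `has_compute_win` helper are labels invented here, and only the FP8 and NVFP4 rows of the table are transcribed.

```python
# Support levels transcribed from the platform table and key facts above.
SUPPORT = {
    "FP8": {
        "apple_m5": "native",
        "apple_m1_m4": "emulated",
        "nvidia_blackwell": "native",
        "nvidia_hopper": "native",
        "cpu": "none",
    },
    "NVFP4": {
        # M5 runs the matmul on its native FP8/INT4 paths; there is no
        # dedicated NVFP4 unit, so this is labeled "accelerated", not "native".
        "apple_m5": "accelerated",
        "apple_m1_m4": "storage_only",
        "nvidia_blackwell": "native",
        "nvidia_hopper": "emulated",
        "nvidia_ampere": "emulated",
        "cpu": "none",
    },
}


def has_compute_win(fmt: str, hardware: str) -> bool:
    """True when the format has a hardware matmul path on this device."""
    level = SUPPORT.get(fmt, {}).get(hardware, "unknown")
    return level in {"native", "accelerated"}


for hw in ("nvidia_blackwell", "nvidia_hopper", "apple_m5", "apple_m1_m4"):
    print(f"NVFP4 on {hw}: compute win = {has_compute_win('NVFP4', hw)}")
```

Unknown formats or devices fall through to `False`, which is the conservative default for a ranking decision.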
4. Axis A - numeric encodings ranked
All else equal (same model, same runtime). “Quality” = closeness to FP16 baseline.
4.1 Quality (best → worst)
F16/BF16 ≈ Q8_0 > Q6_K > Q5_K_M > Q5_K_S > Q4_K_M > Q4_K_S > Q4_0/Q4_1 > Q3_K_L > Q3_K_M > Q3_K_S > Q2_K
- Q8_0 is effectively lossless; Q6_K is ~+2% over a Q4_K_M baseline, Q5_K_M ~+1.5%, and Q4_K_M retains ~92-95% of FP16 quality - the standard “sweet spot.” Below Q4, quality degrades fast, and Q2_K is noticeably worse (use only when nothing else fits). (RunAIHome Q4/Q5/Q6/Q8 quality loss, WillItRunAI quant guide)
- k-quant (_K_) beats legacy (_0/_1) at equal bit-width because of the super-block scale structure; within a level, _M > _S (more tensors kept at higher precision). _1 ≥ _0 (asymmetric: scale + min). (llama.cpp discussion #2094)
- FP4 family for quality: NVFP4 > MXFP4 at the same 4 bits (~88% lower quantization error; block-16 + FP8 scale vs block-32 + power-of-two scale). NVFP4 can match/slightly beat FP8 on some reasoning tasks when paired with FP8/BF16 attention (mixed precision), not as pure FP4. (Edge-AI-Vision NVFP4 impact, iFactory FP4 vs FP8 vs FP16)
- Rough quality placement of the GPU formats among the GGUF ladder: BF16 ≈ FP16 > FP8 ≈ Q8_0 > Q6_K > NVFP4 ≳ Q5_K_M ≳ MXFP8 (4-bit-ish use) ≳ Q4_K_M > MXFP4. (NVFP4 lands near a high-quality 4-5 bit GGUF; MXFP4 near a plain 4-bit.)
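Why a smaller block with a finer scale quantizes better can be seen in a toy simulation. The sketch below rounds Gaussian values to the 4-bit E2M1 magnitude grid under two block-scaling setups: 16-element blocks with an unrounded real-valued scale (standing in for NVFP4's FP8 scale) and 32-element blocks with a power-of-two scale (standing in for MXFP4). It is a simplification: neither format's actual scale-rounding rule is reproduced, the data is synthetic, and it is not meant to recover the ~88% figure cited above.

```python
import math
import random

# Magnitudes representable by a 4-bit E2M1 float (sign is handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block, scale):
    """Divide by the block scale, snap to the nearest grid point, rescale."""
    out = []
    for x in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def block_mse(values, block_size, power_of_two_scale):
    """Mean squared quantization error with one scale per block."""
    err = 0.0
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        amax = max(abs(x) for x in block)
        if amax == 0.0:
            continue  # an all-zero block quantizes exactly
        scale = amax / 6.0  # map the block maximum onto the largest grid value
        if power_of_two_scale:
            # Round the scale up to a power of two so nothing clips.
            scale = 2.0 ** math.ceil(math.log2(scale))
        deq = quantize_block(block, scale)
        err += sum((a - b) ** 2 for a, b in zip(block, deq))
    return err / len(values)


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]
mse_nvfp4_like = block_mse(weights, block_size=16, power_of_two_scale=False)
mse_mxfp4_like = block_mse(weights, block_size=32, power_of_two_scale=True)
print(f"block-16, real-valued scale: {mse_nvfp4_like:.5f}")
print(f"block-32, power-of-two scale: {mse_mxfp4_like:.5f}")
```

Two effects stack here: a 16-element block usually has a smaller maximum than the 32-element block containing it, and rounding the scale up to a power of two can coarsen the grid by up to 2x.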
4.2 TPS / decode throughput (fastest → slowest)
On a Blackwell GPU, FP4 has dedicated 4x-throughput tensor cores, so:
NVFP4 ≈ MXFP4 > FP8/MXFP8 > BF16/F16 (Blackwell: ~4x / ~2x / ~1x stair)
On hardware without an FP4 tensor unit (non-Blackwell NVIDIA, CPU, and pre-M5 Apple) decode is purely bandwidth-bound and TPS tracks fewer bits = faster:
Q2_K > Q3_K_* > Q4_0/Q4_K_S > Q4_K_M > Q5_K_* > Q6_K > Q8_0 > F16/BF16
The fuller picture on Apple Metal (correcting a too-broad statement that earlier lumped all Apple GPUs in as “no FP4 tensor cores, bandwidth-only”): M5 / A19 GPUs add GPU Neural Accelerators that execute FP8 and INT4 natively, so on M5-class Macs low-bit decode gets a real compute speedup (not just a bandwidth win), and the order above understates how fast 4-bit/8-bit run there. Apple still has no dedicated NVFP4 unit like Blackwell’s, so NVFP4 specifically leans on those native FP8/INT4 paths plus memory savings; M1-M4 Macs have no matrix accelerators at all and remain purely bandwidth-bound as the ladder shows. (Apple ML Research - LLMs with MLX on M5, tzakharko - A19/M5 Neural Accelerators benchmark)
Example magnitudes: RTX 4090, Llama-3.2-8B - Q4_K_M 112 tok/s vs Q8_0 83 tok/s (+35%); CPU 12-thread - Q4_K_M 14.2 vs Q5_K_M 8.5 vs Q8_0 6.8 tok/s. (Markaicode Ollama quant benchmark, dasroot GGUF quality vs speed)
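The percentages follow directly from the quoted throughputs. A short check, using only the numbers in the paragraph above plus the report's approximate bits-per-weight values; the helper name is made up for this sketch.

```python
def speedup_pct(fast: float, slow: float) -> float:
    """Relative throughput gain of `fast` over `slow`, in percent."""
    return (fast / slow - 1.0) * 100.0


gpu_tps = {"Q4_K_M": 112.0, "Q8_0": 83.0}                 # RTX 4090 figures quoted above
cpu_tps = {"Q4_K_M": 14.2, "Q5_K_M": 8.5, "Q8_0": 6.8}    # 12-thread CPU figures quoted above

gpu_gain = speedup_pct(gpu_tps["Q4_K_M"], gpu_tps["Q8_0"])
cpu_ratio = cpu_tps["Q4_K_M"] / cpu_tps["Q8_0"]
bytes_ratio = 8.5 / 4.5  # approximate Q8_0 vs Q4_K_M bits per weight

print(f"GPU: Q4_K_M is {gpu_gain:.0f}% faster than Q8_0")
print(f"CPU: Q4_K_M is {cpu_ratio:.2f}x Q8_0")
print(f"Bytes-per-token ratio Q8_0/Q4_K_M: {bytes_ratio:.2f}x")
```

On these figures the CPU ratio sits close to the bytes ratio while the GPU gain is well below it, which fits the earlier point that lower bits help most where memory bandwidth is the binding constraint.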
4.3 TTFT (best → worst)
Weight format has only a second-order effect on TTFT (prefill is compute-bound). The first-order levers are GPU compute class and runtime. That said:
- On Blackwell, NVFP4/FP8 also give the best TTFT because the prefill matmuls run on the faster low-precision tensor cores. (Edge-AI-Vision)
- On Mac/CPU/non-Blackwell, TTFT is roughly format-insensitive; a lower-bit model is not meaningfully faster to first token, and can even be marginally slower if it must dequantize.
4.4 Size on disk and in RAM (smallest → largest)
Tracks bits-per-weight directly:
Q2_K(~2.6) < Q3_K_S/M/L(~3.4) < NVFP4/MXFP4(~4.0-4.25) < Q4_0/Q4_K_S(~4.3) < Q4_K_M(~4.5)
< Q5_K_*(~5.5) < Q6_K(~6.5) < MXFP8/FP8(~8) ≈ Q8_0(~8.5) < BF16/F16(16)
7B reference: FP16 ~13.5 GB → Q4_K_M ~4.1 GB → Q2_K ~2.8 GB. RAM/VRAM ≈ disk + KV cache + overhead (RTX 4090: Q4_K_M 5.8 GB resident vs Q8_0 9.1 GB). On Apple unified memory, “RAM” is the budget for both weights and KV cache. (Vucense GGUF sizes)
5. Axis B - runtime / container: MLX vs GGUF
Same weights, different engine. Mac only (the only platform where you actually choose).
| Metric | MLX | GGUF (llama.cpp / Ollama) | Who wins |
|---|---|---|---|
| Decode TPS | ~1.4-1.8x raw llama.cpp; the lead grows with context (KV stays in unified memory) until it reverses past ~30-40K (see the long-context row) | baseline | MLX for sustained generation |
| TTFT / prefill | often slower | faster (Metal + Flash-Attention) | GGUF for short prompts / snappy first token |
| Long context (30K+) | KV-cache-efficient, but decode runs ~50% slower than llama.cpp+FlashAttn because MLX’s attention kernel is not yet IO-aware (FlashAttention-style); an open mlx-lm issue tracks adding it | strong with Flash-Attention | GGUF at long context |
| Ecosystem & portability | Apple-only; needs MLX conversion | universal GGUF; runs everywhere | GGUF |
| Quantization scheme | MLX’s own (e.g. 4-bit MLX, mlx-bf16) | GGUF Q*/F16 | n/a (different schemes) |
Net: on a Mac, MLX for max TPS on long generations; GGUF/Ollama for best TTFT, portability, and the widest model selection. Ollama’s overhead means “MLX is 3x faster” claims are mostly Ollama-vs-MLX, not llama.cpp-vs-MLX. (Towards AI MLX vs llama.cpp, Ante Kapetanovic Qwen3.5 Apple Silicon, yage.ai MLX vs llama.cpp, Contra Collective GGUF vs MLX)
6. Axis C - build-time techniques ranked
These travel with the weights and are largely platform-independent. They do not change the numeric format; they change quality, or (for MoE/MTP) the speed/size math.
| Stem | Quality effect | TPS effect | TTFT effect | Size effect | Notes |
|---|---|---|---|---|---|
| QAT | + (recovers up to ~70% of quant loss; +1-3% GPQA/MMLU-Pro vs PTQ) | none (same format) | none | none | makes a low-bit model behave like a higher-bit one |
| MTP | neutral | ++ (~1.8x via speculative decode, runtime-permitting) | slight + | tiny (extra head) | needs runtime support to realize the speedup |
| MoE / A22B | high quality per active-FLOP | ++ (decode runs at active params, e.g. 22B) | + | −− (RAM/disk = total params, e.g. 235B) | great speed/quality, heavy memory |
| distill | slight − vs teacher, big + vs same-size base | + (smaller student) | + | + (smaller) | how small reasoning models get R1-like skills |
| DPO | + (alignment/preference) | none | none | none | post-training preference alignment |
| LASER | + on targeted tasks (can also regress) | tiny + (low-rank) | none | tiny − | SVD rank-reduction on select layers |
| YaRN/gradient | neutral at short ctx; enables long ctx | − at long ctx (more KV) | − at long ctx | KV cache grows | context-window extension |
| MatFormer E2B/E4B | E4B > E2B | E2B faster | E2B faster | stored > effective (PLE) | pick a nested submodel size |
| uncensored/abliteration | removes refusals; may slightly dent benchmark quality | none | none | none | behavior change, not size |
Sources: Unsloth QAT, NVIDIA QAT accuracy recovery, Sebastian Raschka MTP, Qwen3 Technical Report (MoE), EmergentMind DeepSeek-R1 distilled, YaRN (arXiv), HF MatFormer in Gemma 3n, LASER (arXiv).
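The MoE row's split between speed and size can be put in numbers. A minimal sketch for a 235B-total / 22B-active model, assuming ~4.5 bits per weight (a Q4_K_M-like encoding, chosen here for illustration) and ignoring KV cache and runtime overhead.

```python
def gb(params: float, bits_per_weight: float) -> float:
    """Weight bytes in GB for a given parameter count and encoding."""
    return params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 235e9   # must be resident in RAM/VRAM
ACTIVE_PARAMS = 22e9   # weights actually read per decoded token
BPW = 4.5              # assumed encoding, roughly Q4_K_M

resident_gb = gb(TOTAL_PARAMS, BPW)
streamed_gb = gb(ACTIVE_PARAMS, BPW)

print(f"memory needed:      {resident_gb:.1f} GB")
print(f"streamed per token: {streamed_gb:.1f} GB")
print(f"ratio:              {resident_gb / streamed_gb:.1f}x")
```

Decode cost follows the streamed figure while the memory requirement follows the resident one, which is the sense in which the table marks MoE as ++ on TPS and −− on size.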
7. Per-stem reference cards
Compact card per stem: Axis, quality, TPS, TTFT, size, best platform(s).
Numeric encodings (Axis A)
- F16/FP16 - 16-bit float. Quality: baseline (100%). TPS: slowest. TTFT: neutral. Size: largest (16 bpw). Platform: all. Use as the quality reference / for further quantization.
- BF16 - 16-bit, FP32 exponent range. Quality ≈ FP16, more training-stable. Size 16 bpw. Platform: all (native on modern GPU; mlx-bf16 on Mac).
- FP8 (E4M3/E5M2) - 8-bit float. Quality ≈ Q8_0. TPS ~2x BF16 on Hopper/Blackwell. Size ~8 bpw. Platform: NVIDIA Hopper+ (Linux/WSL) and Apple M5/A19; emulated elsewhere.
- Q8_0 - 8-bit GGUF. Quality near-lossless (~103% of Q4_K_M). TPS slow-ish. Size ~8.5 bpw. Platform: all. Use when you have memory and want max fidelity locally.
- Q6_K - 6-bit k-quant. Quality ~102% of Q4_K_M. Good high-fidelity/size compromise. Platform: all.
- Q5_K_M/Q5_K_S - 5-bit k-quant (_M keeps more high-precision tensors). Quality ~101.5% of Q4_K_M. Recommended for code/math when you have 12GB+. Platform: all.
- Q5_0/Q5_1 - legacy 5-bit (_1 asymmetric, slightly better). Superseded by Q5_K_*.
- Q4_K_M - 4-bit k-quant, the default sweet spot: ~92-95% of FP16 quality, ~4.5 bpw, big TPS win. Platform: all. The single best “just give me one” choice for local.
- Q4_K_S - smaller/slightly lower-quality Q4 k-quant. Platform: all.
- Q4_0/Q4_1 - legacy 4-bit; the canonical QAT export target (Gemma QAT = Q4_0). Quality below Q4_K_M at equal bits unless QAT’d. Platform: all.
- Q3_K_L/M/S - 3-bit k-quant; visible quality loss; for tight memory. Platform: all.
- Q2_K - 2-bit; largest quality hit; last resort to fit. Platform: all.
- NVFP4 - NVIDIA 4-bit float, block-16 + FP8 scale. Quality best-in-class for 4-bit (near FP8 with mixed attention). TPS/TTFT: fastest on Blackwell (4x stair); emulated on other NVIDIA GPUs. Size ~4 bpw. Platform: Blackwell GPU natively; Mac via MLX backend (compute win on M5, bandwidth win only on M1-M4).
- MXFP4 - OCP 4-bit float, block-32 + power-of-two scale. Quality below NVFP4. Speed like NVFP4 on supporting HW. Size ~4 bpw. Platform: Blackwell-class.
- MXFP8 - OCP 8-bit microscaling (E4M3/E5M2). Quality ≈ FP8. Size ~8 bpw. Platform: Hopper/Blackwell-class; emulated elsewhere.
- INT8/INT4 - uniform integer quant. Simpler than k-quants; INT8 ≈ Q8 quality, INT4 below k-quant Q4 unless QAT’d. Platform: broad (TensorRT/vendor stacks, some GGUF).
Runtime (Axis B)
- MLX - Apple-silicon runtime. Best sustained TPS on Mac, weaker TTFT, Apple-only. See §5.
Build-time techniques (Axis C)
- QAT, MTP, MoE/A22B, distill, DPO, LASER, YaRN/gradient, MatFormer E2B/E4B, abliteration - see the table in §6.
8. Combination matrix: what stacks and what is mutually exclusive
Legend: ✅ stacks · ❌ mutually exclusive · ➖ same axis, pick one · ⚠️ works but platform/notes apply.
| | GGUF Q* | NVFP4/MXFP* | FP8 | BF16/F16 | MLX runtime | QAT | MTP | MoE | distill/DPO/LASER/YaRN |
|---|---|---|---|---|---|---|---|---|---|
| GGUF Q* | ➖ | ➖ | ➖ | ➖ | ⚠️¹ | ✅ | ✅ | ✅ | ✅ |
| NVFP4/MXFP* | ➖ | ➖ | ⚠️² | ⚠️² | ⚠️³ | ✅ | ✅ | ✅ | ✅ |
| FP8 | ➖ | ⚠️² | ➖ | ⚠️² | ⚠️⁴ | ✅ | ✅ | ✅ | ✅ |
| BF16/F16 | ➖ | ⚠️² | ⚠️² | ➖ | ✅ (mlx-bf16) | ✅ | ✅ | ✅ | ✅ |
| MLX runtime | ⚠️¹ | ⚠️³ | ⚠️⁴ | ✅ | n/a | ✅ | ✅ | ✅ | ✅ |
| QAT | ✅ | ✅ | ✅ | ✅ | ✅ | n/a | ✅ | ✅ | ✅ |
| MTP | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | n/a | ✅ | ✅ |
| MoE | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | n/a | ✅ |
| distill/DPO/LASER/YaRN | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ (mutually stackable) |
Footnotes:
1. GGUF Q* + MLX: different quant schemes. MLX doesn’t run a Q4_K_M GGUF; it uses its own 4-bit MLX quant. So “Q4_K_M on MLX” is not a thing - convert the model to MLX quant instead. (GGUF runs on llama.cpp/Ollama; MLX runs MLX-format weights.)
2. Mixing two numeric formats in one model is normal at the boundary level, not the tensor level: e.g. NVFP4 weights + FP8 or BF16 attention/KV is a recommended Blackwell config. You still pick one format per tensor group; you don’t double-encode a tensor.
3. NVFP4/MXFP* + MLX: ✅ via the Ollama 0.19 MLX backend. On M5-class Macs the GPU Neural Accelerators run the 4-bit matmul natively (FP8/INT4 paths), so it is a real compute win (~2x in Ollama’s tests), though there is no Apple equivalent of NVIDIA’s dedicated NVFP4 tensor unit. On M1-M4 the benefit is size/bandwidth only.
4. FP8 + MLX: ✅ on M5/A19 GPUs, whose Neural Accelerators execute FP8 natively; on M1-M4 FP8 is emulated. (Separately, NVIDIA’s Hopper/Blackwell FP8 tensor-core path is its own track.)
The governing rule: one Axis-A encoding (per tensor group) + one Axis-B runtime + any stack of Axis-C techniques. Everything labeled ➖ or ❌ above is an attempt to pick two from an axis that only allows one, or to run a format on a runtime that can’t execute it.
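The governing rule is mechanical enough to check in code. The sketch below treats the whole model as a single tensor group (so the mixed-precision case in footnote 2 is out of scope), covers only a subset of the stems, and encodes footnote 1 as the one runtime-compatibility check; the set names and `check_combo` are hypothetical helpers, not part of any tool.

```python
AXIS_A = {"F16", "BF16", "FP8", "Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q4_0",
          "NVFP4", "MXFP4", "MXFP8"}
GGUF_ONLY = {"Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q4_0"}  # footnote 1: not runnable on MLX
AXIS_B = {"MLX", "GGUF"}
AXIS_C = {"QAT", "MTP", "MOE", "DISTILL", "DPO", "LASER", "YARN"}


def check_combo(stems):
    """Return a list of problems; an empty list means the combination is allowed."""
    stems = [s.upper() for s in stems]
    problems = []

    unknown = [s for s in stems if s not in AXIS_A | AXIS_B | AXIS_C]
    if unknown:
        problems.append(f"unknown stems: {unknown}")

    encodings = [s for s in stems if s in AXIS_A]
    runtimes = [s for s in stems if s in AXIS_B]
    if len(encodings) != 1:
        problems.append(f"pick exactly one Axis-A encoding, got {encodings}")
    if len(runtimes) != 1:
        problems.append(f"pick exactly one Axis-B runtime, got {runtimes}")

    if runtimes == ["MLX"] and any(e in GGUF_ONLY for e in encodings):
        problems.append("GGUF quants do not run on MLX; convert to an MLX quant")

    return problems  # Axis-C stems stack freely, so they are never a problem


print(check_combo(["QAT", "Q4_K_M", "GGUF"]))      # allowed
print(check_combo(["Q4_K_M", "NVFP4", "GGUF"]))    # two Axis-A encodings
print(check_combo(["NVFP4", "QAT", "MLX"]))        # allowed
```

Returning a list of problems instead of a boolean keeps the reason visible, which mirrors how the matrix pairs each non-✅ cell with a footnote.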
9. Worked answers to the example questions
- “Can I have QAT with Q4_K_M?” ✅ Yes. QAT (Axis C) is orthogonal to the format. You QAT-train the model, then export to a GGUF quant. In practice Google ships QAT as Q4_0; exporting the same QAT checkpoint to Q4_K_M is fine and slightly higher quality than Q4_0. Net: near-FP16 quality at 4-bit size/speed. (Google Gemma 4 QAT, Unsloth QAT)
- “Can I have NVFP4 with QAT and MLX on Ollama?” Yes, with hardware caveats.
  - NVFP4 + QAT: ✅ - QAT and its distillation variant (quantization-aware distillation, QAD) are used to recover NVFP4 accuracy. (QAD for NVFP4, arXiv)
  - + MLX on Ollama: ✅ - Ollama 0.19 (preview, 2026-03-30) runs NVFP4 weights through its MLX backend on Apple Silicon. On M5 / M5 Pro / M5 Max the GPU Neural Accelerators speed up both TTFT and decode (Ollama: M5 Max + Qwen3.5-35B-A3B NVFP4, prefill 1,154→1,810 tok/s, decode 58→112 tok/s, ~2x). It is not equivalent to the Blackwell path - Apple has no dedicated NVFP4 tensor unit, and on M1-M4 the gain is memory/bandwidth only - but on M5 it is a real compute win, and a Blackwell NVIDIA GPU on Linux/WSL remains the fastest NVFP4 path overall. (Ollama blog - now powered by MLX, Apple ML Research - LLMs with MLX on M5, Spheron FP4 on Blackwell)
  - NVFP4 + MXFP8 + Q4_K_M together? ❌ - all Axis A; pick one encoding for the weights.
- “Best single choice with no other info?” Q4_K_M GGUF (cross-platform, ~92-95% quality, big TPS/size win). Add QAT if a QAT checkpoint exists. On a Mac doing long generations, consider the MLX build of the same model for higher TPS. On a Blackwell box, prefer NVFP4 (weights) + FP8/BF16 attention for the best speed at near-FP8 quality.
10. Where the ranking depends on other factors
The order is not universal; these factors flip it:
- GPU generation (biggest flip). FP4 is fastest on NVIDIA Blackwell and FP8 on Hopper and later; on Apple M5/A19 the GPU Neural Accelerators run FP8/INT4 natively (real win), while Ampere, CPU, and pre-M5 Apple emulate them. On hardware without a matching low-precision unit, a GGUF Q4_K_M out-runs an “emulated NVFP4.” So “is NVFP4 faster than Q4_K_M?” = yes on Blackwell and (with M5 Neural Accelerators) on M5 Macs, often no on older hardware.
- TTFT vs TPS objective. If you optimize first-token latency (chat snappiness), runtime + Flash-Attention + compute class dominate, and format barely matters - GGUF/llama.cpp often beats MLX. If you optimize sustained throughput (long generations, batch), lower bits + MLX + MTP win. (prefill vs decode)
- Context length. Long context inflates the KV cache, shifting the bottleneck and rewarding KV-cache-efficient runtimes (MLX unified memory) and KV quantization; YaRN is what lets a model go past its trained window, but it costs decode speed there. MLX’s long-context advantage also narrows and then reverses past ~30-40K, where its decode runs ~50% slower than llama.cpp + Flash-Attention, because MLX’s attention kernel is not yet IO-aware (FlashAttention-style); an open mlx-lm issue tracks adding it. (mlx-lm issue #763)
- Task sensitivity. Math/code/reasoning lose more from aggressive quant than chat/summarize. For those, step up from Q4_K_M to Q5_K_M/Q6_K, or add QAT. (RunAIHome quality loss)
- Memory headroom vs model size. MoE flips the size/speed intuition: a 235B-A22B MoE decodes at ~22B speed but needs ~235B of RAM/VRAM - fast if it fits, unusable if it doesn’t. On memory-tight machines a dense Q4_K_M may beat an MoE you can’t load.
- CPU vs GPU. On CPU, decode is bandwidth-bound, so the lowest-bit k-quant that meets your quality bar maximizes TPS; FP8/FP4/MLX are irrelevant (no path).
- Linux vs WSL2. Compute parity within a few percent; WSL2 trails mainly on cold model-load / disk I/O and at the very top of the VRAM range. Not enough to change format/quant choice.
Verification pass (v2)
This revision re-checked, against current (mid-2026) sources, the four claims that the first version stated tentatively. The earlier hedging came from a model knowledge cutoff that predated these releases; each is now confirmed (or corrected) by primary/independent sources. Only the sections affected by these four items were changed from v1; the rest of the document is unchanged.
| # | Claim re-verified | Verdict | What changed vs v1 |
|---|---|---|---|
| 1 | Ollama has an MLX backend that uses NVFP4 on Apple Silicon | Confirmed | Ollama 0.19 preview (2026-03-30) rebuilt the Mac stack on MLX, uses NVFP4, keeps llama.cpp for Linux/Windows; stated as fact (no “in this dataset’s world”). |
| 2 | FP4 (NVFP4/MXFP4) landed in llama.cpp | Confirmed | NVFP4 merged (GGML_TYPE_NVFP4=40), Blackwell tensor-core dispatch in PR #22196; non-Blackwell gets memory savings only; MXFP4 in ik_llama.cpp. |
| 3 | “Apple GPUs lack FP4 tensor cores ⇒ bandwidth-only” | Corrected | M5/A19 GPUs add GPU Neural Accelerators with native FP8/INT4 matmul, so on M5 the 4-bit path is a real compute win (~2x in Ollama’s tests). No dedicated NVFP4 unit; M1-M4 remain emulated/bandwidth-only. |
| 4 | MLX long-context decode falls ~50% behind llama.cpp (“version-dependent”) | Confirmed, with cause | Documented root cause: MLX’s attention kernel is not yet IO-aware (FlashAttention-style); open mlx-lm issue #763 tracks it. Replaced the vague hedge with the mechanism. |
How the pass was run: targeted web searches for each item (Ollama 0.19 release notes + blog, llama.cpp NVFP4/MXFP4 PRs, Apple M5 GPU Neural Accelerator documentation, and MLX long-context benchmarks/issues), cross-checking a primary source (vendor blog, GitHub PR/issue, or Apple ML Research) against at least one independent benchmark or write-up before changing the text. Sources for the re-verified facts are cited inline in the affected sections and listed below.
Sources added in this verification pass
- Ollama blog - now powered by MLX on Apple Silicon (preview)
- MacRumors - Ollama now runs faster on Macs thanks to MLX
- andrew.ooo - Ollama 0.19 MLX review (2x faster on Apple Silicon)
- RunAIHome - Ollama MLX on Apple Silicon in 2026
- QUASA - Ollama full MLX support: 2x speedups + NVIDIA-quality 4-bit
- llama.cpp PR #19769 - add NVFP4 quantization type
- NVIDIA Dev Forums - llama.cpp native MXFP4 for Blackwell PR
- llama.cpp Discussion #22498 - MXFP6 to improve NVFP4
- Apple ML Research - Exploring LLMs with MLX and the M5 GPU Neural Accelerators
- tzakharko - Investigating the GPU Neural Accelerators on A19/M5
- TechBoards - Apple A19/M5 GPU Neural Accelerators
- Skorppio - Apple M5 Max vs NVIDIA DGX Spark LLM benchmark
- arXiv - Orion: Characterizing Apple’s Neural Engine for LLM training and inference
- mlx-lm Issue #763 - long-context token generation ~50% lower than llama.cpp
Bibliography
GGUF quant performance & quality
- Ollama Quantization Benchmark: q4_K_M vs q8_0 vs q5_K_M Throughput (Markaicode)
- Ollama CPU Benchmark: Tokens per Second by Quantization (Markaicode)
- GGUF Quantization: Quality vs Speed on Consumer GPUs (dasroot)
- GGUF Quantization Explained: Q4_K_M vs Q8_0 vs F16 (Vucense)
- Q4_K_M vs Q5_K_M vs Q8 - Which GGUF Quantization? (WillItRunAI)
- Q4 vs Q5 vs Q6 vs Q8: Real Quality Loss Numbers (RunAIHome)
- Ollama Quantization Explained: Q4 vs Q5 vs Q8 (ML Journey)
- Ollama Model Quantization Guide: GGUF & Accuracy Loss (BetterLink/Easton)
- Difference in quantization methods - llama.cpp Discussion #2094
MLX vs llama.cpp / Apple Silicon
- Apple’s MLX Runs Local LLMs 3x Faster Than llama.cpp - Until 40K Context (Towards AI)
- Benchmarking Apple’s MLX vs. llama.cpp (Andreas Kunar, Medium)
- Ollama vs llama.cpp vs MLX with Qwen3.5 35B on Apple Silicon (Ante Kapetanovic)
- llama.cpp vs MLX vs Ollama vs vLLM: Apple Silicon 2026 (Contra Collective)
- MLX vs llama.cpp on Apple Silicon, M5 Neural Accelerators, why Ollama switched (yage.ai)
- Apple Silicon LLM Inference Optimization Guide (Starmorph)
- Performance of llama.cpp on Apple Silicon M-series - Discussion #4167
- GGUF vs MLX Quantization Formats on Apple Silicon (Contra Collective)
- GGUF vs MLX: A Deep Dive Into LLM Model Formats (ThinkSmart)
- Ollama Goes MLX (Sebastian Gingter)
- ml-explore/mlx (GitHub)
NVFP4 / MXFP / FP8 / Blackwell
- Introducing NVFP4 for Efficient and Accurate Low-Precision Inference (NVIDIA)
- NVIDIA Blackwell: The Impact of NVFP4 For LLM Inference (Edge AI and Vision)
- FP4 Quantization on Blackwell GPUs: Throughput, Cost, When It’s Worth It (Spheron)
- FP4 vs FP8 vs FP16 LLM Inference: Quality and Speed Tradeoffs (iFactory)
- Microbenchmarking NVIDIA’s Blackwell Architecture (arXiv)
- FP8 Training Infrastructure (Introl)
- FP4 Just Landed in llama.cpp: NVFP4 vs MXFP4 (InsiderLLM)
- Ollama Quantization (DeepWiki)
QAT and combination
- Gemma 4 with quantization-aware training (Google)
- Quantization-Aware Training (QAT) (Unsloth)
- Quantization-Aware Training for LLMs with PyTorch (PyTorch)
- How Quantization-Aware Training Enables Low-Precision Accuracy Recovery (NVIDIA)
- Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery (arXiv)
- NeMo Framework QAT for Llama2 SFT (NVIDIA)
TTFT / TPS / prefill-decode fundamentals
- Prefill Is Compute-Bound, Decode Is Memory-Bound (Towards Data Science)
- Prefill vs Decode: LLM Inference Phases Explained (Redis)
- Prefill-decode disaggregation (BentoML LLM Inference Handbook)
- Why LLM Inference Is Memory-Bound (Not Compute-Bound) (Medium)