Model-ID stem analysis - quality / speed / size rankings and combinability

Jun 29, 2026 #ai #llm #quantization

Model-ID Stem Analysis: quality / speed / size rankings and combinability

A companion to Model-ID glossary. This report analyzes the architectural and quantization stems (Q4_K_M, Q8_0, NVFP4, MXFP8, BF16, FP8, MLX, QAT, MTP, MoE active-params, distill, YaRN, LASER, MatFormer, etc.) and ranks them on:

Quality - generation quality, instruction following, reasoning/math/code fidelity.
Speed - TTFT (time to first token = prefill latency) and TPS (tokens/sec = decode throughput). These two move differently, which is why some rankings split.
Size - on disk and in RAM/VRAM.
across Mac (Apple Silicon), Linux (NVIDIA), and WSL2 (NVIDIA).

Plus a combination matrix: which stems stack and which are mutually exclusive (e.g. can I have QAT with Q4_K_M? NVFP4 with QAT and MLX on Ollama?).

Figures below are drawn from published 2026 benchmarks (linked inline and in the Bibliography). Absolute throughput and the exact ordering depend on the model, context length, batch size, GPU generation, and runtime version; where the order flips, the governing factor is called out.

Model-ID Stem Analysis: quality / speed / size rankings and combinability

1. The one idea that makes this tractable: three orthogonal axes

Every stem in a model tag belongs to exactly one of three independent axes. You pick one value from Axis A and one from Axis B, and stack any number from Axis C:

Axis	What it is	Pick how many	Stems
A. Numeric encoding	how the weights are stored (bits + scheme)	exactly one per tensor group	`F16`/`FP16`, `BF16`, `FP8`, `INT8`/`INT4`, `Q8_0`, `Q6_K`, `Q5_K_M`/`Q5_K_S`/`Q5_0`/`Q5_1`, `Q4_K_M`/`Q4_K_S`/`Q4_0`/`Q4_1`, `Q3_K_L`/`Q3_K_M`/`Q3_K_S`, `Q2_K`, `NVFP4`, `MXFP4`, `MXFP8`
B. Runtime / container	what executes the weights	exactly one at run time	`MLX` (Apple), GGUF/llama.cpp (cross-platform), vendor GPU stacks (TensorRT/vLLM)
C. Build-time technique	how the model was trained/derived	any number	`QAT`, `MTP`, MoE/`A22B`, `distill`, `DPO`, `LASER`, `YaRN`/`gradient`, MatFormer (`E2B`/`E4B`), abliteration/`uncensored`

The single most common confusion - “can I combine Q4_K_M and NVFP4?” - dissolves here: both are Axis A, so no, you choose one. But QAT (Axis C) + Q4_K_M (Axis A) + GGUF (Axis B)? Yes - that is exactly what a Gemma QAT GGUF is. (Google Gemma 4 QAT, Ollama goes MLX)

2. Why TTFT and TPS rank differently

Prefill (reading your prompt, producing the first token) is compute-bound - big matrix-matrix multiplies, high arithmetic intensity. It sets TTFT.
Decode (each subsequent token) is memory-bandwidth-bound - matrix-vector ops that repeatedly stream the weights and a growing KV cache. It sets TPS.

Consequence: quantization mainly buys TPS, not TTFT, because it shrinks bytes moved per token during the bandwidth-bound decode phase; it does little for the compute-bound prefill. A format can therefore win on TPS and lose on TTFT (this is exactly the MLX-vs-llama.cpp story below). (Towards Data Science - prefill compute-bound, decode memory-bound, Redis - Prefill vs Decode, Why LLM inference is memory-bound)

Rules of thumb used throughout:

Disk size ≈ bits-per-weight × parameters. RAM/VRAM ≈ disk + KV cache + runtime overhead.
Lower bits ⇒ smaller + faster decode (higher TPS), at some quality cost.
TTFT is governed far more by runtime, batching, Flash-Attention, and GPU compute than by the weight format.

3. Platform reality: what can even run where

Before ranking, note that some Axis-A/Axis-B choices simply don’t exist on some platforms. Native = dedicated hardware path (fast); emulated = upcast/dequantize in software (works, not fast).

Stem	Mac (Apple Silicon)	Linux (NVIDIA)	WSL2 (NVIDIA)	Pure CPU
GGUF `Q2_K…Q8_0`, `F16`	Native (Metal)	Native (CUDA)	Native (CUDA)	Yes
`BF16`	Yes (Metal/CPU)	Native	Native	Yes (slow)
`FP8`	Native on M5/A19 GPU Neural Accelerators; emulated on M1-M4	Native on Hopper+	Native on Hopper+	No
`NVFP4`	Ollama 0.19 MLX backend; on M5 the GPU Neural Accelerators accelerate the matmul (native FP8/INT4 paths); no dedicated NVFP4 tensor unit; M1-M4 storage/bandwidth only	Native only on Blackwell; emulated on Hopper/Ampere	Native only on Blackwell	No
`MXFP4` / `MXFP8`	M5 GPU Neural Accelerators (native FP8/INT4); M1-M4 emulated	Native on Blackwell-class; else emulated	Native on Blackwell-class	No
`MLX` (runtime)	Apple-silicon only	N/A	N/A	N/A
GGUF runtime (llama.cpp/Ollama)	Yes	Yes	Yes	Yes

Key platform facts:

MLX is Mac-only. There is no MLX on Linux or WSL. (ml-explore/mlx)
NVFP4/MXFP4 are Blackwell-native on NVIDIA. On Hopper/Ampere they are emulated; FP8 needs Hopper+. NVFP4 is merged in llama.cpp (GGML_TYPE_NVFP4 = 40, kernels landed late Mar-Apr 2026; Blackwell tensor-core dispatch in PR #22196); non-Blackwell NVIDIA cards get the memory savings only. MXFP4 (OCP variant) lives in ik_llama.cpp. (NVIDIA Introducing NVFP4, Spheron FP4 on Blackwell, llama.cpp PR #19769 (NVFP4), InsiderLLM FP4 in llama.cpp)
On Apple Silicon, low precision is real on M5, emulated before it. Ollama 0.19 (preview, released 2026-03-30) rebuilds the Mac stack on Apple’s MLX framework and uses NVFP4 for 4-bit, while keeping llama.cpp for Linux/Windows. M5 / M5 Pro / M5 Max (and A19) GPUs add GPU Neural Accelerators - matrix units that natively run FP8 and INT4 - so on M5 the 4-bit path is a genuine compute win (Ollama measured ~2x: e.g. M5 Max + Qwen3.5-35B-A3B NVFP4, prefill 1,154→1,810 tok/s, decode 58→112 tok/s), not just a bandwidth win. There is still no Apple equivalent of NVIDIA’s dedicated NVFP4 tensor unit, and on M1-M4 (no matrix accelerators) the FP4/FP8 benefit is storage/bandwidth only. (Ollama blog - now powered by MLX, MacRumors - Ollama faster on Macs, Apple ML Research - LLMs with MLX on M5, tzakharko - A19/M5 Neural Accelerators benchmark)
Linux vs WSL2: with NVIDIA GPU passthrough, WSL2 CUDA compute is near-native; the practical gaps are slightly higher model-load/disk-I/O latency and occasional driver/VRAM-reporting quirks. Compute-bound TTFT and decode TPS are within a few percent of bare Linux; treat them as equivalent for ranking, with WSL2 a touch behind on cold-start/load and on very large models near the VRAM limit.
Pure CPU (any OS): only GGUF int/float formats; lower bits help most because CPU memory bandwidth is the binding constraint. (Markaicode CPU benchmark)

4. Axis A - numeric encodings ranked

All else equal (same model, same runtime). “Quality” = closeness to FP16 baseline.

4.1 Quality (best → worst)

F16/BF16  ≈  Q8_0  >  Q6_K  >  Q5_K_M > Q5_K_S  >  Q4_K_M > Q4_K_S  >  Q4_0/Q4_1
          >  Q3_K_L > Q3_K_M > Q3_K_S  >  Q2_K

Q8_0 is effectively lossless; Q6_K ~+2% over Q4_K_M baseline, Q5_K_M ~+1.5%, and Q4_K_M retains ~92-95% of FP16 quality - the standard “sweet spot.” Below Q4, quality degrades fast, and Q2_K is noticeably worse (use only when nothing else fits). (RunAIHome Q4/Q5/Q6/Q8 quality loss, WillItRunAI quant guide)
k-quant (_K_) beats legacy (_0/_1) at equal bit-width because of the super-block scale structure; within a level, _M > _S (more tensors kept at higher precision). _1 ≥ _0 (asymmetric: scale + min). (llama.cpp discussion #2094)
FP4 family for quality: NVFP4 > MXFP4 at the same 4 bits (~88% lower quantization error; block-16 + FP8 scale vs block-32 + power-of-two scale). NVFP4 can match/slightly beat FP8 on some reasoning tasks when paired with FP8/BF16 attention (mixed precision), not as pure FP4. (Edge-AI-Vision NVFP4 impact, iFactory FP4 vs FP8 vs FP16)
Rough quality placement of the GPU formats among the GGUF ladder: BF16 ≈ FP16 > FP8 ≈ Q8_0 > Q6_K > NVFP4 ≳ Q5_K_M ≳ MXFP8(4-bit-ish use) ≳ Q4_K_M > MXFP4. (NVFP4 lands near a high-quality 4-5 bit GGUF; MXFP4 near a plain 4-bit.)

4.2 TPS / decode throughput (fastest → slowest)

On a Blackwell GPU, FP4 has dedicated 4x-throughput tensor cores, so:

NVFP4 ≈ MXFP4  >  FP8/MXFP8  >  BF16/F16        (Blackwell: ~4x / ~2x / ~1x stair)

On hardware without an FP4 tensor unit (non-Blackwell NVIDIA, CPU, and pre-M5 Apple) decode is purely bandwidth-bound and TPS tracks fewer bits = faster:

Q2_K > Q3_K_* > Q4_0/Q4_K_S > Q4_K_M > Q5_K_* > Q6_K > Q8_0 > F16/BF16

The fuller picture on Apple Metal (correcting a too-broad statement that earlier lumped all Apple GPUs in as “no FP4 tensor cores, bandwidth-only”): M5 / A19 GPUs add GPU Neural Accelerators that execute FP8 and INT4 natively, so on M5-class Macs low-bit decode gets a real compute speedup (not just a bandwidth win), and the order above understates how fast 4-bit/8-bit run there. Apple still has no dedicated NVFP4 unit like Blackwell’s, so NVFP4 specifically leans on those native FP8/INT4 paths plus memory savings; M1-M4 Macs have no matrix accelerators at all and remain purely bandwidth-bound as the ladder shows. (Apple ML Research - LLMs with MLX on M5, tzakharko - A19/M5 Neural Accelerators benchmark)

Example magnitudes: RTX 4090, Llama-3.2-8B - Q4_K_M 112 tok/s vs Q8_0 83 tok/s (+35%); CPU 12-thread - Q4_K_M 14.2 vs Q5_K_M 8.5 vs Q8_0 6.8 tok/s. (Markaicode Ollama quant benchmark, dasroot GGUF quality vs speed)

4.3 TTFT (best → worst)

Weight format has second-order effect on TTFT (prefill is compute-bound). The first-order levers are GPU compute class and runtime. That said:

On Blackwell, NVFP4/FP8 also give the best TTFT because the prefill matmuls run on the faster low-precision tensor cores. (Edge-AI-Vision)
On Mac/CPU/non-Blackwell, TTFT is roughly format-insensitive; a lower-bit model is not meaningfully faster to first token, and can even be marginally slower if it must dequantize.

4.4 Size on disk and in RAM (smallest → largest)

Tracks bits-per-weight directly:

Q2_K(~2.6) < Q3_K_S/M/L(~3.4) < NVFP4/MXFP4(~4.0-4.25) < Q4_0/Q4_K_S(~4.3) < Q4_K_M(~4.5)
< Q5_K_*(~5.5) < Q6_K(~6.5) < MXFP8/FP8(~8) ≈ Q8_0(~8.5) < BF16/F16(16)

7B reference: FP16 ~13.5 GB → Q4_K_M ~4.1 GB → Q2_K ~2.8 GB. RAM/VRAM ≈ disk + KV cache + overhead (RTX 4090: Q4_K_M 5.8 GB resident vs Q8_0 9.1 GB). On Apple unified memory, “RAM” is the budget for both weights and KV cache. (Vucense GGUF sizes)

5. Axis B - runtime / container: MLX vs GGUF

Same weights, different engine. Mac only (the only platform where you actually choose).

Metric	MLX	GGUF (llama.cpp / Ollama)	Who wins
Decode TPS	~1.4-1.8x raw llama.cpp; bigger lead at long context (KV stays in unified memory)	baseline	MLX for sustained generation
TTFT / prefill	often slower	faster (Metal + Flash-Attention)	GGUF for short prompts / snappy first token
Long context (30K+)	KV-cache-efficient, but decode runs ~50% slower than llama.cpp+FlashAttn because MLX’s attention kernel is not yet IO-aware (FlashAttention-style); an open mlx-lm issue tracks adding it	strong with Flash-Attention	GGUF at long context
Ecosystem & portability	Apple-only; needs MLX conversion	universal GGUF; runs everywhere	GGUF
Quantization scheme	MLX’s own (e.g. 4-bit MLX, `mlx-bf16`)	GGUF `Q*`/`F16`	n/a (different schemes)

Net: on a Mac, MLX for max TPS on long generations; GGUF/Ollama for best TTFT, portability, and the widest model selection. Ollama’s overhead means “MLX is 3x faster” claims are mostly Ollama-vs-MLX, not llama.cpp-vs-MLX. (Towards AI MLX vs llama.cpp, Ante Kapetanovic Qwen3.5 Apple Silicon, yage.ai MLX vs llama.cpp, Contra Collective GGUF vs MLX)

6. Axis C - build-time techniques ranked

These travel with the weights and are largely platform-independent. They do not change the numeric format; they change quality, or (for MoE/MTP) the speed/size math.

Stem	Quality effect	TPS effect	TTFT effect	Size effect	Notes
`QAT`	+ (recovers up to ~70% of quant loss; +1-3% GPQA/MMLU-Pro vs PTQ)	none (same format)	none	none	makes a low-bit model behave like a higher-bit one
`MTP`	neutral	++ (~1.8x via speculative decode, runtime-permitting)	slight +	tiny (extra head)	needs runtime support to realize the speedup
MoE / `A22B`	high quality per active-FLOP	++ (decode runs at active params, e.g. 22B)	+	−− (RAM/disk = total params, e.g. 235B)	great speed/quality, heavy memory
`distill`	slight − vs teacher, big + vs same-size base	+ (smaller student)	+	+ (smaller)	how small reasoning models get R1-like skills
`DPO`	+ (alignment/preference)	none	none	none	post-training preference alignment
`LASER`	+ on targeted tasks (can also regress)	tiny + (low-rank)	none	tiny −	SVD rank-reduction on select layers
`YaRN`/`gradient`	neutral at short ctx; enables long ctx	− at long ctx (more KV)	− at long ctx	KV cache grows	context-window extension
MatFormer `E2B`/`E4B`	E4B > E2B	E2B faster	E2B faster	stored > effective (PLE)	pick a nested submodel size
`uncensored`/abliteration	removes refusals; may slightly dent benchmark quality	none	none	none	behavior change, not size

Sources: Unsloth QAT, NVIDIA QAT accuracy recovery, Sebastian Raschka MTP, Qwen3 Technical Report (MoE), EmergentMind DeepSeek-R1 distilled, YaRN (arXiv), HF MatFormer in Gemma 3n, LASER (arXiv).

7. Per-stem reference cards

Compact card per stem: Axis, quality, TPS, TTFT, size, best platform(s).

Numeric encodings (Axis A)

F16/FP16 - 16-bit float. Quality: baseline (100%). TPS: slowest. TTFT: neutral. Size: largest (16 bpw). Platform: all. Use as the quality reference / for further quantization.
BF16 - 16-bit, FP32 exponent range. Quality ≈ FP16, more training-stable. Size 16 bpw. Platform: all (native on modern GPU; mlx-bf16 on Mac).
FP8 (E4M3/E5M2) - 8-bit float. Quality ≈ Q8_0. TPS ~2x BF16 on Hopper/Blackwell. Size ~8 bpw. Platform: NVIDIA Hopper+ (Linux/WSL); emulated elsewhere.
Q8_0 - 8-bit GGUF. Quality near-lossless (~103% of Q4_K_M). TPS slow-ish. Size ~8.5 bpw. Platform: all. Use when you have memory and want max fidelity locally.
Q6_K - 6-bit k-quant. Quality ~102%. Good high-fidelity/size compromise. Platform: all.
Q5_K_M / Q5_K_S - 5-bit k-quant (_M keeps more high-precision tensors). Quality ~101.5%. Recommended for code/math when you have 12GB+. Platform: all.
Q5_0 / Q5_1 - legacy 5-bit (_1 asymmetric, slightly better). Superseded by Q5_K_*.
Q4_K_M - 4-bit k-quant, the default sweet spot: ~92-95% quality, ~4.5 bpw, big TPS win. Platform: all. The single best “just give me one” choice for local.
Q4_K_S - smaller/slightly lower-quality Q4 k-quant. Platform: all.
Q4_0 / Q4_1 - legacy 4-bit; the canonical QAT export target (Gemma QAT = Q4_0). Quality below Q4_K_M at equal bits unless QAT’d. Platform: all.
Q3_K_L/M/S - 3-bit k-quant; visible quality loss; for tight memory. Platform: all.
Q2_K - 2-bit; largest quality hit; last resort to fit. Platform: all.
NVFP4 - NVIDIA 4-bit float, block-16 + FP8 scale. Quality best-in-class for 4-bit (~near FP8 with mixed attention). TPS/TTFT: fastest on Blackwell (4x stair); emulated elsewhere. Size ~4 bpw. Platform: Blackwell GPU natively; Mac via MLX backend (bandwidth win only).
MXFP4 - OCP 4-bit float, block-32 + power-of-two scale. Quality below NVFP4. Speed like NVFP4 on supporting HW. Size ~4 bpw. Platform: Blackwell-class.
MXFP8 - OCP 8-bit microscaling (E4M3/E5M2). Quality ≈ FP8. Size ~8 bpw. Platform: Hopper/Blackwell-class; emulated elsewhere.
INT8 / INT4 - uniform integer quant. Simpler than k-quants; INT8 ≈ Q8 quality, INT4 below k-quant Q4 unless QAT’d. Platform: broad (TensorRT/vendor stacks, some GGUF).

Runtime (Axis B)

MLX - Apple-silicon runtime. Best sustained TPS on Mac, weaker TTFT, Apple-only. See §5.

Build-time techniques (Axis C)

QAT, MTP, MoE/A22B, distill, DPO, LASER, YaRN/gradient, MatFormer E2B/E4B, abliteration - see the table in §6.

8. Combination matrix: what stacks and what is mutually exclusive

Legend: ✅ stacks · ❌ mutually exclusive · ➖ same axis, pick one · ⚠️ works but platform/notes apply.

	GGUF `Q*`	`NVFP4`/`MXFP*`	`FP8`	`BF16`/`F16`	`MLX` runtime	`QAT`	`MTP`	MoE	`distill`/`DPO`/`LASER`/`YaRN`
*GGUF `Q`**	➖	➖	➖	➖	⚠️¹	✅	✅	✅	✅
*`NVFP4`/`MXFP`**	➖	➖	⚠️²	⚠️²	⚠️³	✅	✅	✅	✅
`FP8`	➖	⚠️²	➖	⚠️²	❌⁴	✅	✅	✅	✅
`BF16`/`F16`	➖	⚠️²	⚠️²	➖	✅ (`mlx-bf16`)	✅	✅	✅	✅
`MLX` runtime	⚠️¹	⚠️³	❌⁴	✅	—	✅	✅	✅	✅
`QAT`	✅	✅	✅	✅	✅	—	✅	✅	✅
`MTP`	✅	✅	✅	✅	✅	✅	—	✅	✅
MoE	✅	✅	✅	✅	✅	✅	✅	—	✅
`distill`/`DPO`/`LASER`/`YaRN`	✅	✅	✅	✅	✅	✅	✅	✅	✅ (mutually stackable)

Footnotes:

GGUF Q* + MLX: different quant schemes. MLX doesn’t run a Q4_K_M GGUF; it uses its own 4-bit MLX quant. So “Q4_K_M on MLX” is not a thing - convert the model to MLX quant instead. (GGUF runs on llama.cpp/Ollama; MLX runs MLX-format weights.)
Mixing two numeric formats in one model is normal at the boundary level, not the tensor level: e.g. NVFP4 weights + FP8 or BF16 attention/KV is a recommended Blackwell config. You still pick one format per tensor group; you don’t double-encode a tensor.
NVFP4/MXFP* + MLX: ✅ via the Ollama 0.19 MLX backend. On M5-class Macs the GPU Neural Accelerators run the 4-bit matmul natively (FP8/INT4 paths), so it is a real compute win (~2x in Ollama’s tests), though there is no Apple equivalent of NVIDIA’s dedicated NVFP4 tensor unit. On M1-M4 the benefit is size/bandwidth only.
FP8 + MLX: ✅ on M5/A19 GPUs, whose Neural Accelerators execute FP8 natively; on M1-M4 FP8 is emulated. (Separately, NVIDIA’s Hopper/Blackwell FP8 tensor-core path is its own track.)

The governing rule: one Axis-A encoding (per tensor group) + one Axis-B runtime + any stack of Axis-C techniques. Everything labeled ➖ or ❌ above is an attempt to pick two from an axis that only allows one, or to run a format on a runtime that can’t execute it.

9. Worked answers to the example questions

“Can I have QAT with Q4_K_M?” ✅ Yes. QAT (Axis C) is orthogonal to the format. You QAT-train the model, then export to a GGUF quant. In practice Google ships QAT as Q4_0; exporting the same QAT checkpoint to Q4_K_M is fine and slightly higher quality than Q4_0. Net: near-FP16 quality at 4-bit size/speed. (Google Gemma 4 QAT, Unsloth QAT)
“Can I have NVFP4 with QAT and MLX on Ollama?” Partly.
- NVFP4 + QAT: ✅ - QAT (quant-aware distillation) specifically exists to recover NVFP4 accuracy. (QAD for NVFP4, arXiv)
- + MLX on Ollama: ✅ - Ollama 0.19 (preview, 2026-03-30) runs NVFP4 weights through its MLX backend on Apple Silicon. On M5 / M5 Pro / M5 Max the GPU Neural Accelerators speed up both TTFT and decode (Ollama: M5 Max + Qwen3.5-35B-A3B NVFP4, prefill 1,154→1,810 tok/s, decode 58→112 tok/s, ~2x). It is not a byte-for-byte match to Blackwell - Apple has no dedicated NVFP4 tensor unit, and on M1-M4 the gain is memory/bandwidth only - but on M5 it is a real compute win, and a Blackwell NVIDIA GPU on Linux/WSL remains the fastest NVFP4 path overall. (Ollama blog - now powered by MLX, Apple ML Research - LLMs with MLX on M5, Spheron FP4 on Blackwell)
- NVFP4 + MXFP8 + Q4_K_M together? ❌ - all Axis A; pick one encoding for the weights.
“Best single choice with no other info?” Q4_K_M GGUF (cross-platform, ~92-95% quality, big TPS/size win). Add QAT if a QAT checkpoint exists. On a Mac doing long generations, consider the MLX build of the same model for higher TPS. On a Blackwell box, prefer NVFP4 (weights) + FP8/BF16 attention for the best speed at near-FP8 quality.

10. Where the ranking depends on other factors

The order is not universal; these factors flip it:

GPU generation (biggest flip). FP4/FP8 are fastest on NVIDIA Blackwell/Hopper; on Apple M5/A19 the GPU Neural Accelerators run FP8/INT4 natively (real win), while Ampere, CPU, and pre-M5 Apple emulate them. On hardware without a matching low-precision unit, a GGUF Q4_K_M out-runs an “emulated NVFP4.” So “is NVFP4 faster than Q4_K_M?” = yes on Blackwell and (with M5 Neural Accelerators) on M5 Macs, often no on older hardware.
TTFT vs TPS objective. If you optimize first-token latency (chat snappiness), runtime + Flash-Attention + compute class dominate, and format barely matters - GGUF/llama.cpp often beats MLX. If you optimize sustained throughput (long generations, batch), lower bits + MLX + MTP win. (prefill vs decode)
Context length. Long context inflates the KV cache, shifting the bottleneck and rewarding KV-cache-efficient runtimes (MLX unified memory) and KV quantization; YaRN is required to go long at all but costs decode speed there. MLX’s long-context advantage also narrows and then reverses past ~30-40K, where its decode runs ~50% slower than llama.cpp + Flash-Attention, because MLX’s attention kernel is not yet IO-aware (FlashAttention-style); an open mlx-lm issue tracks adding it. (mlx-lm issue #763)
Task sensitivity. Math/code/reasoning lose more from aggressive quant than chat/summarize. For those, step up from Q4_K_M to Q5_K_M/Q6_K, or add QAT. (RunAIHome quality loss)
Memory headroom vs model size. MoE flips the size/speed intuition: a 235B-A22B MoE decodes at ~22B speed but needs ~235B of RAM/VRAM - fast if it fits, unusable if it doesn’t. On memory-tight machines a dense Q4_K_M may beat an MoE you can’t load.
CPU vs GPU. On CPU, everything is bandwidth-bound, so the lowest-bit k-quant that meets your quality bar maximizes TPS; FP8/FP4/MLX are irrelevant (no path).
Linux vs WSL2. Compute parity within a few percent; WSL2 trails mainly on cold model-load / disk I/O and at the very top of the VRAM range. Not enough to change format/quant choice.

Verification pass (v2)

This revision re-checked, against current (mid-2026) sources, the four claims that the first version stated tentatively. The earlier hedging came from a model knowledge cutoff that predated these releases; each is now confirmed (or corrected) by primary/independent sources. Only the sections affected by these four items were changed from v1; the rest of the document is unchanged.

#	Claim re-verified	Verdict	What changed vs v1
1	Ollama has an MLX backend that uses NVFP4 on Apple Silicon	Confirmed	Ollama 0.19 preview (2026-03-30) rebuilt the Mac stack on MLX, uses NVFP4, keeps llama.cpp for Linux/Windows; stated as fact (no “in this dataset’s world”).
2	FP4 (NVFP4/MXFP4) landed in llama.cpp	Confirmed	NVFP4 merged (`GGML_TYPE_NVFP4=40`), Blackwell tensor-core dispatch in PR #22196; non-Blackwell gets memory savings only; MXFP4 in ik_llama.cpp.
3	“Apple GPUs lack FP4 tensor cores ⇒ bandwidth-only”	Corrected	M5/A19 GPUs add GPU Neural Accelerators with native FP8/INT4 matmul, so on M5 the 4-bit path is a real compute win (~2x in Ollama’s tests). No dedicated NVFP4 unit; M1-M4 remain emulated/bandwidth-only.
4	MLX long-context decode falls ~50% behind llama.cpp (“version-dependent”)	Confirmed, with cause	Documented root cause: MLX’s attention kernel is not yet IO-aware (FlashAttention-style); open mlx-lm issue #763 tracks it. Replaced the vague hedge with the mechanism.

How the pass was run: targeted web searches for each item (Ollama 0.19 release notes + blog, llama.cpp NVFP4/MXFP4 PRs, Apple M5 GPU Neural Accelerator documentation, and MLX long-context benchmarks/issues), cross-checking a primary source (vendor blog, GitHub PR/issue, or Apple ML Research) against at least one independent benchmark or write-up before changing the text. Sources for the re-verified facts are cited inline in the affected sections and listed below.

Model-ID stem analysis - quality / speed / size rankings and combinability

Model-ID Stem Analysis: quality / speed / size rankings and combinability

Contents

1. The one idea that makes this tractable: three orthogonal axes

2. Why TTFT and TPS rank differently

3. Platform reality: what can even run where

4. Axis A - numeric encodings ranked

4.1 Quality (best → worst)

4.2 TPS / decode throughput (fastest → slowest)

4.3 TTFT (best → worst)

4.4 Size on disk and in RAM (smallest → largest)

5. Axis B - runtime / container: MLX vs GGUF

6. Axis C - build-time techniques ranked

7. Per-stem reference cards

Numeric encodings (Axis A)

Runtime (Axis B)

Build-time techniques (Axis C)

8. Combination matrix: what stacks and what is mutually exclusive

9. Worked answers to the example questions

10. Where the ranking depends on other factors

Verification pass (v2)

Sources added in this verification pass

Bibliography

GGUF quant performance & quality

MLX vs llama.cpp / Apple Silicon

NVFP4 / MXFP / FP8 / Blackwell

QAT and combination

TTFT / TPS / prefill-decode fundamentals

Architectural techniques (MoE / MTP / distill / YaRN / MatFormer / LASER)

Discussion