Model glossary run 06 - Stem axis classification and combinability

Stem axis classification and combinability (generated)

Generated by model_glossary_run_06_combination_matrix.py.

Axis A - numeric encodings (pick one)

Encoding   ~bits/weight  ~quality vs FP16  native platform
F16/FP16   16.0          100%              all
BF16       16.0          100%              all
FP8        8.0           99%               NVIDIA Hopper+
MXFP8      8.0           99%               Hopper/Blackwell-class
Q8_0       8.5           99%               all (GGUF)
Q6_K       6.5           98%               all (GGUF)
Q5_K_M     5.5           97%               all (GGUF)
Q5_K_S     5.4           96%               all (GGUF)
Q5_0/Q5_1  5.5           95%               all (GGUF, legacy)
Q4_K_M     4.5           94%               all (GGUF)
Q4_K_S     4.3           92%               all (GGUF)
Q4_0/Q4_1  4.3           90%               all (GGUF, legacy; QAT target)
NVFP4      4.0           93%               NVIDIA Blackwell (native)
MXFP4      4.1           90%               Blackwell-class
INT8       8.0           98%               broad (vendor stacks)
INT4       4.0           88%               broad (vendor stacks)
Q3_K_L     3.5           84%               all (GGUF)
Q3_K_M     3.4           82%               all (GGUF)
Q3_K_S     3.3           80%               all (GGUF)
Q2_K       2.6           70%               all (GGUF)
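
The bits/weight column converts to an approximate weight size as parameters x bits / 8 bytes. The following is a minimal sketch of that arithmetic, assuming 1 GB = 10^9 bytes and ignoring KV cache, activations and file metadata. The function name `approx_weight_gb` and the 7B parameter count are illustrative and are not taken from the generating script.

```python
# Approximate bits/weight, copied from a subset of the Axis A table.
BITS_PER_WEIGHT = {
    "F16/FP16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.5,
    "Q4_K_M": 4.5,
    "NVFP4": 4.0,
    "Q2_K": 2.6,
}


def approx_weight_gb(n_params: float, encoding: str) -> float:
    """Rough size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[encoding]
    return n_params * bits / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical 7B-parameter model, one line per encoding.
    for enc in BITS_PER_WEIGHT:
        print(f"{enc:10s} {approx_weight_gb(7e9, enc):7.2f} GB")
```

Under these assumptions a 7B model needs about 14 GB at F16 and about 3.94 GB at Q4_K_M. Real files will differ somewhat, since the table values are themselves approximate.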

Axis B - runtime / container (pick one)

Runtime         platform
MLX             Apple Silicon only
GGUF/llama.cpp  all (Mac/Linux/WSL/CPU)
TensorRT/vLLM   NVIDIA GPU (Linux/WSL)

Axis C - build-time techniques (stack any)

Technique                effect
QAT                      recovers quantization loss (+1-3% vs PTQ); orthogonal to format
MTP                      multi-token-prediction head for speculative decoding; ~1.8x tokens/s where the runtime supports it
MoE/A22B                 decode cost scales with ACTIVE params; RAM/disk scale with TOTAL params
distill                  smaller student model; faster and smaller, with a small quality gap vs the teacher
DPO                      preference alignment; no size or speed change
LASER                    SVD rank reduction on selected layers; quality effect is task-dependent
YaRN/gradient            context extension; costs decode speed and KV-cache memory at long context
MatFormer E2B/E4B        nested submodel sizes; stored size > effective size (per-layer embeddings, PLE)
abliteration/uncensored  removes refusals; behavior change only
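
The MoE row is the one most often misread, so here is a small sketch of the two quantities it separates. It assumes decode is memory-bandwidth bound, so per-token cost is approximated by the bytes of active weights read. The `MoEModel` class, the function names and the 235B total parameter count are hypothetical; the "A22B" stem itself only states the 22B active count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEModel:
    total_params: float   # all experts; determines RAM and disk footprint
    active_params: float  # parameters used per token; drives decode cost


def weight_footprint_gb(model: MoEModel, bits_per_weight: float) -> float:
    """Weights that must be resident: scales with TOTAL params."""
    return model.total_params * bits_per_weight / 8 / 1e9


def weights_read_per_token_gb(model: MoEModel, bits_per_weight: float) -> float:
    """Weights touched per decoded token: scales with ACTIVE params."""
    return model.active_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Illustrative numbers only: 235B total is an assumption, 22B active
    # comes from the "A22B" stem.
    model = MoEModel(total_params=235e9, active_params=22e9)
    bpw = 4.5  # Q4_K_M from the Axis A table
    print(f"resident weights: {weight_footprint_gb(model, bpw):.1f} GB")
    print(f"read per token:   {weights_read_per_token_gb(model, bpw):.1f} GB")
```

With these example numbers the model needs roughly 132 GB of memory for weights but reads only about 12.4 GB of them per token, which is why it decodes like a much smaller dense model while still requiring the full footprint.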

Selected cross-stem combination checks

A              B               result                  why / note
QAT            Q4_K_M          OK (stacks)             orthogonal: stacks
QAT            Q4_0/Q4_1       OK (stacks)             orthogonal: stacks
QAT            NVFP4           OK (stacks)             orthogonal: stacks
NVFP4          MLX             WARN (works, see note)  stored only; Apple has no FP4/FP8 tensor cores (no compute speedup)
NVFP4          GGUF/llama.cpp  OK (stacks)             GGUF carries this encoding
NVFP4          Q4_K_M          X (mutually exclusive)  same axis (encoding): pick one per tensor group
MXFP8          MLX             WARN (works, see note)  stored only; Apple has no FP4/FP8 tensor cores (no compute speedup)
FP8            MLX             X (mutually exclusive)  FP8 tensor-core exec is an NVIDIA path, not MLX
Q4_K_M         MLX             WARN (works, see note)  MLX uses its own quant, not GGUF Q*; convert to MLX quant
BF16           MLX             OK (stacks)             MLX runs bf16/f16 (e.g. mlx-bf16)
MTP            Q4_K_M          OK (stacks)             orthogonal: stacks
MTP            NVFP4           OK (stacks)             orthogonal: stacks
MoE/A22B       Q4_K_M          OK (stacks)             orthogonal: stacks
MoE/A22B       NVFP4           OK (stacks)             orthogonal: stacks
distill        Q4_K_M          OK (stacks)             orthogonal: stacks
YaRN/gradient  MLX             OK (stacks)             orthogonal: stacks
NVFP4          MXFP4           X (mutually exclusive)  same axis (encoding): pick one per tensor group
Q8_0           Q4_K_M          X (mutually exclusive)  same axis (encoding): pick one per tensor group
FP8            NVFP4           X (mutually exclusive)  same axis (encoding): pick one per tensor group
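
The pattern in these checks can be sketched as a simple rule: two stems on a "pick one" axis exclude each other, stems on different axes stack, and a short list of encoding/runtime pairs overrides the default. This is a hedged reconstruction of the rule from the rows above, not the logic of the generating script. The names `axis_of`, `check` and `OVERRIDES` are hypothetical, and only a subset of stems is listed.

```python
# Subset of stems from the three axis tables.
ENCODINGS = {"BF16", "FP8", "MXFP8", "Q8_0", "Q4_K_M", "Q4_0/Q4_1", "NVFP4", "MXFP4"}
RUNTIMES = {"MLX", "GGUF/llama.cpp", "TensorRT/vLLM"}
TECHNIQUES = {"QAT", "MTP", "MoE/A22B", "distill", "DPO", "LASER", "YaRN/gradient"}

OK = "OK (stacks)"
WARN = "WARN (works, see note)"
EXCLUSIVE = "X (mutually exclusive)"

# Cross-axis pairs that the table flags despite being on different axes.
OVERRIDES = {
    frozenset({"NVFP4", "MLX"}): WARN,
    frozenset({"MXFP8", "MLX"}): WARN,
    frozenset({"Q4_K_M", "MLX"}): WARN,
    frozenset({"FP8", "MLX"}): EXCLUSIVE,
}


def axis_of(stem: str) -> str:
    """Return 'A' (encoding), 'B' (runtime) or 'C' (build-time technique)."""
    if stem in ENCODINGS:
        return "A"
    if stem in RUNTIMES:
        return "B"
    if stem in TECHNIQUES:
        return "C"
    raise ValueError(f"unknown stem: {stem!r}")


def check(a: str, b: str) -> str:
    """Classify a pair of stems; the result does not depend on argument order."""
    axis_a, axis_b = axis_of(a), axis_of(b)  # also validates both stems
    override = OVERRIDES.get(frozenset({a, b}))
    if override is not None:
        return override
    # Axes A and B are "pick one"; axis C techniques stack freely.
    if axis_a == axis_b and axis_a in ("A", "B"):
        return EXCLUSIVE
    return OK


if __name__ == "__main__":
    for pair in [("QAT", "Q4_K_M"), ("NVFP4", "MLX"), ("Q8_0", "Q4_K_M")]:
        print(pair, "->", check(*pair))
```

Keeping the exceptions in an explicit lookup, rather than deriving them from hardware capabilities, mirrors how the table presents them: each one carries its own note.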
