Stem axis classification and combinability (generated)
Generated by model_glossary_run_06_combination_matrix.py.
Axis A - numeric encodings (pick one)
| Encoding | ~bits/weight | ~quality vs FP16 | native platform |
|---|---|---|---|
| F16/FP16 | 16.0 | 100% | all |
| BF16 | 16.0 | 100% | all |
| FP8 | 8.0 | 99% | NVIDIA Hopper+ |
| MXFP8 | 8.0 | 99% | Hopper/Blackwell-class |
| Q8_0 | 8.5 | 99% | all (GGUF) |
| Q6_K | 6.5 | 98% | all (GGUF) |
| Q5_K_M | 5.5 | 97% | all (GGUF) |
| Q5_K_S | 5.4 | 96% | all (GGUF) |
| Q5_0/Q5_1 | 5.5 | 95% | all (GGUF, legacy) |
| Q4_K_M | 4.5 | 94% | all (GGUF) |
| Q4_K_S | 4.3 | 92% | all (GGUF) |
| Q4_0/Q4_1 | 4.3 | 90% | all (GGUF, legacy; QAT target) |
| NVFP4 | 4.0 | 93% | NVIDIA Blackwell (native) |
| MXFP4 | 4.1 | 90% | Blackwell-class |
| INT8 | 8.0 | 98% | broad (vendor stacks) |
| INT4 | 4.0 | 88% | broad (vendor stacks) |
| Q3_K_L | 3.5 | 84% | all (GGUF) |
| Q3_K_M | 3.4 | 82% | all (GGUF) |
| Q3_K_S | 3.3 | 80% | all (GGUF) |
| Q2_K | 2.6 | 70% | all (GGUF) |
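
The ~bits/weight column converts to an approximate weight footprint as parameters × bits / 8. The sketch below illustrates that arithmetic for a few encodings from the table; the function name `approx_weight_gb` and the 7-billion-parameter model are hypothetical, and the estimate ignores KV cache, activations, and container metadata.

```python
# Approximate bits/weight, copied from a subset of the Axis A table.
BITS_PER_WEIGHT = {
    "F16/FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.5,
    "NVFP4": 4.0,
    "Q2_K": 2.6,
}


def approx_weight_gb(n_params: float, encoding: str) -> float:
    """Rough weight size in GB (10^9 bytes): params * bits / 8."""
    return n_params * BITS_PER_WEIGHT[encoding] / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical 7B-parameter dense model, for illustration only.
    for enc in BITS_PER_WEIGHT:
        print(f"{enc:10s} {approx_weight_gb(7e9, enc):6.2f} GB")
```

Because the table values are approximate, the results are order-of-magnitude guides rather than exact file sizes.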
Axis B - runtime / container (pick one)
| Runtime | platform |
|---|---|
| MLX | Apple Silicon only |
| GGUF/llama.cpp | all (Mac/Linux/WSL/CPU) |
| TensorRT/vLLM | NVIDIA GPU (Linux/WSL) |
Axis C - build-time techniques (stack any)
| Technique | effect |
|---|---|
| QAT | recovers quant loss (+1-3% vs PTQ); orthogonal to format |
| MTP | speculative-decode head; ~1.8x TPS where runtime supports |
| MoE/A22B | decode at ACTIVE params; RAM/disk at TOTAL params |
| distill | smaller student; faster+smaller, small quality gap vs teacher |
| DPO | preference alignment; no size/speed change |
| LASER | SVD rank-reduction on select layers; task-dependent quality |
| YaRN/gradient | context extension; costs decode speed + KV at long ctx |
| MatFormer E2B/E4B | nested submodel size; stored size > effective (PLE) |
| abliteration/uncensored | removes refusals; behavior change only |
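
The MoE/A22B row separates two quantities: memory and disk scale with total parameters, while per-token decode work scales with active parameters. A minimal sketch of that split follows; the function name `moe_footprint` and the 235-billion total-parameter figure are illustrative assumptions, and only the 22B active count is implied by the "A22B" stem.

```python
def moe_footprint(total_params: float, active_params: float,
                  bits_per_weight: float) -> dict:
    """Split a MoE model's weight cost into resident vs per-token parts.

    resident_gb:       weights that must sit in RAM/disk (TOTAL params)
    read_per_token_gb: weights touched per decoded token (ACTIVE params)
    """
    bytes_per_weight = bits_per_weight / 8
    return {
        "resident_gb": total_params * bytes_per_weight / 1e9,
        "read_per_token_gb": active_params * bytes_per_weight / 1e9,
    }


if __name__ == "__main__":
    # Hypothetical 235B-total / 22B-active model at ~4.5 bits/weight.
    print(moe_footprint(235e9, 22e9, 4.5))
```

Under these assumptions the model needs the RAM of a large dense model while decoding at roughly the bandwidth cost of a much smaller one.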
Selected cross-stem combination checks
| A | B | result | why |
|---|---|---|---|
| QAT | Q4_K_M | OK (stacks) | orthogonal: stacks |
| QAT | Q4_0/Q4_1 | OK (stacks) | orthogonal: stacks |
| QAT | NVFP4 | OK (stacks) | orthogonal: stacks |
| NVFP4 | MLX | WARN (works, see note) | stored only; Apple has no FP4/FP8 tensor cores (no compute speedup) |
| NVFP4 | GGUF/llama.cpp | OK (stacks) | GGUF carries this encoding |
| NVFP4 | Q4_K_M | X (mutually exclusive) | same axis (encoding): pick one per tensor group |
| MXFP8 | MLX | WARN (works, see note) | stored only; Apple has no FP4/FP8 tensor cores (no compute speedup) |
| FP8 | MLX | X (mutually exclusive) | FP8 tensor-core exec is an NVIDIA path, not MLX |
| Q4_K_M | MLX | WARN (works, see note) | MLX uses its own quant, not GGUF Q*; convert to MLX quant |
| BF16 | MLX | OK (stacks) | MLX runs bf16/f16 (e.g. mlx-bf16) |
| MTP | Q4_K_M | OK (stacks) | orthogonal: stacks |
| MTP | NVFP4 | OK (stacks) | orthogonal: stacks |
| MoE/A22B | Q4_K_M | OK (stacks) | orthogonal: stacks |
| MoE/A22B | NVFP4 | OK (stacks) | orthogonal: stacks |
| distill | Q4_K_M | OK (stacks) | orthogonal: stacks |
| YaRN/gradient | MLX | OK (stacks) | orthogonal: stacks |
| NVFP4 | MXFP4 | X (mutually exclusive) | same axis (encoding): pick one per tensor group |
| Q8_0 | Q4_K_M | X (mutually exclusive) | same axis (encoding): pick one per tensor group |
| FP8 | NVFP4 | X (mutually exclusive) | same axis (encoding): pick one per tensor group |
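
The results above follow a simple pattern: two stems on a "pick one" axis exclude each other, stems on different axes stack, and a few encoding-on-runtime pairs carry explicit exceptions. The sketch below is one way to express that rule. It is not the generator script's actual logic; the names `AXIS`, `OVERRIDES`, and `check` are hypothetical, and the exception list is copied from the rows above.

```python
# Axis membership for a subset of stems (A = encoding, B = runtime,
# C = build-time technique), as classified in this report.
AXIS = {
    "Q4_K_M": "A", "Q8_0": "A", "NVFP4": "A", "MXFP4": "A",
    "MXFP8": "A", "FP8": "A", "BF16": "A",
    "MLX": "B", "GGUF/llama.cpp": "B", "TensorRT/vLLM": "B",
    "QAT": "C", "MTP": "C", "MoE/A22B": "C", "distill": "C",
    "YaRN/gradient": "C",
}

# Pair-specific exceptions taken from the combination table.
OVERRIDES = {
    frozenset({"NVFP4", "MLX"}): "WARN",
    frozenset({"MXFP8", "MLX"}): "WARN",
    frozenset({"Q4_K_M", "MLX"}): "WARN",
    frozenset({"FP8", "MLX"}): "X",
}


def check(a: str, b: str) -> str:
    """Return 'OK', 'WARN', or 'X' for a pair of stems."""
    pair = frozenset({a, b})
    if pair in OVERRIDES:
        return OVERRIDES[pair]
    # Axes A and B are "pick one"; axis C techniques stack freely.
    if AXIS[a] == AXIS[b] and AXIS[a] in ("A", "B"):
        return "X"
    return "OK"


if __name__ == "__main__":
    for a, b in [("QAT", "Q4_K_M"), ("NVFP4", "MLX"), ("Q8_0", "Q4_K_M")]:
        print(f"{a} + {b}: {check(a, b)}")
```

Keeping the exceptions in a separate lookup keeps the general axis rule readable and makes each platform-specific caveat easy to audit against the table.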
Discussion