Gemma 4 A4B 98-Expert v5-coder (20.8B) — code-leaning prune

A research checkpoint that takes 98e v4 and replaces its drop map with C6 layer-relevance-weighted v4-floor breadth=50 — a recipe that protects code/math experts more tightly per-layer than v4's multi-class CD-max. No shared FFN scaling. Same 98e shape, same router, same attention, same norms.

Quantized formats

| Format | Repo | Notes |
| --- | --- | --- |
| GGUF (llama.cpp / ollama) | ManniX-ITA/gemma-4-A4B-98e-v5-coder-it-GGUF | Full Bartowski tier sweep (Q2_K → Q8_0, IQ2-IQ4) + 5 ContribDynamic CD-* per-layer quants. F16 baseline included. |
| NVFP4A16 (vLLM) | ManniX-ITA/gemma-4-A4B-98e-v5-coder-NVFP4A16 | ~13 GB, native vLLM, produced via modelopt==0.43.0. |
| Ollama | mannix/gemma4-98e-v5-coder | Same GGUF tier sweep, ready for ollama pull. |

| | 98e v4 | 98e v5-coder (this model) |
| --- | --- | --- |
| Total params | 20.8B | ~20.8B |
| Experts per layer | 98 (30 dropped) | 98 (30 dropped) |
| Drop map | multi-class CD-map (max), p16 | C6 layer-relevance-weighted v4-floor, breadth=50 |
| Shared FFN α | 1.0 | 1.0 (none) |

Eval status: complete (9/9). ARC-Challenge was rescored on 2026-05-18 on stack-pinned solidpc (stock vLLM 0.20.2 + Fix-A patched lm-eval) and came in at 95.31 % ±0.62 pp, which retires the earlier ⚠ flag caused by the silent-empty Fix-A pathology. The original 12.37 % score came from a run in which 87.6 % of responses had content="", because lm-eval's stock openai_completions.parse_generations did not fall back to reasoning_content.

Scoreboard — NVFP4A16, vLLM, greedy

NVFP4A16 quant via nvidia-modelopt 0.43.0, served via vLLM 0.20.2 with --reasoning-parser gemma4, enable_thinking=true, thinking_token_budget=12288. Sampler: greedy (T=0, top_p=1, top_k=0, do_sample=false) — the canonical Gemma 4 9-bench recipe.

| Bench (n) | 128e ref | 98e v4 | 98e v5-coder | Δ (v5-coder − v4) |
| --- | --- | --- | --- | --- |
| ARC-Challenge-chat (1172) | 95.99% | 95.99% | 95.31% | −0.68 |
| GPQA Diamond flex (198) | 73.23% | 69.19% | 68.69% | −0.50 |
| GSM8K-100 flex | 91.00% | 86.00% | 86.00% | 0.00 |
| MATH-500-100 math_verify | 89.00% | 89.00% | 92.00% | +3.00 |
| AIME 2024 (30) | 36.67% | 36.67% | 36.67% | 0.00 |
| IFEval-100 (prompt_strict) | 95.00% | 93.00% | 94.00% | +1.00 |
| HumanEval-164 chat | 96.95% | 96.95% | 98.17% | +1.22 |
| HumanEval+-164 chat | 92.07% | 91.46% | 92.68% | +1.22 |
| LCB-medium-55 v4 | 87.27% | 78.18% | 85.45% (47/55) | +7.27 |

Reading the deltas: v5-coder is a deliberate code-leaning rewrite of v4's drop ranking. The C6 drop map protects the per-layer code-relevance signal harder than v4's CD-max aggregation does. That shows up as +1.22 / +1.22 / +7.27 on the three code benches (HE / HE+ / LCB-medium), with MATH-500 also recovering +3.00pp (math-on-text is correlated with code reasoning more than v4's drop assumed). Reasoning and general-knowledge benches are essentially flat: GPQA −0.50pp, GSM8K 0.00, AIME 0.00, IFEval +1.00. The big win is LCB-medium at +7.27pp, i.e. 4 more problems solved (47 vs 43 of 55). That is outside the ±2pp (roughly one problem) run-to-run noise floor, but the binomial standard error on a 55-problem bench is close to ±5pp, so read it as matching the recipe's design intent (preserve code-specialist experts at little measurable cost elsewhere) rather than as a precisely measured effect size.
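To put the deltas in context, here is a minimal sketch of the sampling error on a single greedy run, assuming a plain binomial model with a normal approximation. The function name is illustrative and not part of the eval toolkit; the counts are the ones implied by the scoreboard percentages.

```python
import math

def binomial_stderr_pp(correct: int, total: int) -> float:
    """Standard error of a pass rate, in percentage points (normal approximation)."""
    p = correct / total
    return 100.0 * math.sqrt(p * (1.0 - p) / total)

# Counts implied by the scoreboard: 85.45% of 55 is 47, 78.18% of 55 is 43.
lcb_stderr_pp = binomial_stderr_pp(47, 55)
lcb_delta_pp = 100.0 * (47 - 43) / 55

# ARC-Challenge: 95.31% of 1172 is 1117 correct answers.
arc_stderr_pp = binomial_stderr_pp(1117, 1172)

print(f"LCB-medium-55: delta {lcb_delta_pp:+.2f} pp, stderr ±{lcb_stderr_pp:.2f} pp")
print(f"ARC-Challenge: stderr ±{arc_stderr_pp:.2f} pp")
```

The ARC figure reproduces the ±0.62 pp quoted in the eval status note; the much larger LCB figure is a reminder that small benches move in steps of one problem (about 1.8 pp here).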

ARC's prior −83.6pp gap (12.37% vs 95.99%) was not a v5-coder regression. It was the silent-empty Fix-A bug on the unpatched pod that ran it. 87.6% of the 1,172 ARC samples came back with empty content because vLLM 0.20.2 + Gemma 4 + reasoning-parser routes the answer to reasoning_content when the closing channel token isn't seen, and lm-eval's stock parse_generations reads content only. The model itself was fine; the eval harness wasn't patched. The stack-pinned rescore on 2026-05-18 landed at 95.31 %, inside the predicted 95–97 % band and about one standard error (0.68pp against ±0.62pp) below 128e (95.99 %) and v4 (95.99 %).

HumanEval / HumanEval+ sanity audit

98.17% / 92.68% sits at the top of the 14–22B band (see lazy comparison below), so the samples files were re-audited to rule out a scoring artifact. Audit script: scripts/audit_v5coder_he.py.

| Bench | n | score | empty | fenced | chars p10/p50/p90 | verbatim-canonical-in-gen |
| --- | --- | --- | --- | --- | --- | --- |
| HumanEval-164 | 164 | 0.9817 | 1 | 163 | 270 / 642 / 1324 | 3.0% (5/164) |
| HumanEval+-164 | 164 | 0.9268 | 3 | 161 | 270 / 620 / 1244 | 1.8% (3/164) |
  • Fences are stripped correctly. 163/164 (HE) and 161/164 (HE+) outputs are wrapped in python-tagged triple-backtick chat fences. lm-eval's chat-aware HE/HE+ scorer (built on the humaneval_chat shadow task; see [v4 card §Eval Caveat](https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v4-it)) extracts the function body before exec(). If fences were leaking through, pass@1 would collapse to ~0 (per [feedback_gemma4_chat_only_completions_breaks.md](https://github.com/mann1x/omnimergekit/blob/main/memory/feedback_gemma4_chat_only_completions_breaks.md)); 98.17% is only possible with correct stripping.
  • Empty-response rate is normal. 1/164 HE and 3/164 HE+ are blank — within Gemma 4 reasoning-mode noise; not the 87.6%-empty pathology that hit ARC.
  • No catastrophic contamination. Only 3.0% (HE) and 1.8% (HE+) of generations contain the canonical solution as a verbatim substring. A model that had memorized HE from pretraining would show 30%+; the few verbatim matches here are short structurally-inevitable solutions (e.g. has_close_elements O(n²) double-loop).
  • HE → HE+ delta is healthy. The score drops 5.49pp across the +/− boundary. HE+ adds adversarial test cases that catch brittle solutions which pass the public tests but fail edge cases. A 0pp drop would actually be a memorization red flag; ~5pp is the expected drop for a strong-but-not-memorized model.
  • Failures look real. The 3 HE failures are 1 empty (doc_id 122) plus 2 wrong-logic attempts (doc_id 140 fix_spaces, doc_id 145 order_by_points). The 12 HE+ failures are mostly "passes basic tests, fails edge cases" — exactly the regime HE+ exists to expose.

Conclusion: 98.17 / 92.68 is real. Not a scoring artifact, not memorization, not silent-empty.
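The checks above can be approximated with a few lines of Python. This is a hedged sketch and not the contents of scripts/audit_v5coder_he.py: the sample layout (dicts with `generation` and `canonical_solution` keys) and the function name are assumptions made for illustration.

```python
import re

FENCE = "`" * 3  # built programmatically so this example nests cleanly in markdown
FENCE_RE = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)

def audit_generations(samples):
    """Count empty, fenced and verbatim-canonical generations.

    samples: list of dicts with hypothetical keys
             'generation' and 'canonical_solution'.
    """
    empty = fenced = verbatim = 0
    for sample in samples:
        gen = sample["generation"]
        if not gen.strip():
            empty += 1
            continue
        if FENCE_RE.search(gen):
            fenced += 1
        canonical = sample["canonical_solution"].strip()
        if canonical and canonical in gen:
            verbatim += 1
    return {"n": len(samples), "empty": empty, "fenced": fenced, "verbatim": verbatim}

# Three toy samples: one verbatim match, one empty response, one original solution.
demo = [
    {"generation": f"{FENCE}python\ndef add(a, b):\n    return a + b\n{FENCE}",
     "canonical_solution": "    return a + b\n"},
    {"generation": "", "canonical_solution": "    return 0\n"},
    {"generation": f"{FENCE}python\ndef neg(x):\n    return 0 - x\n{FENCE}",
     "canonical_solution": "    return -x\n"},
]
report = audit_generations(demo)
print(report)
```

A substring match is a blunt contamination signal: it flags short, structurally inevitable solutions too, which is why the verbatim rate is read against an expected baseline instead of being treated as zero-tolerance.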

Lazy comparison vs the 14–22B coder field

For a sense of scale on whether 98.17 is anomalous: all numbers are from official model cards, papers, or blog posts (linked).

Coder-specialized 14–22B:

| Model | Params | HE | HE+ | LCB (version) | Source |
| --- | --- | --- | --- | --- | --- |
| 98e v5-coder (this) | 20.8B / 4B MoE | 98.17 | 92.68 | 85.45 (LCB-medium-55 v4) | this card |
| Qwen2.5-Coder-14B-Instruct | 14.7B dense | 89.6 | 87.2 | 23.4 (LCB 07/24–11/24, pre-v4) | arXiv:2409.12186 |
| DeepSeek-Coder-V2-Lite-Instruct | 16B / 2.4B MoE | 81.1 | - | 24.3 (LCB 12/01–06/01) | arXiv:2406.11931 |
| Codestral-22B v1 | 22B dense | 81.1 | - | (not published) | Mistral blog |
| IBM Granite-20B-Code-Instruct | 20B dense | 60.4 | - | - | arXiv:2405.04324 |

Generalist 14–22B (notable code scores):

| Model | Params | HE | MATH | GPQA-D | IFEval | Source |
| --- | --- | --- | --- | --- | --- | --- |
| 98e v5-coder (this) | 20.8B / 4B MoE | 98.17 | 92.00 (MATH-500) | 68.69 | 94.00 | this card |
| Phi-4 | 14B dense | 82.6 | 80.4 (MATH) | 56.1 | 63.0 | arXiv:2412.08905 |
| Qwen2.5-14B-Instruct | 14.7B dense | 81.7–86.2 | 73.0 (MATH) | 40.9 | 80.0 | Qwen blog |
| Mistral-Small-3 (24B, just above band) | 24B dense | ~84 | 70.6 (MATH) | 45.3 | 82.1 | Mistral blog |

Where v5-coder sits:

  • HE / HE+: top of the band, +8.6pp / +5.5pp above Qwen2.5-Coder-14B's 89.6 / 87.2 (the published field leader). The audit above rules out scoring artifacts; the gap is real on this run.
  • LCB: not apples-to-apples with Qwen2.5-Coder or DS-Coder-V2-Lite. Those numbers are full LCB on pre-v4 problem windows (LCB-2024.07–11 and LCB-2024.12–06.01 respectively). v5-coder's 85.45% is LCB-medium-55 on v4 problems — a different subset and a different problem set. A fair comparison would require running Qwen2.5-Coder-14B on the same LCB-medium-55 v4 split, which nobody has published. Don't read +60pp into the LCB column.
  • MATH-500 92.00 / GSM8K 86 / AIME 36.67: top of the band for math-on-text reasoning. Phi-4's 80.4 MATH is the closest generalist; v5-coder beats it by ~12pp. AIME 36.67 is currently the only published 14–22B AIME score in this comparison set (Qwen2.5-Coder and Codestral don't evaluate AIME).
  • GPQA-Diamond 68.69 / IFEval 94.00: GPQA is materially above Phi-4 (56.1) and Qwen2.5-14B (40.9). IFEval 94 is ~12pp above Mistral-Small-3 (82.1) and ~31pp above Phi-4 (63.0); instruction following is Phi-4's known weakness.

Caveats on this comparison: different labs use different system prompts, different temperature/top_p, different "chat vs base" framings, different sampling counts. v5-coder is run greedy (T=0); some published numbers (e.g. Phi-4) use multi-sample averaging. Within-card deltas (v4 vs v5-coder) are the cleanest signal; cross-card deltas are noisy by ±2-5pp.

Same-Stack GGUF HE+ Sweep — v5-coder vs Qwen2.5-Coder-14B-Instruct

Head-to-head HumanEval+ (164-question, chat-aware shadow task) on identical hardware (single RTX 3090 24 GB) and identical eval recipe (llama-server -c 32768 -ngl 99 --parallel 2 --jinja --reasoning off, omk_eval llama backend, lm-eval humaneval_plus_chat, greedy T=0, max_gen_toks=16384). Qwen GGUFs are bartowski's Qwen2.5-Coder-14B-Instruct-GGUF.

The "Lazy comparison" table above uses paper-reported numbers; this section is what the same rig and same scorer actually measure.

v5-coder (20.8B total / 4B-active MoE) — plain quants

| Tier | File size | bpw | HE+ pass@1 |
| --- | --- | --- | --- |
| Q2_K | 8.40 GB | 3.23 | 6.10% (collapse) |
| Q3_K_M | 10.51 GB | 4.04 | 84.15% |
| Q4_K_M | 13.24 GB | 5.09 | 92.07% |
| Q5_K_M | 15.07 GB | 5.80 | 90.85% |
| Q6_K | 17.81 GB | 6.85 | 92.07% |

Q4_K_M is the recommended sweet spot. Q3_K_M loses ~8pp but is still usable; Q2_K collapses (an MoE-class artifact, not a v5-coder regression — plain Q2_K bytes are the cohort floor).

Qwen2.5-Coder-14B-Instruct (14.7B dense) — bartowski quants

| Tier | File size | bpw | HE+ pass@1 |
| --- | --- | --- | --- |
| IQ4_XS | 8.12 GB | 4.42 | 84.76% |
| Q4_0 | 8.54 GB | 4.65 | 84.15% |
| Q4_K_M | 8.99 GB | 4.89 | 85.37% |
| Q5_K_M | 10.51 GB | 5.72 | 83.54% |
| Q6_K | 12.12 GB | 6.60 | 84.76% |
| Q8_0 | 15.70 GB | 8.54 | 84.76% |

Qwen sits at 83–85% across the whole tier ladder. The paper-reported 87.2 HE+ is ~2–4pp above what bartowski's GGUFs deliver on this stack (85.37% at best, 83.54% at worst), a known llama-server chat-template vs vLLM-temp=0 quirk, not a quant defect.

Head-to-head by file size (v5-coder runs lower bpw at the same disk)

Pairing by tier name is misleading here: v5-coder is a 20.8B-total MoE and Qwen is a 14.7B dense model, so the same tier name maps to different file sizes. The fair comparison is iso-disk: at a given GB budget, which model wins HE+? In every band from ~10.5 GB up, v5-coder uses roughly 1.5–2.7 bpw less than Qwen and still scores higher; only in the ~9 GB band, where v5-coder has to fall back to Q2_K, does Qwen win.

| Disk band | Qwen2.5-Coder-14B (size / bpw / HE+) | v5-coder (size / bpw / HE+) | Δ HE+ |
| --- | --- | --- | --- |
| ~15 GB | Q8_0 15.70 GB / 8.54 / 84.76% | Q5_K_M 15.07 GB / 5.80 / 90.85% | +6.09 |
| ~12–13 GB | Q6_K 12.12 GB / 6.60 / 84.76% | Q4_K_M 13.24 GB / 5.09 / 92.07% | +7.31 |
| ~10.5 GB | Q5_K_M 10.51 GB / 5.72 / 83.54% | Q3_K_M 10.51 GB / 4.04 / 84.15% | +0.61 |
| ~9 GB | Q4_K_M 8.99 GB / 4.89 / 85.37% | Q2_K 8.40 GB / 3.23 / 6.10% (collapse) | −79.27 |

The first three rows are the practical story: at ~15 GB Qwen's near-lossless Q8_0 loses 6pp to v5-coder Q5_K_M; at ~13 GB v5-coder Q4_K_M is +7.3pp over Qwen Q6_K; at ~10.5 GB even v5-coder Q3_K_M edges out Qwen Q5_K_M while running at 1.7 bpw less. The 4th row marks the floor — sub-Q3 MoE quants collapse, so the v5-coder ladder bottoms out at Q3_K_M / ~10 GB.

Pure tier-name matching (Qwen Q4_K_M vs v5-coder Q4_K_M etc.) would put v5-coder ~4–6 GB larger at every tier and ~+7pp ahead. That comparison is symmetric but unfair to Qwen's smaller footprint. The iso-disk view above is the one to plan VRAM around.
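The bpw column and the iso-disk reasoning can be reproduced from the file sizes alone. The sketch below assumes GB means 10^9 bytes and uses the total parameter counts quoted on this card (20.8B and 14.7B); the helper names are illustrative. It picks the best quant that fits under a budget, a slightly different rule from the nearest-size pairing used in the table.

```python
def bits_per_weight(file_size_gb: float, total_params_b: float) -> float:
    """Average bits per weight: file bytes * 8 / parameter count (GB = 1e9 bytes)."""
    return file_size_gb * 8.0 / total_params_b

# (tier, file size in GB, HE+ pass@1 in %) copied from the sweep tables.
V5_CODER = [("Q2_K", 8.40, 6.10), ("Q3_K_M", 10.51, 84.15), ("Q4_K_M", 13.24, 92.07),
            ("Q5_K_M", 15.07, 90.85), ("Q6_K", 17.81, 92.07)]
QWEN = [("IQ4_XS", 8.12, 84.76), ("Q4_0", 8.54, 84.15), ("Q4_K_M", 8.99, 85.37),
        ("Q5_K_M", 10.51, 83.54), ("Q6_K", 12.12, 84.76), ("Q8_0", 15.70, 84.76)]

def best_under_budget(quants, budget_gb):
    """Highest-scoring quant whose file fits the disk budget, or None if none fits."""
    fitting = [q for q in quants if q[1] <= budget_gb]
    return max(fitting, key=lambda q: q[2]) if fitting else None

for budget in (9.0, 11.0, 14.0, 16.0):
    print(budget, best_under_budget(V5_CODER, budget), best_under_budget(QWEN, budget))

print(round(bits_per_weight(13.24, 20.8), 2))  # v5-coder Q4_K_M
```

File size is only a lower bound on VRAM: the 32768-token context used in the sweep adds KV-cache memory on top of the weights.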

CD-* (ContribDynamic) tiers are intentionally omitted from this table. They are mid-rebuild after a 2026-05-19 patch closed a --tensor-type-file heuristic gap and will be added once the rebuilt CD scores are confirmed.

Run logs and samples live under eval_results_hep_sweep/humanevalplus_full/ in the project tree.

What changed vs v4 (mechanical detail)

Identical surgery flow to v4 with one substitution — a different drop map.

  1. Same base: google/gemma-4-26B-A4B-it (128e).
  2. Same drop count: 30 experts per layer (98 retained).
  3. Same protect_top=16 shield.
  4. Different ranking signal: instead of score[layer][expert] = max over normalized classes (wnorm·α + tc) (v4), v5-coder scores each expert by a layer-relevance-weighted floor against v4's keep set, with breadth=50 controlling how many top-relevance experts get the floor lift before the bottom-30 cutoff is taken. The recipe scripts live in omnimergekit (T25 / T28 / T30 / C6 series — see feedback_* memory for the ablation history).
  5. Same downstream: slice the expert tensors, resize the MoE router proj.weight from [128, hidden] → [98, hidden], and set num_experts=98 in config.json. The GGUF conversion + quant pipeline is unchanged.

No shared FFN scaling (verified: layers.0.mlp.down_proj.weight is byte-for-byte identical to v4 in BF16).
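The mechanical part of the flow (steps 2, 3 and 5) can be sketched in plain Python. This does not implement the C6 scoring itself: the scores are synthetic, the shield is assumed to be the top 16 by the same score, and the function names are hypothetical stand-ins for whatever scripts/expert_drop.py actually does.

```python
def select_kept_experts(scores, protected, n_drop=30):
    """Return sorted indices of the experts to keep for one layer.

    scores:    per-expert ranking scores (higher = more worth keeping)
    protected: indices that must never be dropped (the protect_top shield)
    """
    droppable = [i for i in range(len(scores)) if i not in protected]
    # Drop the n_drop lowest-scoring experts outside the shield.
    dropped = set(sorted(droppable, key=lambda i: scores[i])[:n_drop])
    return [i for i in range(len(scores)) if i not in dropped]

def slice_rows(matrix, keep):
    """Keep only the listed rows, e.g. router weight [128, hidden] -> [98, hidden]."""
    return [matrix[i] for i in keep]

# Toy layer: 128 experts, hidden size 4, synthetic distinct scores.
n_experts, hidden = 128, 4
scores = [((i * 37) % 128) / 128 for i in range(n_experts)]
# Assumption for this sketch: the shield is the top 16 experts by the same score.
protected = set(sorted(range(n_experts), key=lambda i: scores[i], reverse=True)[:16])
router = [[float(i)] * hidden for i in range(n_experts)]

keep = select_kept_experts(scores, protected)
new_router = slice_rows(router, keep)
print(len(new_router), len(new_router[0]))
```

The same `keep` list has to be applied to the expert tensors of that layer as well, so that router row k still addresses expert k after the slice.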

When to pick which 98e variant

| Variant | Lean | Pick when |
| --- | --- | --- |
| v3 | pooled TF (no class signal) | reference baseline; you want the original v3 numbers |
| v4 | balanced (5-class CD-map) | general-purpose; first 98e you'd default to |
| v5 | v4 + shared FFN α=1.2 | when you want v4 with a louder expert-mixture residual (research checkpoint) |
| v5-coder (this) | code-leaning C6 floor | HumanEval / HumanEval+ / LCB / MultiPL-E workloads; +7pp on LCB-medium |

Usage

Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "ManniX-ITA/gemma-4-A4B-98e-v5-coder-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # Gemma 4 head_dim=512: FA2 not supported
)
tok = AutoTokenizer.from_pretrained("ManniX-ITA/gemma-4-A4B-98e-v5-coder-it")

msgs = [{"role": "user", "content": "Write a Python function that reverses a binary tree in-place."}]
inputs = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to(model.device)
out = model.generate(inputs, max_new_tokens=2048, do_sample=False)  # greedy, canonical recipe
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```

vLLM (NVFP4A16, canonical eval recipe)

```bash
python -m vllm.entrypoints.openai.api_server \
    --model ManniX-ITA/gemma-4-A4B-98e-v5-coder-NVFP4A16 \
    --served-model-name 98e_v5_coder_nvfp4a16 \
    --port 8099 \
    --gpu-memory-utilization 0.55 \
    --max-model-len 32768 \
    --max-num-batched-tokens 8192 \
    --dtype bfloat16 \
    --trust-remote-code \
    --reasoning-parser gemma4 \
    --default-chat-template-kwargs '{"enable_thinking": true}'
```
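Once the server is up, it can be queried through the OpenAI-compatible chat-completions endpoint. The sketch below uses only the standard library; the port and served model name come from the command above, while the helper names and the `max_tokens` default (borrowed from the GGUF sweep's max_gen_toks=16384) are assumptions.

```python
import json
import urllib.request

def build_chat_request(prompt, model="98e_v5_coder_nvfp4a16", max_tokens=16384):
    """Greedy chat-completions payload matching the sampler settings on this card."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "top_p": 1,
        "max_tokens": max_tokens,
    }

def post_chat(payload, base_url="http://localhost:8099/v1"):
    """POST the payload to the OpenAI-compatible endpoint and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("Write a Python function that reverses a linked list.")
print(json.dumps(payload, indent=2))
# response = post_chat(payload)  # requires the server above to be running
```

When reading the response, check both `content` and `reasoning_content` on the returned message; the Eval Caveat section explains why the answer can land in the latter.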

llama.cpp (GGUF)

```bash
llama-server -m gemma-4-A4B-98e-v5-coder-it-Q6_K.gguf \
    --port 8099 -c 32768 -ngl 99 --no-warmup \
    --jinja --reasoning-format deepseek --reasoning-budget 12288 \
    --temp 0 --top-p 1 --top-k 0
```

GGUF quants: ManniX-ITA/gemma-4-A4B-98e-v5-coder-it-GGUF — full Bartowski tier sweep (Q2_K → Q8_0, IQ2-IQ4) plus 5 ContribDynamic CD-* per-layer quants (CD-Q2_K, CD-Q3_K_M, CD-Q4_K_M, CD-Q5_K_M, CD-Q6_K). File naming: gemma-4-A4B-98e-v5-coder-it-<TIER>.gguf.

ollama

```bash
ollama pull mannix/gemma4-98e-v5-coder:Q6_K
ollama run mannix/gemma4-98e-v5-coder:Q6_K
```

Available tiers: mannix/gemma4-98e-v5-coder, same set as the GGUF repo (Q2_K → Q8_0, IQ2_* → IQ4_*, CD-Q2_K → CD-Q6_K). The Modelfile uses the Gemma 4 tool/parser template (matches the mannix/gemma4-98e-v4 convention).

Related Models

| Model | Description |
| --- | --- |
| gemma-4-A4B-98e-v5-coder-NVFP4A16 | NVFP4A16 quant (~13 GB, vLLM-ready) |
| gemma-4-A4B-98e-v5-coder-it-GGUF | GGUF tier sweep + CD per-layer quants (llama.cpp / ollama) |
| mannix/gemma4-98e-v5-coder (Ollama) | Ollama-published version of the GGUF tier sweep |
| gemma-4-A4B-98e-v4-it | The apples-to-apples baseline for this model |
| gemma-4-A4B-98e-v5-it | Sibling: v4 + shared FFN α=1.2 |
| gemma-4-A4B-98e-v3-it | Earlier baseline (pooled TF map) |

Recipe + Code

OmniMergeKit is the canonical home. The relevant artifacts for this model:

  • scripts/v5coder_C6_v4floor_perlayer_breadth50_drop_map.json — the drop map (embedded in expert_drop_metadata.json).
  • scripts/expert_drop.py — drop applier (unchanged across v3/v4/v5/v5-coder).
  • eval/EVAL_PROTOCOL.md — locked greedy methodology for the 9-bench suite, including the mandatory Fix-A patch for lm-eval's openai_completions.parse_generations (without it, ARC and other chat-completions benches silent-empty under Gemma 4 + reasoning-parser).

Eval Caveat — Fix-A is mandatory

The original ARC-Challenge run for this model was evaluated on a pod whose lm-eval install was missing the Fix-A reasoning_content fallback patch in openai_completions.parse_generations. Under vLLM 0.20.2 + --reasoning-parser gemma4, the answer is routed to the message's reasoning_content field and content is left as "" whenever the parser doesn't see the closing channel token. Without Fix-A, lm-eval reads only content and scores those responses as empty (= wrong on multiple-choice tasks). On ARC-Challenge this produced 1027 empty / 1172 total, leaving 12.37% accuracy. On the other 8 benches, the silent-empty rate stayed below 10% (because the prompt templates land in a regime where the model emits a content phase reliably), so their scores are within the canonical band.
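The idea behind Fix-A can be illustrated in a few lines. This is a sketch of the fallback logic only, not the actual patch to lm-eval: the function names and the exact response layout are assumptions based on the OpenAI-style chat-completions format.

```python
def extract_answer(message: dict) -> str:
    """Return the text to score: `content`, else `reasoning_content`, else ''."""
    content = message.get("content") or ""
    if content.strip():
        return content
    # Fallback: the reasoning parser routed the answer here and left content empty.
    return message.get("reasoning_content") or ""

def parse_generations_with_fallback(response: dict) -> list:
    """Collect one answer string per choice from a chat-completions response."""
    return [extract_answer(choice.get("message", {}))
            for choice in response.get("choices", [])]

# Toy response: the first choice is a silent-empty case, the second is normal.
demo_response = {
    "choices": [
        {"message": {"content": "", "reasoning_content": "The answer is B."}},
        {"message": {"content": "C", "reasoning_content": "thinking..."}},
    ]
}
answers = parse_generations_with_fallback(demo_response)
print(answers)
```

A content-only reader would score the first choice as empty, which is the failure mode that produced the 12.37% ARC result.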

The lesson is captured permanently in omnimergekit/eval/EVAL_PROTOCOL.md and the canonical pod bootstrap (pod_setup_eval_envs.sh) auto-applies Fix-A — every new eval pod now starts in the patched state.

License

This model inherits the Gemma license from the base model.

Acknowledgements

  • Google for the base Gemma 4 26B-A4B-it model
  • The OmniMergeKit project for the surgery + eval toolkit
  • The vLLM and modelopt teams for the NVFP4A16 serving / quantization pipeline
  • bartowski for the calibration data v5 used in imatrix GGUF quantization