Gemma-4-12B-it AEON Abliterated — K=4 Biprojection (FP8)
The near-lossless 8-bit FP8 quantization of our K=4 biprojection abliteration of google/gemma-4-12B-it. Matches BF16 capability (MMLU, HumanEval, and IFEval all within noise) at roughly half the size and 1.6× the throughput. Loads in vLLM with `--quantization modelopt`. ~13 GB.
This is the recommended variant when quality matters. For maximum speed and smallest size (with a measured reasoning trade-off), see the NVFP4 sibling.
Refusal behavior has been removed; the model responds to a wide range of prompts the base would decline. Operator-side safety is your responsibility — see the arbitration clause at the bottom.
🚀 QuickStart
Docker (recommended, DGX Spark / Blackwell)
# 1. Download
huggingface-cli download AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-FP8 \
--local-dir ./Gemma-4-12B-AEON-K4-FP8
# 2. Serve
docker run -d --name aeon-gemma12b --gpus all --ipc=host --shm-size=16g --net=host \
-v $(pwd)/Gemma-4-12B-AEON-K4-FP8:/model:ro \
--entrypoint vllm \
ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
serve /model \
--served-model-name gemma12b \
--quantization modelopt \
--kv-cache-dtype fp8_e4m3 \
--max-model-len 8192 \
--max-num-seqs 16 \
--gpu-memory-utilization 0.85 \
--enable-prefix-caching \
--enable-chunked-prefill \
--trust-remote-code
# 3. Call (OpenAI-compatible)
curl -s http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"gemma12b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}' \
| python3 -c "import json,sys; print(json.load(sys.stdin)['choices'][0]['message']['content'])"
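The same call can be made from Python with only the standard library. This is a minimal sketch, assuming the container from step 2 is listening on the default port 8000 and was started with `--served-model-name gemma12b`; the helper names `build_chat_request` and `chat` are illustrative and not part of any library.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM port on the local host
MODEL_NAME = "gemma12b"                # must match --served-model-name from step 2


def build_chat_request(prompt: str, max_tokens: int = 128) -> dict:
    """Build the JSON body for an OpenAI-compatible chat completion."""
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str, max_tokens: int = 128) -> str:
    """POST the request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(prompt, max_tokens)).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` prints the model's reply, mirroring the curl pipeline above.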
Plain vLLM
pip install "vllm>=0.22.2" "nvidia-modelopt>=0.43" "transformers>=5.10"
vllm serve AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-FP8 \
--quantization modelopt --kv-cache-dtype fp8_e4m3 \
--max-model-len 8192 --max-num-seqs 16 \
--gpu-memory-utilization 0.85 --trust-remote-code
⚠️ Needs vLLM ≥ 0.22.2 (for the `Gemma4UnifiedForConditionalGeneration` loader) and an FP8-capable GPU (Hopper H100, or Blackwell GB10 / B100 / B200 / RTX 50-series). The AEON vLLM Ultimate container ships the loader pre-built for DGX Spark `sm_121a`.
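Before launching, it can help to confirm that the installed packages meet the minimums from the `pip install` line. The sketch below is one way to do that with `importlib.metadata`; the version parser is deliberately simplified (it compares leading numeric components only and treats pre-release suffixes as equal to the release), and the helper names are illustrative.

```python
from importlib import metadata

# Minimum versions taken from the pip install line above.
MINIMUMS = {"vllm": "0.22.2", "nvidia-modelopt": "0.43", "transformers": "5.10"}


def parse_version(text: str) -> tuple:
    """Keep the leading numeric components: '0.22.2.post1' -> (0, 22, 2)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Numeric comparison, padding the shorter version with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment() -> dict:
    """Return a per-package status string for the packages listed in MINIMUMS."""
    report = {}
    for package, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "not installed"
            continue
        ok = meets_minimum(installed, minimum)
        report[package] = "ok" if ok else f"{installed} < {minimum}"
    return report
```

Calling `check_environment()` returns a dictionary such as `{"vllm": "ok", ...}`. It does not check GPU capability, which still has to be verified separately.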
That's it. Everything below is detail.
Why FP8 — measured capability vs BF16
All four axes are evaluated through the vLLM serving path (the real thing you'll run), with identical prompts and settings for every model. The eval is full-length: MMLU balanced across all 57 subjects (5 questions each = 285), the full 164-problem HumanEval, and IFEval-50.
| Model | MMLU (285) | HumanEval-syn (164) | HumanEval-fun (164) | IFEval (50) |
|---|---|---|---|---|
| google/gemma-4-12B-it (official base) | 81.4% | 99.4% | 82.9% | 90% |
| K4-BF16 (abliterated, full precision) | 80.4% | 99.4% | 83.5% | 90% |
| K4-FP8 (this) | 80.4% | 99.4% | 85.4% | 90% |
| K4-NVFP4 MLP-only (4-bit sibling) | 76.8% | 96.3% | 76.2% | 90% |
FP8 matches the BF16 model within sampling noise: identical MMLU (80.4%, 229 of 285 for both), identical HumanEval-syntactic, a within-noise difference on HumanEval-functional (85.4% vs 83.5%), and identical IFEval. This is the defining property: in these evaluations, 8-bit FP8 is indistinguishable from BF16. (For reference, the abliteration itself is close to capability-neutral too: K4-BF16 is within ~1pp of Google's official base on every axis.)
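To put "within noise" in numbers, the sketch below estimates how large a gap between two accuracies is relative to its sampling error. It uses the textbook standard error of a difference between two independent proportions, which is an assumption on our part: the models answer the same questions, so a paired test would be more appropriate, and this approximation is only a rough guide.

```python
import math


def difference_se(p1: float, p2: float, n: int) -> float:
    """Standard error of (p1 - p2) for two independent proportions measured on n items each."""
    return math.sqrt(p1 * (1.0 - p1) / n + p2 * (1.0 - p2) / n)


def z_score(p1: float, p2: float, n: int) -> float:
    """Absolute gap between two accuracies, expressed in standard errors."""
    return abs(p1 - p2) / difference_se(p1, p2, n)


# Accuracies copied from the capability table above.
comparisons = {
    "MMLU: base vs K4-FP8": (0.814, 0.804, 285),
    "MMLU: K4-FP8 vs K4-NVFP4": (0.804, 0.768, 285),
    "HumanEval-fun: K4-FP8 vs K4-BF16": (0.854, 0.835, 164),
    "HumanEval-fun: K4-FP8 vs K4-NVFP4": (0.854, 0.762, 164),
}

for label, (p1, p2, n) in comparisons.items():
    print(f"{label}: gap={abs(p1 - p2):.3f}, z={z_score(p1, p2, n):.2f}")
```

Under this approximation, the FP8 vs BF16 and FP8 vs base gaps come out well under one standard error, while the NVFP4 gap on HumanEval-functional is around two.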
Throughput (DGX Spark GB10, FP8 KV cache, greedy)
| Metric | FP8 (this) | BF16 | NVFP4 |
|---|---|---|---|
| Concurrent ×16 aggregate | 235 tok/s | 144 tok/s | 341 tok/s |
| Single-stream overall | 15.8 tok/s | 7.7 tok/s | 23.6 tok/s |
| Single-stream TTFT median | 143 ms | — | 102 ms |
| Size | 13 GB | 24 GB | 8.5 GB |
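FP8 is 1.6× faster than BF16 in aggregate throughput at 16 concurrent requests (235 vs 144 tok/s) and about 2× faster single-stream (15.8 vs 7.7 tok/s), at roughly half the memory (13 GB vs 24 GB). NVFP4 is faster still but trades away some of the reasoning quality shown above. Pick FP8 for quality, NVFP4 for maximum throughput and minimum size.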
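The ratios behind that summary follow directly from the throughput table. A small sketch that recomputes them (the dictionary keys are our own labels; the numbers are copied from the table):

```python
# Numbers copied from the throughput table above (DGX Spark GB10).
THROUGHPUT = {
    "FP8": {"concurrent16_tok_s": 235.0, "single_tok_s": 15.8, "size_gb": 13.0},
    "BF16": {"concurrent16_tok_s": 144.0, "single_tok_s": 7.7, "size_gb": 24.0},
    "NVFP4": {"concurrent16_tok_s": 341.0, "single_tok_s": 23.6, "size_gb": 8.5},
}


def ratio_vs_bf16(variant: str, metric: str) -> float:
    """Value of `metric` for `variant`, divided by the BF16 value."""
    return THROUGHPUT[variant][metric] / THROUGHPUT["BF16"][metric]


for variant in ("FP8", "NVFP4"):
    ratios = {m: round(ratio_vs_bf16(variant, m), 2) for m in THROUGHPUT[variant]}
    print(variant, ratios)
```

For FP8 this gives 1.63× aggregate throughput, 2.05× single-stream throughput, and 0.54× the size of BF16.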
Per-category single-stream (FP8)
| Category | decode tok/s | TPOT |
|---|---|---|
| summary | 19.9 | 51.5 ms |
| prose | 18.2 | 54.9 ms |
| dialogue | 16.4 | 60.6 ms |
| code | 14.6 | 68.6 ms |
| reasoning | 14.3 | 72.3 ms |
| math | 11.4 | 89.4 ms |
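
The two columns are near-reciprocals: TPOT (time per output token) in milliseconds corresponds to roughly `1000 / TPOT` tokens per second. The sketch below checks that relationship against the table. The card does not say how each column was aggregated, so small gaps are expected and the tolerance used here is our own choice.

```python
# (decode tok/s, TPOT in ms) copied from the per-category table above.
PER_CATEGORY = {
    "summary": (19.9, 51.5),
    "prose": (18.2, 54.9),
    "dialogue": (16.4, 60.6),
    "code": (14.6, 68.6),
    "reasoning": (14.3, 72.3),
    "math": (11.4, 89.4),
}


def implied_tok_s(tpot_ms: float) -> float:
    """Decode rate implied by a per-token latency in milliseconds."""
    return 1000.0 / tpot_ms


def relative_gap(reported_tok_s: float, tpot_ms: float) -> float:
    """Relative difference between the reported and the implied decode rate."""
    return abs(implied_tok_s(tpot_ms) - reported_tok_s) / reported_tok_s


for category, (tok_s, tpot_ms) in PER_CATEGORY.items():
    print(f"{category:10s} reported={tok_s:5.1f} implied={implied_tok_s(tpot_ms):5.1f} "
          f"gap={relative_gap(tok_s, tpot_ms):.1%}")
```

Every category agrees to within a few percent, which is consistent with the two columns describing the same decode phase.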
Quantization methodology
| Property | Value |
|---|---|
| Tool | NVIDIA ModelOpt 0.43.0 |
| Config | FP8_PER_CHANNEL_PER_TOKEN_CFG |
| Format | FP8 E4M3, per-channel weight scales + per-token dynamic activation scales |
| Why per-channel/per-token | Gemma-4 attention activations carry large per-channel outliers; per-channel weight + per-token activation scaling absorbs them (per-tensor FP8 would clip them) |
| Calibration | 512 × CNN/DailyMail validation @ 1024 tokens, native sm_121a |
| Model size | ~13 GB (from 23.9 GB BF16 — 46% reduction) |
| Runtime | vLLM --quantization modelopt via Gemma4UnifiedForConditionalGeneration |
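
To illustrate what per-channel weight scales and per-token activation scales mean, here is a simplified pure-Python simulation. It is not ModelOpt's implementation: `fake_e4m3` only rounds to 4 significant bits and clamps to ±448 (the largest finite E4M3 value), ignoring subnormals and the minimum exponent. Each row gets its own scale, where a row stands for one weight output channel or one activation token.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def fake_e4m3(x: float) -> float:
    """Simplified E4M3 rounding: clamp, then keep 1 implicit + 3 mantissa bits."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    mantissa, exponent = math.frexp(x)  # x = mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    return math.ldexp(round(mantissa * 16) / 16, exponent)


def quantize_rows(matrix):
    """One scale per row, chosen so the row's largest magnitude maps to FP8_E4M3_MAX."""
    quantized, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row) or 1.0  # avoid a zero scale for all-zero rows
        scale = amax / FP8_E4M3_MAX
        scales.append(scale)
        quantized.append([fake_e4m3(v / scale) for v in row])
    return quantized, scales


def dequantize_rows(quantized, scales):
    """Undo the scaling to get back approximate original values."""
    return [[v * s for v in row] for row, s in zip(quantized, scales)]


# Two rows with very different magnitudes (illustrative values, not real weights).
weights = [[0.02, -0.01, 0.005], [3.0, -1.5, 0.75]]
q, scales = quantize_rows(weights)
restored = dequantize_rows(q, scales)
```

Because each row has its own scale, the small-magnitude row uses the full FP8 range just like the large one. With a single per-tensor scale, the same small row would occupy only the bottom of the range.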
Kept at full BF16
lm_head, model.language_model.embed_tokens, model.embed_vision*, model.embed_audio*, model.vision_embedder*. All language-stack attention + MLP linears (48 layers) are FP8.
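The list above reads naturally as a set of wildcard patterns. As a sketch, assuming shell-style matching in which `*` matches any suffix, the following helper decides whether a module stays in BF16. The module names used in the example calls are hypothetical and only illustrate the pattern shapes.

```python
from fnmatch import fnmatchcase

# Patterns copied from the list above.
BF16_PATTERNS = [
    "lm_head",
    "model.language_model.embed_tokens",
    "model.embed_vision*",
    "model.embed_audio*",
    "model.vision_embedder*",
]


def stays_bf16(module_name: str) -> bool:
    """True if the module matches any keep-in-BF16 pattern."""
    return any(fnmatchcase(module_name, pattern) for pattern in BF16_PATTERNS)


# Hypothetical module names, for illustration only.
examples = [
    "lm_head",
    "model.vision_embedder.patch_proj",
    "model.language_model.layers.0.self_attn.o_proj",
    "model.language_model.layers.0.mlp.down_proj",
]
for name in examples:
    print(f"{name}: {'BF16' if stays_bf16(name) else 'FP8'}")
```

Anything that does not match a pattern, such as the attention and MLP linears in the language stack, falls through to FP8.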
vLLM loader note (for reproducers)
Google's Gemma-4-12B is the encoder-free Gemma4UnifiedForConditionalGeneration. ModelOpt's HF export needs two touch-ups to load in vLLM: (1) rename the vision keys to vLLM's vision_embedder.* layout, and (2) add model.vision_embedder* to the quant ignore list so the patch embedder stays BF16. Both are scripted in make_vllm_ready.py (gemma4-nvfp4/). Requires vLLM ≥ 0.22.2.
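The two touch-ups can be pictured as a key rename plus an ignore-list update. The sketch below is not `make_vllm_ready.py`: it operates on a plain dictionary and list, and the old key prefix is a placeholder argument because the card does not state the exported layout.

```python
def make_vllm_ready_sketch(state_dict: dict, ignore: list, old_prefix: str,
                           new_prefix: str = "model.vision_embedder."):
    """Rename keys under old_prefix to new_prefix and keep the patch embedder unquantized.

    old_prefix is a caller-supplied placeholder; the real exported prefix is not
    given in this card.
    """
    renamed = {}
    for key, value in state_dict.items():
        if key.startswith(old_prefix):
            key = new_prefix + key[len(old_prefix):]
        renamed[key] = value

    pattern = "model.vision_embedder*"  # pattern named in the note above
    new_ignore = list(ignore)
    if pattern not in new_ignore:
        new_ignore.append(pattern)
    return renamed, new_ignore


# Toy example with a made-up old prefix and made-up tensor names.
toy_weights = {"PLACEHOLDER_OLD.patch.weight": "tensor-a", "lm_head.weight": "tensor-b"}
weights, ignore = make_vllm_ready_sketch(toy_weights, ["lm_head"], old_prefix="PLACEHOLDER_OLD.")
```

The function returns new objects instead of mutating its inputs, so the original export can be compared against the result.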
Abliteration methodology (inherited from the BF16 base)
K=4 multi-direction norm-preserving biprojection (extends TrevorJS's recipe). Basis layers L24/L37/L39/L26 (top-K by SNR), o_proj + mlp.down_proj edited on 24/48 layers, scale=1.0. See the BF16 card for the full biprojection math + capability comparison vs base.
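For intuition only, here is a generic sketch of removing K directions from a vector by orthogonal projection and then rescaling it to its original norm. This is not the biprojection recipe itself, whose math lives in the BF16 card; in particular, which vectors of `o_proj` and `mlp.down_proj` are edited and renormalized is an assumption left open here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def orthonormalize(directions):
    """Gram-Schmidt: turn K directions into an orthonormal basis of their span."""
    basis = []
    for direction in directions:
        v = list(direction)
        for b in basis:
            c = dot(v, b)
            v = [x - c * y for x, y in zip(v, b)]
        n = norm(v)
        if n > 1e-12:  # skip directions already in the span
            basis.append([x / n for x in v])
    return basis


def ablate_vector(vec, basis):
    """Project out the basis directions, then rescale to the original norm."""
    original_norm = norm(vec)
    out = list(vec)
    for b in basis:
        c = dot(out, b)
        out = [x - c * y for x, y in zip(out, b)]
    n = norm(out)
    if n < 1e-12:  # vector lay entirely inside the removed subspace
        return out
    return [x * original_norm / n for x in out]


# Toy example: K=2 directions in a 4-dimensional space (made-up numbers).
directions = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0]]
basis = orthonormalize(directions)
edited = ablate_vector([0.5, -1.0, 2.0, 3.0], basis)
```

After the edit, the vector has no component along either direction but keeps its original length, which is the sense of "norm-preserving" used in this sketch.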
Behavior
- Benign prompts: indistinguishable from BF16 (capability table above confirms it numerically).
- Previously-refused prompts: full responses, usually after a brief disclaimer paragraph.
- Tool calling via `--enable-auto-tool-choice --tool-call-parser gemma4`.
- Multimodal vision path preserved (BF16).
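With the tool-calling flags enabled, requests follow the OpenAI-compatible `tools` format. The sketch below builds such a request and parses tool calls out of a response. The `get_weather` tool, the model name `gemma12b`, and the sample response are all hand-written for illustration; the sample is not actual model output.

```python
import json


def build_tool_request(prompt: str, model: str = "gemma12b") -> dict:
    """Chat request that offers the model one hypothetical function tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool
                    "description": "Look up the current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
        "tool_choice": "auto",
    }


def extract_tool_calls(response: dict) -> list:
    """Return (name, arguments) pairs from a chat completion response."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [(c["function"]["name"], json.loads(c["function"]["arguments"])) for c in calls]


# Hand-written response in the OpenAI-compatible shape, for illustration only.
SAMPLE_RESPONSE = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_0",
                        "type": "function",
                        "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
                    }
                ],
            }
        }
    ]
}
```

The request body can be sent to `/v1/chat/completions` like any other chat request. The caller then runs the named function and returns its result in a follow-up message with role `tool`.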
Available formats
| Variant | Repo | Precision | Size | Pick when |
|---|---|---|---|---|
| FP8 (this) | …-K4-FP8 | FP8 E4M3 | 13 GB | Quality matters: near-lossless, matches BF16 |
| Mixed NVFP4+FP8 | …-K4-NVFP4-FP8 | NVFP4 MLP + FP8 attn | 9.3 GB | Smallest + fastest: MLP-only quality, 20% smaller, 34% faster |
| NVFP4 MLP-only | …-K4-NVFP4 | NVFP4 MLP + BF16 attn | 11.7 GB | Superseded by Mixed NVFP4+FP8 (above) |
| BF16 | …-K4-BF16 | bfloat16 | 24 GB | Fine-tuning, non-Blackwell hardware |
Acknowledgements
TrevorJS (biprojection), p-e-w/heretic (abliteration framework), NVIDIA ModelOpt (FP8 toolkit + Gemma-4 reference recipes), AEON-7 (K-direction extension, FP8 recipe + vLLM loader fixes, capability eval).
License
Inherits the Gemma license.
Arbitration Clause
By accessing, downloading, using, running inference on, fine-tuning, merging, quantizing, distributing, integrating, or otherwise interacting with this model, you acknowledge and agree to the following:
Sole Responsibility. You, the user, are solely and exclusively responsible for (a) every prompt you or your downstream system issue to this model, (b) every response this model produces in reply, (c) every downstream action taken by you, your systems, your agents, or your users in reliance on those responses, and (d) any harm — direct, indirect, consequential, foreseeable, or otherwise — that results from any of the above.
No Warranty. This model is provided strictly "AS IS", without warranty of any kind, express or implied, including but not limited to warranties of merchantability, fitness for a particular purpose, non-infringement, safety, alignment, factual accuracy, or legal compliance in any jurisdiction. No contributor, author, publisher, or hosting platform assumes liability of any kind for outputs or downstream use.
Legal Compliance. You are responsible for ensuring that your use of this model complies with all applicable laws, regulations, terms of service, industry codes of conduct, professional ethical standards, and organizational policies in every jurisdiction in which you operate or in which your outputs may be received. The unaligned nature of this model does not grant you any legal authorization you did not already have.
Operational Safety Layer. An uncensored model is not a toy. You are expected to implement appropriate downstream safety layers proportionate to your deployment context, including but not limited to: input validation, output filtering, content moderation, audit logging, rate limiting, access controls, and human-in-the-loop review for high-risk workflows. A production deployment of this model without such layers is unsafe by construction and is not a supported use case.
Heightened Duty of Care. The absence of internal refusal behavior means the duty of care that would ordinarily rest partly with the model rests entirely with you. You are expected to exercise greater, not lesser, caution, forethought, and ethical discipline when operating this model than you would when operating an aligned base model. If you are uncertain whether your contemplated use is ethical, legal, or wise, the correct action is to not make the request.
No Endorsement of Outputs. The authors, contributors, and publishers of this model do not endorse, adopt, or take responsibility for any specific output this model produces. Outputs are a stochastic function of the prompt, the weights, and the sampler state — not a statement of position by any human.
Arbitration. Any dispute, claim, or controversy arising out of or relating to the use of this model, its outputs, or this clause shall be resolved through binding individual arbitration under the rules of a mutually agreed arbitration body (or, absent agreement, the American Arbitration Association's Consumer Arbitration Rules), waiving any right to a jury trial, class action, representative action, or consolidated proceeding. Venue shall be the jurisdiction of the disputing party bringing the claim. Costs and attorneys' fees shall be allocated per the applicable arbitration rules. This clause does not expand, and where legally prohibited does not establish, any liability in the other direction; it limits how the user may proceed when alleging harm tied to their own use of this model.
Indemnification. You agree to indemnify, defend, and hold harmless the authors, contributors, and publishers of this model from and against any claims, damages, losses, liabilities, costs, and expenses (including reasonable attorneys' fees) arising from or related to your use of the model or your breach of this clause.
Severability. If any provision of this clause is held unenforceable in a given jurisdiction, the remaining provisions remain in full force in that jurisdiction, and the unenforceable provision is replaced by the closest enforceable equivalent consistent with the original intent.
Acceptance. Your use of this model constitutes your acceptance of this clause in full. If you do not accept, do not use the model.
This model is a tool with no opinions of its own. You supply the opinions. You supply the judgement. You supply the ethics. The outputs carry your fingerprints, not the model's.
☕ Support the work
If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.
- ₿ Bitcoin (BTC): bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4
- Ξ Ethereum (ETH): 0x1512667F6D61454ad531d2E45C0a5d1fd82D0500
- ◎ Solana (SOL): DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t
- ⓜ Monero (XMR): 836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd
Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.