Daemontatox/Zirel-3
Model Description
Zirel-3 is a specialized finetune of cerebras/GLM-4.5-Air-REAP-82B-A12B, a memory-efficient Mixture-of-Experts (MoE) model with 82B total parameters (about 12B active per token), compressed using the REAP (Router-weighted Expert Activation Pruning) technique.
Base Model: GLM-4.5-Air-REAP-82B-A12B
The base model is a compressed variant of GLM-4.5-Air that:
- Maintains near-identical performance while being roughly 25% lighter (compressed from 106B to 82B total parameters)
- Uses 82B total parameters (~12B activated per token)
- Employs the REAP pruning method which outperforms expert merging, especially on generative tasks
- Retains full capabilities: code generation, agentic workflows, repository-scale understanding, and function calling
- Achieves drop-in compatibility with vanilla vLLM (no custom patches required)
REAP Technology
REAP (Router-weighted Expert Activation Pruning) is a one-shot MoE compression method that:
- Prunes low-impact experts based on router gate values and expert activation norms
- Preserves the router's independent control over remaining experts
- Significantly outperforms expert merging on generative benchmarks (code, creative writing, math)
- Maintains 95-97% of baseline model quality even at high compression ratios
Paper: REAP the Experts: Why Pruning Prevails for One-Shot MoE compression (Lasby et al., 2025)
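The pruning criterion described above can be illustrated with a toy sketch. It assumes the saliency of an expert is the average, over the tokens routed to it, of the router gate value multiplied by the L2 norm of the expert's output; the exact scoring in the paper may differ in detail. The function names `reap_saliency` and `experts_to_prune` and the toy numbers are hypothetical and are not taken from the REAP code base.

```python
import math


def reap_saliency(gate_values, expert_outputs):
    """Score each expert by mean(gate * ||expert output||_2) over its routed tokens.

    gate_values[e]    -> list of router gate weights for tokens routed to expert e
    expert_outputs[e] -> list of output vectors of expert e for the same tokens
    """
    scores = {}
    for expert, gates in gate_values.items():
        outputs = expert_outputs[expert]
        if not gates:
            # An expert that never fires has no measurable impact.
            scores[expert] = 0.0
            continue
        total = sum(
            gate * math.sqrt(sum(v * v for v in out))
            for gate, out in zip(gates, outputs)
        )
        scores[expert] = total / len(gates)
    return scores


def experts_to_prune(scores, prune_fraction):
    """Return the lowest-scoring experts, one-shot, without retraining."""
    n_prune = int(len(scores) * prune_fraction)
    ranked = sorted(scores, key=scores.get)  # ascending saliency
    return ranked[:n_prune]


# Toy calibration statistics for a layer with four experts (made-up numbers).
gate_values = {0: [0.9, 0.8], 1: [0.1, 0.2], 2: [0.5], 3: [0.6, 0.4]}
expert_outputs = {
    0: [[3.0, 4.0], [3.0, 4.0]],
    1: [[0.3, 0.4], [0.3, 0.4]],
    2: [[6.0, 8.0]],
    3: [[0.0, 1.0], [1.0, 0.0]],
}

scores = reap_saliency(gate_values, expert_outputs)
pruned = experts_to_prune(scores, prune_fraction=0.25)
print("saliency:", scores)
print("pruned experts:", pruned)
```

Because the remaining experts and the router are left untouched, the router keeps choosing among the surviving experts exactly as before, which is the property the second bullet refers to.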
Zirel-3 Finetune
This finetune was trained on a custom-curated dataset designed to improve the model's capabilities across multiple domains, including instruction following, reasoning, and domain-specific knowledge. Training builds on the REAP-compressed GLM-4.5-Air base model.
Model Specifications
- Total Parameters: 82B (~12B active per token)
- Architecture: Sparse Mixture-of-Experts (SMoE)
- Context Length: 128K tokens
- Precision: BF16/FP16 compatible
- License: MIT
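As a rough guide to hardware sizing, the specifications above imply the following weight footprint. This is a back-of-the-envelope sketch: it assumes 2 bytes per parameter (BF16/FP16), even sharding across GPUs, and counts weights only, ignoring the KV cache, activations, and framework overhead. Although only about 12B parameters are active per token, all 82B must be resident in memory. The helper names are hypothetical.

```python
def weight_memory_gb(total_params_billion, bytes_per_param=2):
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * bytes_per_param


def per_gpu_weight_gb(total_params_billion, num_gpus, bytes_per_param=2):
    """Weight memory per GPU, assuming the weights are sharded evenly."""
    return weight_memory_gb(total_params_billion, bytes_per_param) / num_gpus


total_gb = weight_memory_gb(82)
print(f"BF16 weights: ~{total_gb} GB")
for gpus in (4, 8):
    print(f"{gpus} GPUs: ~{per_gpu_weight_gb(82, gpus)} GB of weights per GPU")
```

Whatever memory remains on each GPU after the weights are loaded is what is available for the KV cache, which is why long context lengths and large batch sizes need headroom beyond these figures.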
Usage
Installation
pip install transformers torch vllm openai
Inference with Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "Daemontatox/Zirel-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Prepare conversation
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain the REAP pruning method in simple terms."}
]

# Apply chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.1
)

# Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
Inference with vLLM (Recommended for Production)
vLLM provides significantly faster inference with built-in optimizations for MoE models:
# Serve the model
vllm serve Daemontatox/Zirel-3 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --max-num-seqs 64 \
    --dtype bfloat16
Python Client:
from openai import OpenAI

# Connect to vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Create completion
response = client.chat.completions.create(
    model="Daemontatox/Zirel-3",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Write a Python function to implement binary search."}
    ],
    temperature=0.7,
    max_tokens=512
)
print(response.choices[0].message.content)
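Where the openai package is not available, the same OpenAI-compatible endpoint can be reached with the Python standard library alone. The sketch below assumes the server is listening on http://localhost:8000, as in the client example above; the helper names `build_chat_request` and `send_chat_request` are hypothetical. Only the request construction runs here; the network call is left commented out so the snippet works without a running server.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000/v1",
                       model="Daemontatox/Zirel-3",
                       max_tokens=512, temperature=0.7):
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=120):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Write a Python function to implement binary search.")
print(request.full_url)
print(json.loads(request.data)["messages"])
# With a server running:
# print(send_chat_request(request))
```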
Streaming Response
from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_name = "Daemontatox/Zirel-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

messages = [
    {"role": "user", "content": "Explain quantum computing"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# The streamer yields decoded text chunks as tokens are generated
streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True, skip_prompt=True)
generation_kwargs = dict(
    **inputs,  # pass input_ids together with attention_mask
    streamer=streamer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

# Run generation in a background thread so the main thread can consume the stream
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for new_text in streamer:
    print(new_text, end='', flush=True)
thread.join()
vLLM Advanced Configuration
# Multi-GPU setup with expert parallelism
vllm serve Daemontatox/Zirel-3 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --max-num-seqs 64 \
    --max-model-len 32768 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.9 \
    --swap-space 16 \
    --disable-log-requests

# For low memory situations
vllm serve Daemontatox/Zirel-3 \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --max-num-seqs 32 \
    --max-model-len 16384 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.85
Batch Processing Example
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Daemontatox/Zirel-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Decoder-only models should be left-padded so generation continues from real tokens
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Batch of prompts
prompts = [
    "Explain machine learning",
    "Write a sorting algorithm",
    "What is the capital of France?"
]

# Convert to chat format
conversations = [
    [{"role": "user", "content": prompt}] for prompt in prompts
]

# Apply chat template to all
texts = [
    tokenizer.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)
    for conv in conversations
]

# Tokenize with padding
inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=2048
).to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id
)

# Decode only the newly generated tokens, not the echoed prompts
prompt_length = inputs["input_ids"].shape[1]
responses = tokenizer.batch_decode(outputs[:, prompt_length:], skip_special_tokens=True)
for prompt, response in zip(prompts, responses):
    print(f"Q: {prompt}\nA: {response}\n{'-'*50}")
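When prompts of different lengths are batched for a decoder-only model, padding is usually placed on the left so that every row ends on a real prompt token and the generated tokens start at the same column in every row. The toy sketch below illustrates this with plain lists of made-up token ids; `left_pad` and `strip_prompt` are hypothetical helpers, and the "generation" step is simulated by appending fixed ids rather than calling a model.

```python
def left_pad(batch, pad_id):
    """Pad every sequence on the left to the length of the longest one."""
    width = max(len(seq) for seq in batch)
    return [[pad_id] * (width - len(seq)) + seq for seq in batch]


def strip_prompt(outputs, prompt_width):
    """Keep only the tokens produced after the (padded) prompt."""
    return [row[prompt_width:] for row in outputs]


PAD = 0
prompts = [[5, 6, 7], [8]]            # two prompts of different lengths
padded = left_pad(prompts, PAD)       # [[5, 6, 7], [0, 0, 8]]
prompt_width = len(padded[0])

# Simulated generation: each row is extended with two new token ids.
outputs = [row + [100, 101] for row in padded]

new_tokens = strip_prompt(outputs, prompt_width)
print(padded)
print(new_tokens)
```

With right padding, the second row would end in pad tokens and the model would be asked to continue from padding, and a single slice index could no longer separate prompt from response for every row.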
Limitations
- This is a large MoE model requiring substantial compute resources
- Performance may vary based on hardware and optimization settings
- May inherit biases present in training data
- Requires careful prompt engineering for optimal results
Citation
If you use this model, please cite both the base model and the REAP paper:
@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}

@misc{zirel3,
  title={Zirel-3: A Specialized Finetune of GLM-4.5-Air-REAP},
  author={Daemontatox},
  year={2025},
  howpublished={\url{https://huggingface.co/Daemontatox/Zirel-3}}
}
Acknowledgments
This model builds upon:
- Cerebras Research for the REAP compression method and GLM-4.5-Air-REAP base model
- Original GLM-4.5-Air by Zhipu AI
- The open-source AI community for tooling and infrastructure
License
MIT License - Same as the base model.
Model tree for Daemontatox/Zirel-3
Base model
zai-org/GLM-4.5-Air