Instructions to use Daemontatox/Zirel-3 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Daemontatox/Zirel-3 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Daemontatox/Zirel-3")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Daemontatox/Zirel-3")
model = AutoModelForCausalLM.from_pretrained("Daemontatox/Zirel-3")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Daemontatox/Zirel-3 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Daemontatox/Zirel-3"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Daemontatox/Zirel-3",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
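
The same request can be sent from Python without extra dependencies. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send_chat_request` are hypothetical, and the base URL assumes the vLLM server started above is listening on its default port 8000.

```python
import json
import urllib.request


def build_chat_request(prompt, model="Daemontatox/Zirel-3",
                       base_url="http://localhost:8000/v1"):
    """Build a POST request for the OpenAI-compatible chat completions route.

    The default base_url is an assumption: it matches the curl example above.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
# request = build_chat_request("What is the capital of France?")
# print(send_chat_request(request))
```

Building the request separately from sending it makes the payload easy to inspect or log before any network call is made.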

Use Docker

docker model run hf.co/Daemontatox/Zirel-3

SGLang

How to use Daemontatox/Zirel-3 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Daemontatox/Zirel-3" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Daemontatox/Zirel-3",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Daemontatox/Zirel-3" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Daemontatox/Zirel-3",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Unsloth Studio

How to use Daemontatox/Zirel-3 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Daemontatox/Zirel-3 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Daemontatox/Zirel-3 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Daemontatox/Zirel-3 to start chatting

Load model with FastModel

# Install first (shell): pip install unsloth
from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name="Daemontatox/Zirel-3",
    max_seq_length=2048,
)

Docker Model Runner
How to use Daemontatox/Zirel-3 with Docker Model Runner:
```
docker model run hf.co/Daemontatox/Zirel-3
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Daemontatox/Zirel-3

Model Description

Zirel-3 is a specialized finetune of cerebras/GLM-4.5-Air-REAP-82B-A12B, a memory-efficient Mixture-of-Experts (MoE) model with 82B total parameters (about 12B active per token), compressed using the REAP (Router-weighted Expert Activation Pruning) technique.

Base Model: GLM-4.5-Air-REAP-82B-A12B

The base model is a compressed variant of GLM-4.5-Air that:

Maintains near-identical performance while being about 25% lighter (compressed from 106B to 82B total parameters)
Uses 82B total parameters (~12B activated per token)
Employs the REAP pruning method which outperforms expert merging, especially on generative tasks
Retains full capabilities: code generation, agentic workflows, repository-scale understanding, and function calling
Achieves drop-in compatibility with vanilla vLLM (no custom patches required)
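
To make these parameter counts concrete, here is a rough back-of-the-envelope sketch of the weight footprint. It assumes 2 bytes per parameter for BF16/FP16 and counts weights only (no KV cache, activations, or framework overhead); the function names are illustrative.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "fp32": 4}


def weight_memory_gb(num_params, dtype="bf16"):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def active_fraction(total_params, active_params):
    """Share of the parameters that take part in computing a single token."""
    return active_params / total_params


total = 82e9   # total parameters, from the model card
active = 12e9  # approximate active parameters per token, from the model card

print(f"BF16 weights: ~{weight_memory_gb(total):.0f} GB")
print(f"Active per token: ~{active_fraction(total, active):.1%}")
```

Although only about 12B parameters are used for any given token, all 82B must be available to the router, so the memory requirement follows the total count while per-token compute follows the active count.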

REAP Technology

REAP (Router-weighted Expert Activation Pruning) is a one-shot MoE compression method that:

Prunes low-impact experts based on router gate values and expert activation norms
Preserves the router's independent control over remaining experts
Significantly outperforms expert merging on generative benchmarks (code, creative writing, math)
Maintains 95-97% of baseline model quality even at high compression ratios
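
The idea of scoring experts by router gate values and activation norms can be illustrated with a toy sketch. This is not the paper's implementation: it assumes the score for an expert is the average, over the tokens routed to it, of the gate value times the L2 norm of the expert's output, and that the lowest-scoring experts are dropped. All names and numbers below are made up for illustration.

```python
import math


def reap_style_saliency(gates, outputs, top_k):
    """Toy saliency score per expert.

    gates[t][e]   : router weight of expert e for token t
    outputs[t][e] : output vector of expert e for token t
    Only the top_k experts by gate value count as "routed" for each token.
    """
    num_experts = len(gates[0])
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for token_gates, token_outputs in zip(gates, outputs):
        routed = sorted(range(num_experts),
                        key=lambda e: token_gates[e], reverse=True)[:top_k]
        for e in routed:
            norm = math.sqrt(sum(v * v for v in token_outputs[e]))
            totals[e] += token_gates[e] * norm
            counts[e] += 1
    # Experts that never received a token get a score of 0.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def experts_to_keep(saliency, prune_fraction):
    """Drop the lowest-scoring fraction of experts; return indices kept."""
    num_prune = int(len(saliency) * prune_fraction)
    ranked = sorted(range(len(saliency)), key=lambda e: saliency[e])
    return sorted(ranked[num_prune:])


# Two tokens, four experts, top-2 routing (made-up values).
gates = [[0.6, 0.3, 0.05, 0.05],
         [0.5, 0.1, 0.35, 0.05]]
outputs = [[[3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [0.1, 0.0]],
           [[0.0, 1.0], [2.0, 0.0], [6.0, 8.0], [0.0, 0.1]]]

scores = reap_style_saliency(gates, outputs, top_k=2)
print(experts_to_keep(scores, prune_fraction=0.25))
```

Because pruning removes whole experts rather than merging them, the router's weights for the surviving experts are left untouched, which is the property the second bullet above refers to.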

Paper: REAP the Experts: Why Pruning Prevails for One-Shot MoE compression (Lasby et al., 2025)

Zirel-3 Finetune

This finetune was trained on a custom curated dataset designed to enhance the model's overall capabilities across multiple domains including instruction following, reasoning, and domain-specific knowledge. The training process builds upon the strong foundation of the REAP-compressed GLM-4.5-Air base model.

Model Specifications

Total Parameters: 82B (12B active per token)
Architecture: Sparse Mixture-of-Experts (SMoE)
Context Length: 128K tokens
Precision: BF16/FP16 compatible
License: MIT

Usage

Installation

pip install transformers torch vllm

Inference with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "Daemontatox/Zirel-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Prepare conversation
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain the REAP pruning method in simple terms."}
]

# Apply chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.1
)

# Decode
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)

Inference with vLLM (Recommended for Production)

vLLM generally provides higher-throughput inference than plain Transformers generation and includes built-in optimizations for MoE models:

# Serve the model
vllm serve Daemontatox/Zirel-3 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --max-num-seqs 64 \
    --dtype bfloat16

Python Client:

from openai import OpenAI

# Connect to vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Create completion
response = client.chat.completions.create(
    model="Daemontatox/Zirel-3",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Write a Python function to implement binary search."}
    ],
    temperature=0.7,
    max_tokens=512
)

print(response.choices[0].message.content)

Streaming Response

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
import torch
from threading import Thread

model_name = "Daemontatox/Zirel-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

messages = [
    {"role": "user", "content": "Explain quantum computing"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True, skip_prompt=True)

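# Pass input_ids and attention_mask together so generate() receives the mask explicitly
generation_kwargs = dict(
    inputs,
    streamer=streamer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

# Run generation in a background thread; the streamer yields text as it is produced
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

for new_text in streamer:
    print(new_text, end='', flush=True)

# Wait for the generation thread to finish
thread.join()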
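
The streaming example relies on a producer/consumer pattern: generation runs in a background thread and pushes text into a streamer, while the main thread iterates over it. The sketch below shows that pattern with a toy stand-in; `ToyStreamer` and `toy_generate` are hypothetical names and are not how Transformers implements `TextIteratorStreamer`.

```python
from queue import Queue
from threading import Thread

_END = object()  # sentinel that marks the end of the stream


class ToyStreamer:
    """Iterator backed by a thread-safe queue."""

    def __init__(self):
        self._queue = Queue()

    def put(self, text):
        self._queue.put(text)

    def end(self):
        self._queue.put(_END)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()  # blocks until the producer adds something
        if item is _END:
            raise StopIteration
        return item


def toy_generate(words, streamer):
    """Stand-in for model.generate: emits one chunk per word, then stops."""
    for word in words:
        streamer.put(word + " ")
    streamer.end()


def stream_words(words):
    """Run the producer in a thread and collect chunks as they arrive."""
    streamer = ToyStreamer()
    thread = Thread(target=toy_generate,
                    kwargs={"words": words, "streamer": streamer})
    thread.start()
    chunks = [chunk for chunk in streamer]
    thread.join()
    return "".join(chunks)


print(stream_words(["streamed", "one", "chunk", "at", "a", "time"]))
```

The consumer loop blocks on the queue rather than polling, so text appears as soon as the producer emits it, and the sentinel ends the loop cleanly.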

vLLM Advanced Configuration

# Multi-GPU setup with expert parallelism
vllm serve Daemontatox/Zirel-3 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --max-num-seqs 64 \
    --max-model-len 32768 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.9 \
    --swap-space 16 \
    --disable-log-requests

# For low memory situations
vllm serve Daemontatox/Zirel-3 \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --max-num-seqs 32 \
    --max-model-len 16384 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.85
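
As a rough guide to choosing between these two configurations, the sketch below estimates how much memory per GPU remains for the KV cache after the weights are loaded. It is an approximation under stated assumptions: 80 GB GPUs (hypothetical hardware), BF16 weights at 2 bytes per parameter, and weights split evenly across tensor-parallel ranks. vLLM's real accounting also covers activations and other overhead, so actual headroom will be smaller.

```python
def per_gpu_budget_gb(gpu_memory_gb, gpu_memory_utilization, tensor_parallel_size,
                      total_params=82e9, bytes_per_param=2):
    """Return (usable, weight_shard, remaining) memory per GPU in GB.

    Assumes an even split of the weights across GPUs, which is a
    simplification of how the model is actually sharded.
    """
    usable = gpu_memory_gb * gpu_memory_utilization
    weight_shard = total_params * bytes_per_param / 1e9 / tensor_parallel_size
    return usable, weight_shard, usable - weight_shard


# The two configurations above, on assumed 80 GB GPUs.
for tp, util in [(8, 0.9), (4, 0.85)]:
    usable, shard, remaining = per_gpu_budget_gb(80, util, tp)
    print(f"TP={tp}: usable {usable:.1f} GB, weights {shard:.1f} GB, "
          f"remaining {remaining:.1f} GB")
```

Under these assumptions the 8-GPU setup leaves roughly 51.5 GB per GPU and the 4-GPU setup roughly 27 GB, which is consistent with the smaller `--max-model-len` and `--max-num-seqs` values in the low-memory command.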

Batch Processing Example

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Daemontatox/Zirel-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Batch of prompts
prompts = [
    "Explain machine learning",
    "Write a sorting algorithm",
    "What is the capital of France?"
]

# Convert to chat format
conversations = [
    [{"role": "user", "content": prompt}] for prompt in prompts
]

# Apply chat template to all
texts = [
    tokenizer.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)
    for conv in conversations
]

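# Decoder-only models should be left-padded for batched generation,
# so each prompt ends right where generation begins
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Tokenize with padding
inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=2048
).to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id
)

# Decode only the newly generated tokens (skip the prompt portion)
prompt_length = inputs["input_ids"].shape[1]
responses = tokenizer.batch_decode(outputs[:, prompt_length:], skip_special_tokens=True)
for prompt, response in zip(prompts, responses):
    print(f"Q: {prompt}\nA: {response}\n{'-'*50}")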
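
For batched generation with a decoder-only model, prompts are usually padded on the left so that each prompt's last real token sits directly before the first generated token, and `generate()` returns the prompt followed by the continuation. The toy sketch below illustrates both points with plain Python lists; `left_pad` and `strip_prompt` are hypothetical helpers, not tokenizer APIs.

```python
def left_pad(sequences, pad_id):
    """Pad token-id lists on the left to a common length.

    Returns (input_ids, attention_mask); the mask is 0 over padding.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        num_pad = max_len - len(seq)
        input_ids.append([pad_id] * num_pad + list(seq))
        attention_mask.append([0] * num_pad + [1] * len(seq))
    return input_ids, attention_mask


def strip_prompt(output_ids, prompt_length):
    """Keep only the tokens generated after the (padded) prompt."""
    return [row[prompt_length:] for row in output_ids]


# Two prompts of different lengths (made-up token ids).
ids, mask = left_pad([[5, 6, 7], [8]], pad_id=0)
print(ids)
print(mask)

# Pretend each row was extended by two generated tokens.
generated = [ids[0] + [9, 9], ids[1] + [4, 4]]
print(strip_prompt(generated, prompt_length=len(ids[0])))
```

With left padding every row has the same prompt length, so a single slice removes the prompt from all outputs at once.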

Limitations

This is a large MoE model requiring substantial compute resources
Performance may vary based on hardware and optimization settings
May inherit biases present in training data
Requires careful prompt engineering for optimal results

Citation

If you use this model, please cite both the base model and the REAP paper:

@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}

@misc{zirel3,
  title={Zirel-3: A Specialized Finetune of GLM-4.5-Air-REAP},
  author={Daemontatox},
  year={2025},
  howpublished={\url{https://huggingface.co/Daemontatox/Zirel-3}}
}

Acknowledgments

This model builds upon:

Cerebras Research for the REAP compression method and GLM-4.5-Air-REAP base model
Original GLM-4.5-Air by Zhipu AI
The open-source AI community for tooling and infrastructure