Molmo-72B AWQ 4-bit (Text-Only Quantization)

This is a 4-bit AWQ-quantized version of allenai/Molmo-72B-0924, produced with LLM Compressor.

Key Features

✅ Qwen2-72B text decoder quantized (4-bit AWQ) - ~74% size reduction
✅ OpenAI CLIP vision encoder preserved (FP16) - maintains visual quality
✅ State-of-the-art VLM performance - among the best open VLMs
✅ Smart quantization - Only LLM layers quantized, vision parts untouched
✅ vLLM compatible - Fast inference with vLLM
✅ Trained on PixMo - 1M curated image-text pairs

Model Details

Base Model: allenai/Molmo-72B-0924 (73B parameters)
Architecture: Molmo (Qwen2-72B decoder + OpenAI CLIP vision encoder)
Quantization Method: AWQ (Activation-aware Weight Quantization)
Quantization Scheme: W4A16 (4-bit weights, 16-bit activations)
Calibration Dataset: Flickr30k (512 samples)
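
To make the W4A16 scheme concrete, here is a minimal sketch of group-wise 4-bit weight quantization in plain Python. It assumes symmetric round-to-nearest quantization with one scale per group and a toy group size of 4; the actual group size and symmetry settings used for this checkpoint are not stated above, and the function names are illustrative only.

```python
def quantize_w4_symmetric(weights, group_size=4):
    """Quantize a flat list of weights to signed 4-bit codes, one scale per group."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to code 7; guard all-zero groups.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        # Signed 4-bit integers cover the range [-8, 7].
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize(codes, scales, group_size=4):
    """Map 4-bit codes back to floats, as done before the 16-bit matmul."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


# Toy weights (assumed values, not taken from the model).
weights = [0.12, -0.40, 0.33, 0.05, 1.50, -0.90, 0.02, 0.75]
codes, scales = quantize_w4_symmetric(weights)
restored = dequantize(codes, scales)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print("codes:", codes)
print("max abs error:", max(errors))
```

Only the weights are stored as 4-bit codes; activations stay in 16-bit, which is what the "A16" half of the name refers to.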

Size Comparison

Metric	Value
Original (FP16)	~145.0 GB
Quantized (W4A16)	~37.78 GB
Reduction	~73.9%
Memory Saved	~107.2 GB
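
The reduction and savings rows follow directly from the two sizes in the table. A short check, using only the numbers listed above:

```python
original_gb = 145.0    # FP16 checkpoint size from the table
quantized_gb = 37.78   # W4A16 checkpoint size from the table

saved_gb = original_gb - quantized_gb
reduction_pct = 100 * saved_gb / original_gb
compression_ratio = original_gb / quantized_gb

print(f"Memory saved: {saved_gb:.1f} GB")
print(f"Reduction: {reduction_pct:.1f}%")
print(f"Compression ratio: {compression_ratio:.2f}x")
```

Going from 16-bit to 4-bit weights alone would give a 75% reduction. The slightly lower figure here is consistent with some components staying in FP16 and with the per-group scales that quantized weights carry.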

What Was Quantized

Quantized (4-bit):

Qwen2-72B decoder layers (text/language model)
Text processing linear layers in the decoder

Preserved (FP16):

OpenAI CLIP vision encoder (maintains visual understanding quality)
Vision-text connectors
Embeddings
Language model head

This selective quantization is intended to keep vision understanding quality close to the original model while significantly reducing size.
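
The split above can be pictured as a name-based filter over the model's modules. The sketch below is an assumption-laden illustration: the module names and ignore patterns are hypothetical and are not taken from the actual Molmo checkpoint or from the recipe used for this model.

```python
import re

# Hypothetical module names, for illustration only; the real checkpoint
# may name its layers differently.
MODULES = [
    "vision_backbone.blocks.0.attn.proj",
    "vision_backbone.image_projector.w1",
    "transformer.wte",
    "transformer.blocks.0.att_proj",
    "transformer.blocks.0.ff_proj",
    "transformer.blocks.79.ff_out",
    "transformer.lm_head",
]

# Hypothetical patterns for the parts kept in FP16: vision encoder and
# connector, embeddings, and the language model head.
IGNORE_PATTERNS = [r"^vision_backbone\.", r"\.wte$", r"\.lm_head$"]


def split_modules(names, ignore_patterns):
    """Return (modules to quantize, modules preserved in FP16)."""
    quantize, keep_fp16 = [], []
    for name in names:
        if any(re.search(pattern, name) for pattern in ignore_patterns):
            keep_fp16.append(name)
        else:
            quantize.append(name)
    return quantize, keep_fp16


to_quantize, preserved = split_modules(MODULES, IGNORE_PATTERNS)
print("4-bit:", to_quantize)
print("FP16: ", preserved)
```

With these assumed names, only the decoder block projections land in the 4-bit list.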

About Molmo-72B

Molmo-72B is the largest model in the Molmo family of open vision-language models from the Allen Institute for AI:

Text Decoder: Qwen2-72B (state-of-the-art 72B LLM)
Vision Encoder: OpenAI CLIP (proven vision backbone)
Training Data: PixMo - 1 million highly-curated image-text pairs
Performance: Competitive with GPT-4V on many benchmarks

Usage

from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests

# Load model and processor
processor = AutoProcessor.from_pretrained(
    "ronantakizawa/molmo-72b-awq-w4a16",
    trust_remote_code=True,
    torch_dtype='auto',
    device_map='auto'
)

model = AutoModelForCausalLM.from_pretrained(
    "ronantakizawa/molmo-72b-awq-w4a16",
    trust_remote_code=True,
    torch_dtype='auto',
    device_map='auto'
)

# Process the image and text
inputs = processor.process(
    images=[Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)],
    text="Describe what you see in this image."
)

# Move inputs to the correct device and make a batch of size 1
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# Generate output
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer
)

# Decode the generated tokens
generated_tokens = output[0, inputs['input_ids'].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(generated_text)

Quantization Details

Method: AWQ (Activation-aware Weight Quantization)
Pipeline: Independent pipeline, run with BasicPipeline for layer-by-layer quantization
Calibration: 512 Flickr30k image-text pairs
Max Sequence Length: 2048 tokens
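Why AWQ: It uses activation statistics from the calibration data to find the most salient weight channels and scales them before quantization, which reduces their quantization error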
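
The core idea behind activation-aware scaling can be sketched in a few lines of plain Python. This is a simplified illustration, not the LLM Compressor implementation: it assumes a single weight row, a fixed exponent `alpha=0.5` (AWQ searches over this value), and toy calibration activations invented for the example.

```python
def quantize_row(row, levels=7):
    """Symmetric round-to-nearest 4-bit quantization of one weight row."""
    max_abs = max(abs(w) for w in row) or 1.0
    scale = max_abs / levels
    return [max(-levels - 1, min(levels, round(w / scale))) * scale for w in row]


def awq_style_scales(activations, alpha=0.5):
    """Per-input-channel scale: mean absolute activation raised to alpha."""
    count = len(activations)
    channels = len(activations[0])
    means = [sum(abs(x[c]) for x in activations) / count for c in range(channels)]
    return [m ** alpha for m in means]


def dot(w, x):
    return sum(a * b for a, b in zip(w, x))


# Toy calibration activations (assumed): channel 0 is much larger than the rest.
calib = [[8.0, 0.5, 0.2], [10.0, -0.4, 0.3], [-9.0, 0.6, -0.1]]
weights = [0.11, 0.90, -0.65]

scales = awq_style_scales(calib)
# Scale weights up and activations down by the same factor per channel,
# so the unquantized output is mathematically unchanged.
scaled_weights = [w * s for w, s in zip(weights, scales)]

x = calib[0]
x_scaled = [xi / s for xi, s in zip(x, scales)]

exact = dot(weights, x)
plain_error = abs(dot(quantize_row(weights), x) - exact)
scaled_error = abs(dot(quantize_row(scaled_weights), x_scaled) - exact)

print("error without scaling:", plain_error)
print("error with activation-aware scaling:", scaled_error)
```

Channels with large activations receive larger scales, so their weights occupy more of the 4-bit grid. The script prints both output errors so the effect can be compared on the toy input.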

Limitations

Complex text generation may degrade slightly compared to the FP16 model
Vision encoder is NOT quantized (intentional for quality)
Requires vLLM or transformers with AWQ support

Important Notes

Transparent Images

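Convert images to RGB before passing them to the processor. Images with an alpha channel (for example, transparent PNGs) may otherwise load in RGBA mode: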

from PIL import Image
image = Image.open(...)
if image.mode != "RGB":
    image = image.convert("RGB")

License

Apache 2.0 (same as base model)

Citation

@misc{molmo-72b-awq,
  title={Molmo-72B AWQ 4-bit},
  author={Quantized by ronantakizawa},
  year={2025},
  url={https://huggingface.co/ronantakizawa/molmo-72b-awq-w4a16}
}

Acknowledgements

Base model by Allen Institute for AI
Quantization using LLM Compressor

🤖 Generated with LLM Compressor
