How to use dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq")
```

```python
# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq", dtype="auto")
```
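
The pipeline can then be called directly, e.g. with a chat-style message list (a quick usage sketch; the prompt is illustrative):

```python
# Quick usage sketch for the pipeline loaded above; the prompt is illustrative.
messages = [{"role": "user", "content": "Who are you?"}]
print(pipe(messages, max_new_tokens=128)[0]["generated_text"])
```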
How to use dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq with vLLM:

```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq"
```

```shell
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```
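
Because the endpoint is OpenAI-compatible, you can also query it from Python with the openai client instead of curl (a minimal sketch; assumes `pip install openai` and the default server address used above):

```python
# Minimal sketch: query the vLLM server via its OpenAI-compatible API.
# Assumes `pip install openai` and the server started as shown above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(resp.choices[0].text)
```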
How to use dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq with SGLang:

```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq" \
    --host 0.0.0.0 \
    --port 30000
```

```shell
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```

Or run the SGLang server with Docker instead:

```shell
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq" \
--host 0.0.0.0 \
--port 30000
```

Then call the server with the same curl request shown above.

How to use dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq with Docker Model Runner:
```shell
docker model run hf.co/dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq
```
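
You can also pass a one-shot prompt directly on the command line (a sketch; assumes Docker Model Runner is enabled in your Docker installation):

```shell
# One-shot prompt (sketch; the prompt is illustrative).
docker model run hf.co/dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq "Once upon a time,"
```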
This is an HQQ all 4-bit (group-size=64) quantized Llama3.1-8B-Instruct model. We provide two versions: one quantized without calibration and one calibrated version (see the two model IDs in the sample code below). The tables below compare memory footprint, decoding speed, and benchmark accuracy against the fp16 baseline and other 4-bit quantization methods.

| Models | fp16 | HQQ 4-bit/gs-64 | AWQ 4-bit | GPTQ 4-bit |
|---|---|---|---|---|
| Bitrate (Linear layers) | 16 | 4.5 | 4.25 | 4.25 |
| VRAM (GB) | 15.7 | 6.1 | 6.3 | 5.7 |
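
The 4.5 bits/weight figure for HQQ follows from the group size: with 4-bit weights and, assuming an fp16 scale and an fp16 zero-point stored per group of 64 weights (an assumption about the storage layout, not stated in this card), the metadata adds 32/64 = 0.5 bits per weight:

```python
# Back-of-the-envelope effective bitrate for 4-bit HQQ with group-size 64.
# Assumes one fp16 scale and one fp16 zero-point per group (the storage
# layout is an assumption, not taken from this card).
weight_bits = 4
group_size  = 64
meta_bits   = 16 + 16  # fp16 scale + fp16 zero-point per group
print(weight_bits + meta_bits / group_size)  # -> 4.5 bits/weight, as in the table
```
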
| Models | fp16 | HQQ 4-bit/gs-64 | AWQ 4-bit | GPTQ 4-bit |
|---|---|---|---|---|
| Decoding* - short seq (tokens/sec) | 53 | 125 | 67 | 3.7 |
| Decoding* - long seq (tokens/sec) | 50 | 97 | 65 | 21 |
*: decoding speed measured on an RTX 3090.
| Models | fp16 | HQQ 4-bit/gs-64 (no calib) | HQQ 4-bit/gs-64 (calib) | AWQ 4-bit | GPTQ 4-bit |
|---|---|---|---|---|---|
| ARC (25-shot) | 60.49 | 60.32 | 60.92 | 57.85 | 61.18 |
| HellaSwag (10-shot) | 80.16 | 79.21 | 79.52 | 79.28 | 77.82 |
| MMLU (5-shot) | 68.98 | 67.07 | 67.74 | 67.14 | 67.93 |
| TruthfulQA-MC2 | 54.03 | 53.89 | 54.11 | 51.87 | 53.58 |
| Winogrande (5-shot) | 77.98 | 76.24 | 76.48 | 76.4 | 76.64 |
| GSM8K (5-shot) | 75.44 | 71.27 | 75.36 | 73.47 | 72.25 |
| Average | 69.51 | 68.00 | 69.02 | 67.67 | 68.23 |
| Relative performance | 100% | 97.83% | 99.3% | 97.35% | 98.16% |
You can reproduce the results above with the lm-evaluation-harness (`pip install lm-eval==0.4.3`).
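
A typical invocation looks something like the following (a sketch only; the exact task names and whatever extra setup lm-eval needs to load the HQQ checkpoint are assumptions on my part):

```shell
# Sketch of an lm-eval run (task name and flags are illustrative, not from this card).
lm_eval --model hf \
    --model_args pretrained=dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq,dtype=bfloat16 \
    --tasks arc_challenge \
    --num_fewshot 25 \
    --batch_size 8
```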
First, install the dependencies:

```shell
pip install git+https://github.com/mobiusml/hqq.git  # master branch fix
pip install bitblas  # only needed if you want to use the bitblas backend
```
Also, make sure you are using torch 2.4.0 or later (or a nightly build) with CUDA 12.1 or later.
Then you can use the sample code below:
```python
import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.utils.patching import prepare_for_inference
from hqq.core.quantize import *
from hqq.utils.generation_hf import HFGenerator

# Settings
###################################################
# Backend: "torchao_int4" (4-bit only), "bitblas" (4-bit + 2-bit) or "gemlite" (8-bit, 4-bit, 2-bit, 1-bit)
backend       = "torchao_int4"
compute_dtype = torch.bfloat16 if backend == "torchao_int4" else torch.float16
device        = 'cuda:0'
cache_dir     = '.'

# Load the model
###################################################
model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq'         # no-calib version
# model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib' # calibrated version

model     = AutoHQQHFModel.from_quantized(model_id, cache_dir=cache_dir, compute_dtype=compute_dtype, device=device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)

# Use optimized inference kernels
###################################################
prepare_for_inference(model, backend=backend)

# Generate
###################################################
# For longer context, make sure to allocate enough cache via the cache_size= parameter
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup()  # warm-up takes a while

gen.generate("Write an essay about large language models", print_tokens=True)
gen.generate("Tell me a funny joke!", print_tokens=True)
gen.generate("How to make a yummy chocolate cake?", print_tokens=True)
```