How to use dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq")
```

```python
# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq", dtype="auto")
```
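
The pipeline can then be called directly, e.g. with a chat-style message list (a quick usage sketch; the prompt is illustrative):

```python
# Quick usage sketch for the pipeline loaded above; the prompt is illustrative.
messages = [{"role": "user", "content": "Who are you?"}]
print(pipe(messages, max_new_tokens=128)[0]["generated_text"])
```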
How to use dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq with vLLM:

```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq"
```

```shell
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```
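
Because the endpoint is OpenAI-compatible, you can also query it from Python with the openai client instead of curl (a minimal sketch; assumes `pip install openai` and the default server address used above):

```python
# Minimal sketch: query the vLLM server via its OpenAI-compatible API.
# Assumes `pip install openai` and the server started as shown above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(resp.choices[0].text)
```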
How to use dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq with SGLang:

```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq" \
    --host 0.0.0.0 \
    --port 30000
```

```shell
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```

Or run the SGLang server with Docker instead:

```shell
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq" \
--host 0.0.0.0 \
--port 30000
```

Then call the server with the same curl request shown above.

How to use dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq with Docker Model Runner:
```shell
docker model run hf.co/dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq
```
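
You can also pass a one-shot prompt directly on the command line (a sketch; assumes Docker Model Runner is enabled in your Docker installation):

```shell
# One-shot prompt (sketch; the prompt is illustrative).
docker model run hf.co/dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq "Once upon a time,"
```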
This is an HQQ all 4-bit (group-size=64) quantized Llama3.1-8B-Instruct model. We provide two versions: one quantized without calibration and one calibrated version (see the two model IDs in the sample code below). The tables below compare memory footprint, decoding speed, and benchmark accuracy against the fp16 baseline and other 4-bit quantization methods.

| Models | fp16 | HQQ 4-bit/gs-64 | AWQ 4-bit | GPTQ 4-bit |
|---|---|---|---|---|
| Bitrate (Linear layers) | 16 | 4.5 | 4.25 | 4.25 |
| VRAM (GB) | 15.7 | 6.1 | 6.3 | 5.7 |
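
The 4.5 bits/weight figure for HQQ follows from the group size: with 4-bit weights and, assuming an fp16 scale and an fp16 zero-point stored per group of 64 weights (an assumption about the storage layout, not stated in this card), the metadata adds 32/64 = 0.5 bits per weight:

```python
# Back-of-the-envelope effective bitrate for 4-bit HQQ with group-size 64.
# Assumes one fp16 scale and one fp16 zero-point per group (the storage
# layout is an assumption, not taken from this card).
weight_bits = 4
group_size  = 64
meta_bits   = 16 + 16  # fp16 scale + fp16 zero-point per group
print(weight_bits + meta_bits / group_size)  # -> 4.5 bits/weight, as in the table
```
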
| Models | fp16 | HQQ 4-bit/gs-64 | AWQ 4-bit | GPTQ 4-bit |
|---|---|---|---|---|
| Decoding* - short seq (tokens/sec) | 53 | 125 | 67 | 3.7 |
| Decoding* - long seq (tokens/sec) | 50 | 97 | 65 | 21 |
*: decoding speed measured on an RTX 3090.
| Models | fp16 | HQQ 4-bit/gs-64 (no calib) | HQQ 4-bit/gs-64 (calib) | AWQ 4-bit | GPTQ 4-bit |
|---|---|---|---|---|---|
| ARC (25-shot) | 60.49 | 60.32 | 60.92 | 57.85 | 61.18 |
| HellaSwag (10-shot) | 80.16 | 79.21 | 79.52 | 79.28 | 77.82 |
| MMLU (5-shot) | 68.98 | 67.07 | 67.74 | 67.14 | 67.93 |
| TruthfulQA-MC2 | 54.03 | 53.89 | 54.11 | 51.87 | 53.58 |
| Winogrande (5-shot) | 77.98 | 76.24 | 76.48 | 76.4 | 76.64 |
| GSM8K (5-shot) | 75.44 | 71.27 | 75.36 | 73.47 | 72.25 |
| Average | 69.51 | 68.00 | 69.02 | 67.67 | 68.23 |
| Relative performance | 100% | 97.83% | 99.3% | 97.35% | 98.16% |
You can reproduce the results above with the lm-evaluation-harness (`pip install lm-eval==0.4.3`).
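
A typical invocation looks something like the following (a sketch only; the exact task names and whatever extra setup lm-eval needs to load the HQQ checkpoint are assumptions on my part):

```shell
# Sketch of an lm-eval run (task name and flags are illustrative, not from this card).
lm_eval --model hf \
    --model_args pretrained=dropbox-dash/Llama-3.1-8b-instruct_4bitgs64_hqq,dtype=bfloat16 \
    --tasks arc_challenge \
    --num_fewshot 25 \
    --batch_size 8
```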
First, install the dependencies:

```shell
pip install git+https://github.com/mobiusml/hqq.git  # master branch fix
pip install bitblas  # only needed if you want to use the bitblas backend
```
Also, make sure you are using torch 2.4.0 or later (or a nightly build) with CUDA 12.1 or later.
Then you can use the sample code below:
```python
import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.utils.patching import prepare_for_inference
from hqq.core.quantize import *
from hqq.utils.generation_hf import HFGenerator

# Settings
###################################################
# Backend: "torchao_int4" (4-bit only), "bitblas" (4-bit + 2-bit) or "gemlite" (8-bit, 4-bit, 2-bit, 1-bit)
backend       = "torchao_int4"
compute_dtype = torch.bfloat16 if backend == "torchao_int4" else torch.float16
device        = 'cuda:0'
cache_dir     = '.'

# Load the model
###################################################
model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq'         # no-calib version
# model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib' # calibrated version

model     = AutoHQQHFModel.from_quantized(model_id, cache_dir=cache_dir, compute_dtype=compute_dtype, device=device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)

# Use optimized inference kernels
###################################################
prepare_for_inference(model, backend=backend)

# Generate
###################################################
# For longer context, make sure to allocate enough cache via the cache_size= parameter
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup()  # warm-up takes a while

gen.generate("Write an essay about large language models", print_tokens=True)
gen.generate("Tell me a funny joke!", print_tokens=True)
gen.generate("How to make a yummy chocolate cake?", print_tokens=True)
```