There were all sorts of issues across the libraries involved and things got fairly tangled, so I ran some local tests first to sort them out before drafting anything:
I did a bit more validation and I think it is useful to keep the current state split into separate buckets. This follows the same triage style as John6666’s earlier issue-draft post in this thread, but adds fresh local validation for the merge/parity bucket.
Short version
I would currently separate the situation into these issue families:
- Primary original issue from this thread: PEFT adapter loading on an already CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 base still looks like its own split/offload problem.
- Already fixed or release-improved pieces: some `_is_hf_initialized` / bnb parameter-moving behavior appears fixed or improved upstream. Gemma 4 `device_map="auto"` support also appears improved in recent Transformers releases.
- Open PEFT target-module issue: `Gemma4ClippableLinear` is still a real PEFT target-module blocker for broad-target Gemma 4 adapters.
- Separate local finding, direct 4-bit merge parity: direct `merge_and_unload()` into a bnb 4-bit base still did not reproduce adapter-loaded output in my latest local validation. Reloading the base in bf16, loading the same adapter, and merging there still restored output parity.
So I would not collapse all of these into one GitHub issue.
1. Primary original issue: split/offloaded bnb 4-bit Gemma 4 + PEFT adapter load
I still think the primary issue from this thread is best described as:
PEFT adapter loading fails on an already CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 base model.
The important contrast is:
- all-GPU bnb 4-bit + PEFT: works / can work
- CPU/GPU split-dispatched bnb 4-bit + PEFT: fails in offload / dispatch / hook / quant-state paths
I would not file this as simply "CPU offload is broken" or "PEFT + bitsandbytes is broken".
Those are too broad. The all-GPU path can work.
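For context, this is roughly the shape of the failing path as I understand it. A minimal sketch, assuming the model and adapter IDs already discussed in this thread and a forced CPU/GPU split via `max_memory` (the memory caps are placeholders, not the exact values from the original report):

```python
# Sketch of the split-dispatch + PEFT adapter-load path (not a verified repro;
# model/adapter IDs are from this thread, memory caps are placeholders).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

BASE_MODEL_ID = "unsloth/gemma-4-E2B-it"
ADAPTER_ID = "welyjesch/filipino_Gemma4_E2B_FT_lora"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    llm_int8_enable_fp32_cpu_offload=True,  # needed when part of the model lands on CPU
)

# Force a CPU/GPU split instead of an all-GPU placement.
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    max_memory={0: "4GiB", "cpu": "24GiB"},  # placeholder caps to trigger offload
)

# All-GPU placements can work; the split-dispatched case is where adapter loading fails.
peft_model = PeftModel.from_pretrained(base, ADAPTER_ID)
```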
A better primary issue title, if this is filed, would be something like:
PeftModel.from_pretrained fails on CPU/GPU-dispatched bitsandbytes 4-bit Gemma4 during adapter load / dispatch hooks
I would probably file this first under huggingface/transformers, while cross-linking PEFT, Accelerate, and bitsandbytes, because the failure crosses model integration, device_map/offload behavior, adapter loading, dispatch hooks, and bnb quantized state.
Related public trackers that seem relevant but not identical:
- Transformers #43873 — offloading not working as expected with quantization
- PEFT #3169 — LoRA + BnB INT8 + CPU offload wrong device
- PEFT PR #3181 — normalize output device for CPU-offloaded BnB layers
- Transformers #45482 — Gemma4 cross-device CPU offload errors
2. _is_hf_initialized looks partly fixed, but it is not the whole story
There is a closed Transformers issue for the `_is_hf_initialized` family (Transformers #43872), and there is also a merged Accelerate PR (#3976) that is described as fixing issues when trying to move weights with bnb.
So I would classify _is_hf_initialized as:
fixed_or_release_improved_subproblem
But I would not say the original Forum issue is solved just because that subpath improved. The split/offload path still has other failure modes, especially meta tensor / dispatch / quant-state / cross-device behavior.
3. Gemma4ClippableLinear is still a separate current PEFT blocker
There is already an open PEFT issue for this: PEFT #3129.
I re-checked this in a v3.8 local validation run with current packages:
| package | version |
|---|---|
| torch | 2.10.0+cu128 |
| transformers | 5.6.2 |
| accelerate | 1.13.0 |
| peft | 0.19.1 |
| bitsandbytes | 0.49.2 |
| GPU | Tesla T4 |
The broad-target adapter used for this check was Ayodele01/gemma-4-E2B-Gemini-3.1-Pro-Reasoning-Distill.
The target scan found:
| item | count |
|---|---|
| `Gemma4ClippableLinear` hits | 148 |
| `Linear4bit` hits | 205 |
| total target matches | 353 |
Adapter load still failed with:
GEMMA4_CLIPPABLE_LINEAR_UNSUPPORTED
So the current classification from this lane is:
GEMMA4_CLIPPABLE_LINEAR_STILL_BLOCKS
This should stay separate from the split/offload issue and also separate from merge-output parity. It is an adapter-target / module-type compatibility issue. Broad targets such as q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj can hit Gemma 4 wrapper modules that current PEFT does not accept.
For reporting purposes, I would use PEFT #3129 as the main tracker and not create a duplicate unless maintainers want a separate report with scanner output.
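For reference, the target scan above was essentially a `named_modules()` walk over the loaded 4-bit base. A minimal sketch of that check; the `TARGETS` set and class-name counting are my own choices for this post, not a PEFT API:

```python
# Count which module classes the broad LoRA target names would hit on the loaded base.
from collections import Counter

TARGETS = {"q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"}

def scan_targets(model):
    hits = Counter()
    for name, module in model.named_modules():
        leaf = name.rsplit(".", 1)[-1]
        if leaf in TARGETS:
            hits[type(module).__name__] += 1  # e.g. Linear4bit, Gemma4ClippableLinear
    return hits

# Example: hits = scan_targets(base); print(hits, sum(hits.values()))
```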
Reference docs for why broad `target_modules` matter (the PEFT LoRA API docs and the Transformers bitsandbytes quantization docs) are linked in section 7.
4. New local validation: direct bnb 4-bit merge still diverges
This is not the original CPU-offload issue. It is a separate PEFT / bitsandbytes / QLoRA workflow finding.
I re-ran the local parity check in v3.8 with:
- base: unsloth/gemma-4-E2B-it
- adapter: fulvian/gemma-4-e2b-medical-qlora-adapter
Direct 4-bit merge lane
- 4-bit base + adapter-loaded inference: adapter output differs from base output
- direct `merge_and_unload()` into bnb 4-bit base: merged output does not match adapter-loaded output
The v3.8 classification was:
DIRECT_4BIT_MERGE_STILL_DIVERGES
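For reference, the direct 4-bit lane had roughly this shape. A sketch with generation settings omitted; the IDs are the ones listed above, and the quantization config details are my assumptions:

```python
# Sketch of the direct 4-bit merge lane: load the bnb 4-bit base, attach the adapter,
# then merge the LoRA weights directly into the quantized Linear4bit modules.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

BASE_MODEL_ID = "unsloth/gemma-4-E2B-it"
ADAPTER_ID = "fulvian/gemma-4-e2b-medical-qlora-adapter"

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

base_4bit = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID, quantization_config=bnb_config, device_map={"": 0}
)
peft_4bit = PeftModel.from_pretrained(base_4bit, ADAPTER_ID)

# This is the step that emits the per-module "rounding errors" warning and, in my runs,
# did not reproduce the adapter-loaded outputs.
merged_4bit = peft_4bit.merge_and_unload()
```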
Prompt-level summary:
| classification | count |
|---|---|
| DIRECT_4BIT_MERGED_MATCHES_NEITHER | 2 |
| DIRECT_4BIT_MERGED_MATCHES_BASE | 1 |
Prompt table:
| prompt | classification | base != adapter | adapter == merged | base == merged |
|---|---|---|---|---|
| p01_lora_short | DIRECT_4BIT_MERGED_MATCHES_NEITHER | True | False | False |
| p02_hypertension_short | DIRECT_4BIT_MERGED_MATCHES_BASE | True | False | True |
| p03_medical_tutor | DIRECT_4BIT_MERGED_MATCHES_NEITHER | True | False | False |
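The per-prompt labels follow directly from the three boolean columns. A minimal sketch of that classification logic, assuming exact string comparison of decoded outputs (the `ADAPTER_INEFFECTIVE` label is my own placeholder for the case that did not occur here):

```python
# Classify one prompt's outputs by exact string comparison of decoded generations.
def classify(base_out: str, adapter_out: str, merged_out: str) -> str:
    if base_out == adapter_out:
        return "ADAPTER_INEFFECTIVE"  # adapter did not change the base output
    if merged_out == adapter_out:
        return "DIRECT_4BIT_MERGED_MATCHES_ADAPTER"
    if merged_out == base_out:
        return "DIRECT_4BIT_MERGED_MATCHES_BASE"
    return "DIRECT_4BIT_MERGED_MATCHES_NEITHER"
```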
The direct 4-bit merge emitted 525 warnings:
Merge lora module to 4-bit linear may get different generations due to rounding errors.
Interpretation:
- the adapter is active, because adapter-loaded output differs from the base output;
- direct merge into the bnb 4-bit base does not reproduce adapter-loaded output;
- in this run, direct merged output was base-like for one prompt and a third output for two prompts.
Fresh bf16 base merge lane
Then I used a fresh non-quantized bf16 base:
```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Reload the base in bf16 (no bnb quantization), attach the same adapter, then merge.
bf16_base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    device_map={"": 0},
    torch_dtype=torch.bfloat16,
)
bf16_peft = PeftModel.from_pretrained(bf16_base, ADAPTER_ID)
merged_bf16 = bf16_peft.merge_and_unload()
```
The v3.8 classification was:
FRESH_BF16_MERGE_STILL_PASSES
Prompt-level summary:
| classification | count |
|---|---|
| BF16_MERGED_MATCHES_BF16_ADAPTER | 3 |
Prompt table:
| prompt | classification | bf16 adapter == bf16 merged |
|---|---|---|
| p01_lora_short | BF16_MERGED_MATCHES_BF16_ADAPTER | True |
| p02_hypertension_short | BF16_MERGED_MATCHES_BF16_ADAPTER | True |
| p03_medical_tutor | BF16_MERGED_MATCHES_BF16_ADAPTER | True |
The fresh bf16 merge emitted 0 merge warnings.
Interpretation:
- direct bnb 4-bit merge: adapter-loaded output not reproduced
- fresh bf16 base merge: adapter-loaded output reproduced
So this looks less like “the adapter is bad” and more like a direct bnb 4-bit merge path issue.
Related issue: PEFT #2321. That issue is close because it tracks the warning, but this local validation adds:
- cross-prompt output divergence;
- fresh bf16 merge parity control;
- current PEFT / Transformers / bnb stack confirmation.
I would file this only if maintainers want a separate PEFT issue. It should not be mixed into the original CPU/GPU split-offload issue.
Possible title:
Direct merge_and_unload into bitsandbytes 4-bit Linear4bit does not reproduce adapter-loaded output; fresh bf16 merge restores parity
Relevant docs are linked in the reference section below.
5. In-place dequantize is not the path I would recommend
I also tested the in-place dequantize path earlier:
```python
peft_model.dequantize()
peft_model.merge_and_unload()
```
In that lane:
- `dequantize()`: PASS
- `merge_and_unload()`: FAIL with `AttributeError: 'Parameter' object has no attribute 'quant_state'`
So I would not recommend presenting in-place dequantize as the clean solution.
The safer path, based on the local result, is:
1. reload the base in bf16/fp16
2. load the adapter there
3. merge there
4. validate output parity (a minimal comparison sketch follows below)
This is still resource-dependent. It worked on T4 for the small E2B validation, but that should not be generalized to all Gemma 4 sizes or longer generations.
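For the "validate output parity" step, this is the kind of comparison I mean. A minimal sketch, reusing the names from the bf16 lane above; the prompt list, `max_new_tokens`, and greedy decoding are my assumptions so that exact string comparison is meaningful:

```python
import torch
from transformers import AutoTokenizer

# Greedy decoding so exact string comparison of outputs is meaningful.
def generate(model, tokenizer, prompt: str, max_new_tokens: int = 64) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Capture adapter-loaded outputs *before* merge_and_unload(), since merging
# modifies the wrapped modules in place.
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)
prompts = ["Explain what a LoRA adapter is in one sentence."]  # placeholder prompts
adapter_texts = [generate(bf16_peft, tokenizer, p) for p in prompts]
merged_bf16 = bf16_peft.merge_and_unload()
merged_texts = [generate(merged_bf16, tokenizer, p) for p in prompts]
print(all(a == m for a, m in zip(adapter_texts, merged_texts)))
```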
Related Forum discussion: the HF Forum QLoRA merge / dequantization thread linked in section 7.
6. What I would file / not file
File or keep tracking
A. Primary original split/offload issue
- Target: huggingface/transformers
- Suggested title: "PeftModel.from_pretrained fails on CPU/GPU-dispatched bitsandbytes 4-bit Gemma4 during adapter load / dispatch hooks"
- Cross-link: huggingface/peft, huggingface/accelerate, bitsandbytes-foundation/bitsandbytes
B. Gemma4ClippableLinear
Use the existing issue (PEFT #3129) and add scanner/preflight evidence there if useful.
C. bnb CPU-offload wrong-device family
Use the existing issue and PR (PEFT #3169 and PEFT PR #3181).
D. direct 4-bit merge-output parity
Maybe a separate PEFT issue, if maintainers prefer:
Direct merge_and_unload into bitsandbytes 4-bit Linear4bit does not reproduce adapter-loaded output; fresh bf16 merge restores parity
Do not file as-is
I would avoid filing these broad claims:
- "CPU offload is broken"
- "PEFT + bnb 4-bit is broken"
- "QLoRA adapters are broken"
- "Gemma 4 adapters all fail"
- "merge_and_unload is generally broken"
Those claims are too broad and do not match the evidence.
7. Reference links
Original Forum context
- Original HF Forum thread: CPU offloading error scenario
- Earlier John6666-style issue draft / triage post
- HF Forum QLoRA merge / dequantization discussion
GitHub trackers
- PEFT #3129 — Add support for Gemma4ClippableLinear
- PEFT #3169 — LoRA + BnB INT8 + CPU offload wrong device
- PEFT PR #3181 — normalize output device for CPU-offloaded BnB layers
- PEFT #2321 — 4-bit Linear merge warning / different generations
- Transformers #43872 — `_is_hf_initialized` / Int8Params incompatibility
- Transformers #43873 — offloading not working as expected with quantization
- Transformers #45482 — Gemma4 cross-device CPU offload errors
- Transformers PR #45347 — Gemma4 device_map auto fix
- Transformers PR #45312 — Gemma4 KV-state sharing/cache fix
- Accelerate PR #3976 — Fix `_is_hf_initialized` attribute
Docs / model links
- PEFT LoRA API docs — `target_modules`
- Transformers bitsandbytes quantization docs
- Base model: unsloth/gemma-4-E2B-it
- Parity adapter: fulvian/gemma-4-e2b-medical-qlora-adapter
- Broad-target adapter: Ayodele01/gemma-4-E2B-Gemini-3.1-Pro-Reasoning-Distill
- Optional clean split-check adapter: welyjesch/filipino_Gemma4_E2B_FT_lora
Bottom line
My current interpretation is:
The original thread is still best treated as a split/offload PEFT adapter-loading problem on already-dispatched bnb 4-bit Gemma 4 models.
Some related pieces have improved upstream, especially _is_hf_initialized and Gemma4 device_map support, but that does not fully close the split/offload issue.
Gemma4ClippableLinear remains a separate PEFT target-module issue.
Direct merge into bnb 4-bit is a separate merge-output parity issue: in local v3.8 validation it still diverged from adapter-loaded output, while fresh bf16 base merge restored parity.
Small clarification
On the local validation labels used above:
When I wrote “v3.8 local validation,” that is just my local validation-bundle label. It is not an upstream release version, and I do not mean to imply that the exact notebook/package structure is important.
The only evidence I intended to carry forward from that run is the high-level classification:
- direct bnb 4-bit merge did not reproduce adapter-loaded output;
- fresh bf16-base merge did reproduce adapter-loaded output;
- broad Gemma 4 targets still hit `Gemma4ClippableLinear` and failed as an unsupported PEFT target;
- the optional split-dispatch lane was not re-run in that validation.
So for the original thread issue, I would still treat the existing CPU/GPU offload traces here as the main evidence. The local merge/parity check is a separate bucket.
If maintainers want a repro, I can extract a minimal script for whichever bucket is most useful:
- split/offload adapter-load failure;
- `Gemma4ClippableLinear` target-module failure;
- direct 4-bit merge-output parity divergence.