CPU offloading error scenario

There were all sorts of issues with various libraries and things got pretty tangled up :sweat_smile:, so I ran some local tests first to sort them out.


I did a bit more validation and I think it is useful to keep the current state split into separate buckets. This follows the same triage style as John6666’s earlier issue-draft post in this thread, but adds fresh local validation for the merge/parity bucket.

Short version

I would currently separate the situation into these issue families:

  1. Primary original issue from this thread
    PEFT adapter loading on an already CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 base still looks like its own split/offload problem.

  2. Already fixed or release-improved pieces
    Some _is_hf_initialized / bnb parameter-moving behavior appears fixed or improved upstream. Gemma 4 device_map="auto" support also appears improved in recent Transformers releases.

  3. Open PEFT target-module issue
    Gemma4ClippableLinear is still a real PEFT target-module blocker for broad-target Gemma 4 adapters.

  4. Separate local finding: direct 4-bit merge parity
    Direct merge_and_unload() into a bnb 4-bit base still did not reproduce adapter-loaded output in my latest local validation. Reloading the base in bf16, loading the same adapter, and merging there still restored output parity.

So I would not collapse all of these into one GitHub issue.


1. Primary original issue: split/offloaded bnb 4-bit Gemma 4 + PEFT adapter load

I still think the primary issue from this thread is best described as:

PEFT adapter loading fails on an already CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 base model.

The important contrast is:

  • all-GPU bnb 4-bit + PEFT: works, or at least can work
  • CPU/GPU split-dispatched bnb 4-bit + PEFT: fails in offload / dispatch / hook / quant-state paths
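
For concreteness, the contrast can be sketched as two load paths. This is a hedged sketch, not the exact repro: BASE_MODEL_ID / ADAPTER_ID are placeholders for the model and adapter from this thread, and the max_memory split is just one way to force CPU/GPU dispatch.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    # Required by Transformers once any module is dispatched to CPU.
    llm_int8_enable_fp32_cpu_offload=True,
)

# Lane A: all-GPU 4-bit base. Adapter load works, or at least can work.
gpu_base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID, quantization_config=bnb_cfg, device_map={"": 0}
)
gpu_peft = PeftModel.from_pretrained(gpu_base, ADAPTER_ID)

# Lane B: CPU/GPU split dispatch. This is where the adapter load fails.
split_base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    quantization_config=bnb_cfg,
    device_map="auto",
    max_memory={0: "4GiB", "cpu": "32GiB"},  # illustrative split
)
split_peft = PeftModel.from_pretrained(split_base, ADAPTER_ID)  # fails here
```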

I would not file this as simply:

CPU offload is broken

or:

PEFT + bitsandbytes is broken

Those are too broad. The all-GPU path can work.

A better primary issue title, if this is filed, would be something like:

PeftModel.from_pretrained fails on CPU/GPU-dispatched bitsandbytes 4-bit Gemma4 during adapter load / dispatch hooks

I would probably file this first under huggingface/transformers, while cross-linking PEFT, Accelerate, and bitsandbytes, because the failure crosses model integration, device_map/offload behavior, adapter loading, dispatch hooks, and bnb quantized state.

Related public trackers that seem relevant but not identical:


2. _is_hf_initialized looks partly fixed, but it is not the whole story

There is a closed Transformers issue for the _is_hf_initialized family:

There is also a merged Accelerate PR:

That PR is described as fixing issues when trying to move weights with bnb.

So I would classify _is_hf_initialized as:

fixed_or_release_improved_subproblem

But I would not say the original Forum issue is solved just because that subpath improved. The split/offload path still has other failure modes, especially meta tensor / dispatch / quant-state / cross-device behavior.


3. Gemma4ClippableLinear is still a separate current PEFT blocker

There is already an open PEFT issue for this:

I re-checked this in a v3.8 local validation run with current packages:

| package | version |
| --- | --- |
| torch | 2.10.0+cu128 |
| transformers | 5.6.2 |
| accelerate | 1.13.0 |
| peft | 0.19.1 |
| bitsandbytes | 0.49.2 |

GPU: Tesla T4

The broad-target adapter used for this check was:

The target scan found:

| item | count |
| --- | --- |
| Gemma4ClippableLinear hits | 148 |
| Linear4bit hits | 205 |
| total target matches | 353 |
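
For reference, counts like these can come from a small preflight scan over named_modules(). A minimal sketch, assuming base_4bit is the already-loaded 4-bit base and using the broad target list discussed below:

```python
from collections import Counter

# Broad LoRA target names, matching the adapter config in question.
TARGETS = {"q_proj", "k_proj", "v_proj", "o_proj",
           "gate_proj", "up_proj", "down_proj"}

# Count the module classes those target names actually resolve to.
hits = Counter(
    type(module).__name__
    for name, module in base_4bit.named_modules()
    if name.rsplit(".", 1)[-1] in TARGETS
)
print(hits)  # e.g. Counter({'Linear4bit': 205, 'Gemma4ClippableLinear': 148})
```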

Adapter load still failed with:

GEMMA4_CLIPPABLE_LINEAR_UNSUPPORTED

So the current classification from this lane is:

GEMMA4_CLIPPABLE_LINEAR_STILL_BLOCKS

This should stay separate from the split/offload issue and also separate from merge-output parity. It is an adapter-target / module-type compatibility issue. Broad targets such as q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj can hit Gemma 4 wrapper modules that current PEFT does not accept.

For reporting purposes, I would use PEFT #3129 as the main tracker and not create a duplicate unless maintainers want a separate report with scanner output.

Reference docs for why broad target_modules matter:


4. New local validation: direct bnb 4-bit merge still diverges

This is not the original CPU-offload issue. It is a separate PEFT / bitsandbytes / QLoRA workflow finding.

I re-ran the local parity check in v3.8 in two lanes: a direct 4-bit merge lane and a fresh bf16 base-merge lane.

Direct 4-bit merge lane

  • 4-bit base + adapter-loaded inference: adapter output differs from base output
  • direct merge_and_unload() into the bnb 4-bit base: merged output does not match adapter-loaded output

The v3.8 classification was:

DIRECT_4BIT_MERGE_STILL_DIVERGES

Prompt-level summary:

| classification | count |
| --- | --- |
| DIRECT_4BIT_MERGED_MATCHES_NEITHER | 2 |
| DIRECT_4BIT_MERGED_MATCHES_BASE | 1 |

Prompt table:

| prompt | classification | base != adapter | adapter == merged | base == merged |
| --- | --- | --- | --- | --- |
| p01_lora_short | DIRECT_4BIT_MERGED_MATCHES_NEITHER | True | False | False |
| p02_hypertension_short | DIRECT_4BIT_MERGED_MATCHES_BASE | True | False | True |
| p03_medical_tutor | DIRECT_4BIT_MERGED_MATCHES_NEITHER | True | False | False |

The direct 4-bit merge emitted 525 warnings:

Merge lora module to 4-bit linear may get different generations due to rounding errors.
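
For what it is worth, one way to arrive at a count like 525 is to record warnings around the merge call. A sketch, assuming peft_4bit is the adapter-loaded 4-bit model from this lane:

```python
import warnings

# Record the per-module merge warnings instead of letting them scroll by.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    merged_4bit = peft_4bit.merge_and_unload()

# Counts every warning raised during the merge; 525 in this run.
print(len(caught))
```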

Interpretation:

  • the adapter is active, because adapter-loaded output differs from the base output;
  • direct merge into the bnb 4-bit base does not reproduce adapter-loaded output;
  • in this run, the direct merged output was base-like for one prompt and a third, distinct output for the other two prompts.
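
The table columns above come from exact-match comparison of greedy decodes. A sketch of the check, where tokenizer and the three model handles are assumptions about the harness, not its exact code:

```python
def gen_text(model, prompt):
    # Greedy decoding so that exact string comparison is meaningful.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# In the real harness, base and adapter outputs are captured before the
# merge (merge_and_unload() mutates the wrapped model); shown inline here.
base_out    = gen_text(base_4bit, prompt)
adapter_out = gen_text(peft_4bit, prompt)
merged_out  = gen_text(merged_4bit, prompt)

print("base != adapter:  ", base_out != adapter_out)    # True: adapter active
print("adapter == merged:", adapter_out == merged_out)  # False: merge diverges
print("base == merged:   ", base_out == merged_out)
```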

Fresh bf16 base merge lane

Then I used a fresh non-quantized bf16 base:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Reload the base un-quantized in bf16, then attach and merge the adapter.
bf16_base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    device_map={"": 0},
    torch_dtype=torch.bfloat16,
)

bf16_peft = PeftModel.from_pretrained(bf16_base, ADAPTER_ID)
merged_bf16 = bf16_peft.merge_and_unload()
```

The v3.8 classification was:

FRESH_BF16_MERGE_STILL_PASSES

Prompt-level summary:

| classification | count |
| --- | --- |
| BF16_MERGED_MATCHES_BF16_ADAPTER | 3 |

Prompt table:

| prompt | classification | bf16 adapter == bf16 merged |
| --- | --- | --- |
| p01_lora_short | BF16_MERGED_MATCHES_BF16_ADAPTER | True |
| p02_hypertension_short | BF16_MERGED_MATCHES_BF16_ADAPTER | True |
| p03_medical_tutor | BF16_MERGED_MATCHES_BF16_ADAPTER | True |

The fresh bf16 merge emitted 0 merge warnings.

Interpretation:

  • direct bnb 4-bit merge: adapter-loaded output not reproduced
  • fresh bf16 base merge: adapter-loaded output reproduced

So this looks less like “the adapter is bad” and more like a direct bnb 4-bit merge path issue.
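
That reading also matches the mechanics: merging into a Linear4bit layer means dequantizing the NF4 weight, adding the LoRA delta, and requantizing, and that last step is lossy. A tiny self-contained illustration of the requantization error, with illustrative shapes and delta scale, not the actual Gemma weights:

```python
import torch
import bitsandbytes.functional as bnbF

W = torch.randn(256, 256, device="cuda", dtype=torch.float16)
delta = 0.01 * torch.randn_like(W)  # stand-in for the LoRA update (B @ A) * scaling

# Requantize the merged weight to NF4, then dequantize to see what a
# 4-bit layer can actually represent after the merge.
packed, state = bnbF.quantize_4bit(W + delta, quant_type="nf4")
W_representable = bnbF.dequantize_4bit(packed, quant_state=state)

# Nonzero: NF4 requantization cannot store the merged weight exactly.
print((W_representable - (W + delta)).abs().max())
```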

Related issue:

That issue is the closest match because it tracks the same rounding-error warning, but this local validation adds:

  • cross-prompt output divergence;
  • fresh bf16 merge parity control;
  • current PEFT / Transformers / bnb stack confirmation.

I would file this only if maintainers want a separate PEFT issue. It should not be mixed into the original CPU/GPU split-offload issue.

Possible title:

Direct merge_and_unload into bitsandbytes 4-bit Linear4bit does not reproduce adapter-loaded output; fresh bf16 merge restores parity

Relevant docs:


5. In-place dequantize is not the path I would recommend

I also tested the in-place dequantize path earlier:

```python
# In-place dequantize of the already-loaded 4-bit PEFT model, then merge.
peft_model.dequantize()
peft_model.merge_and_unload()
```

In that lane:

  • dequantize(): PASS
  • merge_and_unload(): FAIL

AttributeError: 'Parameter' object has no attribute 'quant_state'

So I would not recommend presenting in-place dequantize as the clean solution. The error pattern suggests that after dequantize() the weights are plain parameters without bnb quant state, while the 4-bit merge path still expects weight.quant_state.

The safer path, based on the local result, is:

  1. reload the base in bf16/fp16
  2. load the adapter there
  3. merge there
  4. validate output parity (see the sketch below)

This is still resource-dependent. It worked on T4 for the small E2B validation, but that should not be generalized to all Gemma 4 sizes or longer generations.
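
Putting the four steps together, a compact sketch of that path, with the same placeholder BASE_MODEL_ID / ADAPTER_ID as above; the prompt and output directory are hypothetical:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)
prompt = "Explain LoRA in one sentence."  # illustrative validation prompt

def gen(model):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Steps 1-3: reload the base in bf16, attach the adapter, merge.
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID, device_map={"": 0}, torch_dtype=torch.bfloat16
)
peft_model = PeftModel.from_pretrained(base, ADAPTER_ID)
adapter_out = gen(peft_model)  # capture before merging
merged = peft_model.merge_and_unload()

# Step 4: validate output parity, then persist the merged model.
assert gen(merged) == adapter_out
merged.save_pretrained("gemma4-merged-bf16")     # hypothetical output dir
tokenizer.save_pretrained("gemma4-merged-bf16")
```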

Related Forum discussion:


6. What I would file / not file

File or keep tracking

A. Primary original split/offload issue

Target:

  • huggingface/transformers

Suggested title:

PeftModel.from_pretrained fails on CPU/GPU-dispatched bitsandbytes 4-bit Gemma4 during adapter load / dispatch hooks

Cross-link:

  • huggingface/peft
  • huggingface/accelerate
  • bitsandbytes-foundation/bitsandbytes

B. Gemma4ClippableLinear

Use existing issue:

Add scanner/preflight evidence there if useful.

C. bnb CPU-offload wrong-device family

Use existing issue / PR:

D. direct 4-bit merge-output parity

Maybe a separate PEFT issue, if maintainers prefer:

Direct merge_and_unload into bitsandbytes 4-bit Linear4bit does not reproduce adapter-loaded output; fresh bf16 merge restores parity

Do not file as-is

I would avoid filing these broad claims:

  • CPU offload is broken
  • PEFT + bnb 4-bit is broken
  • QLoRA adapters are broken
  • Gemma 4 adapters all fail
  • merge_and_unload is generally broken

Those claims are too broad and do not match the evidence.


7. Reference links

Original Forum context

GitHub trackers

Docs / model links


Bottom line

My current interpretation is:

The original thread is still best treated as a split/offload PEFT adapter-loading problem on already-dispatched bnb 4-bit Gemma 4 models.

Some related pieces have improved upstream, especially _is_hf_initialized and Gemma4 device_map support, but that does not fully close the split/offload issue.

Gemma4ClippableLinear remains a separate PEFT target-module issue.

Direct merge into bnb 4-bit is a separate merge-output parity issue: in local v3.8 validation it still diverged from adapter-loaded output, while fresh bf16 base merge restored parity.

Small clarification

To be explicit about the local validation labels used above:

When I wrote “v3.8 local validation,” that is just my local validation-bundle label. It is not an upstream release version, and I do not mean to imply that the exact notebook/package structure is important.

The only evidence I intended to carry forward from that run is the set of high-level classifications:

  • direct bnb 4-bit merge did not reproduce adapter-loaded output;
  • fresh bf16-base merge did reproduce adapter-loaded output;
  • broad Gemma 4 targets still hit Gemma4ClippableLinear and failed as an unsupported PEFT target;
  • the optional split-dispatch lane was not re-run in that validation.

So for the original thread issue, I would still treat the existing CPU/GPU offload traces here as the main evidence. The local merge/parity check is a separate bucket.

If maintainers want a repro, I can extract a minimal script for whichever bucket is most useful:

  1. split/offload adapter-load failure;
  2. Gemma4ClippableLinear target-module failure;
  3. direct 4-bit merge-output parity divergence.