CPU offloading error scenario

There were all sorts of issues with various libraries and things got pretty tangled up :sweat_smile:, so I ran some local tests first to sort them out.


I did a bit more validation and I think it is useful to keep the current state split into separate buckets. This follows the same triage style as John6666’s earlier issue-draft post in this thread, but adds fresh local validation for the merge/parity bucket.

Short version

I would currently separate the situation into these issue families:

  1. Primary original issue from this thread
    PEFT adapter loading on an already CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 base still looks like its own split/offload problem.

  2. Already fixed or release-improved pieces
    Some _is_hf_initialized / bnb parameter-moving behavior appears fixed or improved upstream. Gemma 4 device_map="auto" support also appears improved in recent Transformers releases.

  3. Open PEFT target-module issue
    Gemma4ClippableLinear is still a real PEFT target-module blocker for broad-target Gemma 4 adapters.

  4. Separate local finding: direct 4-bit merge parity
    Direct merge_and_unload() into a bnb 4-bit base still did not reproduce adapter-loaded output in my latest local validation. Reloading the base in bf16, loading the same adapter, and merging there still restored output parity.

So I would not collapse all of these into one GitHub issue.


1. Primary original issue: split/offloaded bnb 4-bit Gemma 4 + PEFT adapter load

I still think the primary issue from this thread is best described as:

PEFT adapter loading fails on an already CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 base model.

The important contrast is:

  • all-GPU bnb 4-bit + PEFT: works, or at least can work
  • CPU/GPU split-dispatched bnb 4-bit + PEFT: fails in offload / dispatch / hook / quant-state paths
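
For concreteness, the contrast can be sketched as two load paths. This is a hedged sketch, not the exact repro: BASE_MODEL_ID / ADAPTER_ID are placeholders for the model and adapter from this thread, and the max_memory split is just one way to force CPU/GPU dispatch.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    # Required by Transformers once any module is dispatched to CPU.
    llm_int8_enable_fp32_cpu_offload=True,
)

# Lane A: all-GPU 4-bit base. Adapter load works, or at least can work.
gpu_base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID, quantization_config=bnb_cfg, device_map={"": 0}
)
gpu_peft = PeftModel.from_pretrained(gpu_base, ADAPTER_ID)

# Lane B: CPU/GPU split dispatch. This is where the adapter load fails.
split_base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    quantization_config=bnb_cfg,
    device_map="auto",
    max_memory={0: "4GiB", "cpu": "32GiB"},  # illustrative split
)
split_peft = PeftModel.from_pretrained(split_base, ADAPTER_ID)  # fails here
```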

I would not file this as simply:

CPU offload is broken

or:

PEFT + bitsandbytes is broken

Those are too broad. The all-GPU path can work.

A better primary issue title, if this is filed, would be something like:

PeftModel.from_pretrained fails on CPU/GPU-dispatched bitsandbytes 4-bit Gemma4 during adapter load / dispatch hooks

I would probably file this first under huggingface/transformers, while cross-linking PEFT, Accelerate, and bitsandbytes, because the failure crosses model integration, device_map/offload behavior, adapter loading, dispatch hooks, and bnb quantized state.

Related public trackers that seem relevant but not identical:


2. _is_hf_initialized looks partly fixed, but it is not the whole story

There is a closed Transformers issue for the _is_hf_initialized family:

There is also a merged Accelerate PR:

That PR is described as fixing issues when trying to move weights with bnb.

So I would classify _is_hf_initialized as:

fixed_or_release_improved_subproblem

But I would not say the original Forum issue is solved just because that subpath improved. The split/offload path still has other failure modes, especially meta tensor / dispatch / quant-state / cross-device behavior.


3. Gemma4ClippableLinear is still a separate current PEFT blocker

There is already an open PEFT issue for this:

I re-checked this in a v3.8 local validation run with current packages:

| package | version |
| --- | --- |
| torch | 2.10.0+cu128 |
| transformers | 5.6.2 |
| accelerate | 1.13.0 |
| peft | 0.19.1 |
| bitsandbytes | 0.49.2 |

GPU: Tesla T4

The broad-target adapter used for this check was:

The target scan found:

| item | count |
| --- | --- |
| Gemma4ClippableLinear hits | 148 |
| Linear4bit hits | 205 |
| total target matches | 353 |
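
For reference, counts like these can come from a small preflight scan over named_modules(). A minimal sketch, assuming base_4bit is the already-loaded 4-bit base and using the broad target list discussed below:

```python
from collections import Counter

# Broad LoRA target names, matching the adapter config in question.
TARGETS = {"q_proj", "k_proj", "v_proj", "o_proj",
           "gate_proj", "up_proj", "down_proj"}

# Count the module classes those target names actually resolve to.
hits = Counter(
    type(module).__name__
    for name, module in base_4bit.named_modules()
    if name.rsplit(".", 1)[-1] in TARGETS
)
print(hits)  # e.g. Counter({'Linear4bit': 205, 'Gemma4ClippableLinear': 148})
```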

Adapter load still failed with:

GEMMA4_CLIPPABLE_LINEAR_UNSUPPORTED

So the current classification from this lane is:

GEMMA4_CLIPPABLE_LINEAR_STILL_BLOCKS

This should stay separate from the split/offload issue and also separate from merge-output parity. It is an adapter-target / module-type compatibility issue. Broad targets such as q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj can hit Gemma 4 wrapper modules that current PEFT does not accept.

For reporting purposes, I would use PEFT #3129 as the main tracker and not create a duplicate unless maintainers want a separate report with scanner output.

Reference docs for why broad target_modules matter:


4. New local validation: direct bnb 4-bit merge still diverges

This is not the original CPU-offload issue. It is a separate PEFT / bitsandbytes / QLoRA workflow finding.

I re-ran the local parity check in v3.8 in two lanes: a direct 4-bit merge lane and a fresh bf16 base-merge lane.

Direct 4-bit merge lane

  • 4-bit base + adapter-loaded inference: adapter output differs from base output
  • direct merge_and_unload() into the bnb 4-bit base: merged output does not match adapter-loaded output

The v3.8 classification was:

DIRECT_4BIT_MERGE_STILL_DIVERGES

Prompt-level summary:

| classification | count |
| --- | --- |
| DIRECT_4BIT_MERGED_MATCHES_NEITHER | 2 |
| DIRECT_4BIT_MERGED_MATCHES_BASE | 1 |

Prompt table:

| prompt | classification | base != adapter | adapter == merged | base == merged |
| --- | --- | --- | --- | --- |
| p01_lora_short | DIRECT_4BIT_MERGED_MATCHES_NEITHER | True | False | False |
| p02_hypertension_short | DIRECT_4BIT_MERGED_MATCHES_BASE | True | False | True |
| p03_medical_tutor | DIRECT_4BIT_MERGED_MATCHES_NEITHER | True | False | False |

The direct 4-bit merge emitted 525 warnings:

Merge lora module to 4-bit linear may get different generations due to rounding errors.
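
For what it is worth, one way to arrive at a count like 525 is to record warnings around the merge call. A sketch, assuming peft_4bit is the adapter-loaded 4-bit model from this lane:

```python
import warnings

# Record the per-module merge warnings instead of letting them scroll by.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    merged_4bit = peft_4bit.merge_and_unload()

# Counts every warning raised during the merge; 525 in this run.
print(len(caught))
```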

Interpretation:

  • the adapter is active, because adapter-loaded output differs from the base output;
  • direct merge into the bnb 4-bit base does not reproduce adapter-loaded output;
  • in this run, the direct merged output was base-like for one prompt and a third, distinct output for the other two prompts.
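
The table columns above come from exact-match comparison of greedy decodes. A sketch of the check, where tokenizer and the three model handles are assumptions about the harness, not its exact code:

```python
def gen_text(model, prompt):
    # Greedy decoding so that exact string comparison is meaningful.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# In the real harness, base and adapter outputs are captured before the
# merge (merge_and_unload() mutates the wrapped model); shown inline here.
base_out    = gen_text(base_4bit, prompt)
adapter_out = gen_text(peft_4bit, prompt)
merged_out  = gen_text(merged_4bit, prompt)

print("base != adapter:  ", base_out != adapter_out)    # True: adapter active
print("adapter == merged:", adapter_out == merged_out)  # False: merge diverges
print("base == merged:   ", base_out == merged_out)
```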

Fresh bf16 base merge lane

Then I used a fresh non-quantized bf16 base:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Reload the base un-quantized in bf16, then attach and merge the adapter.
bf16_base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    device_map={"": 0},
    torch_dtype=torch.bfloat16,
)

bf16_peft = PeftModel.from_pretrained(bf16_base, ADAPTER_ID)
merged_bf16 = bf16_peft.merge_and_unload()
```

The v3.8 classification was:

FRESH_BF16_MERGE_STILL_PASSES

Prompt-level summary:

| classification | count |
| --- | --- |
| BF16_MERGED_MATCHES_BF16_ADAPTER | 3 |

Prompt table:

| prompt | classification | bf16 adapter == bf16 merged |
| --- | --- | --- |
| p01_lora_short | BF16_MERGED_MATCHES_BF16_ADAPTER | True |
| p02_hypertension_short | BF16_MERGED_MATCHES_BF16_ADAPTER | True |
| p03_medical_tutor | BF16_MERGED_MATCHES_BF16_ADAPTER | True |

The fresh bf16 merge emitted 0 merge warnings.

Interpretation:

  • direct bnb 4-bit merge: adapter-loaded output not reproduced
  • fresh bf16 base merge: adapter-loaded output reproduced

So this looks less like “the adapter is bad” and more like a direct bnb 4-bit merge path issue.
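
That reading also matches the mechanics: merging into a Linear4bit layer means dequantizing the NF4 weight, adding the LoRA delta, and requantizing, and that last step is lossy. A tiny self-contained illustration of the requantization error, with illustrative shapes and delta scale, not the actual Gemma weights:

```python
import torch
import bitsandbytes.functional as bnbF

W = torch.randn(256, 256, device="cuda", dtype=torch.float16)
delta = 0.01 * torch.randn_like(W)  # stand-in for the LoRA update (B @ A) * scaling

# Requantize the merged weight to NF4, then dequantize to see what a
# 4-bit layer can actually represent after the merge.
packed, state = bnbF.quantize_4bit(W + delta, quant_type="nf4")
W_representable = bnbF.dequantize_4bit(packed, quant_state=state)

# Nonzero: NF4 requantization cannot store the merged weight exactly.
print((W_representable - (W + delta)).abs().max())
```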

Related issue:

That issue is the closest match because it tracks the same rounding-error warning, but this local validation adds:

  • cross-prompt output divergence;
  • fresh bf16 merge parity control;
  • current PEFT / Transformers / bnb stack confirmation.

I would file this only if maintainers want a separate PEFT issue. It should not be mixed into the original CPU/GPU split-offload issue.

Possible title:

Direct merge_and_unload into bitsandbytes 4-bit Linear4bit does not reproduce adapter-loaded output; fresh bf16 merge restores parity

Relevant docs:


5. In-place dequantize is not the path I would recommend

I also tested the in-place dequantize path earlier:

```python
# In-place dequantize of the already-loaded 4-bit PEFT model, then merge.
peft_model.dequantize()
peft_model.merge_and_unload()
```

In that lane:

  • dequantize(): PASS
  • merge_and_unload(): FAIL

AttributeError: 'Parameter' object has no attribute 'quant_state'

So I would not recommend presenting in-place dequantize as the clean solution. The error pattern suggests that after dequantize() the weights are plain parameters without bnb quant state, while the 4-bit merge path still expects weight.quant_state.

The safer path, based on the local result, is:

  1. reload the base in bf16/fp16
  2. load the adapter there
  3. merge there
  4. validate output parity (see the sketch below)

This is still resource-dependent. It worked on T4 for the small E2B validation, but that should not be generalized to all Gemma 4 sizes or longer generations.
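
Putting the four steps together, a compact sketch of that path, with the same placeholder BASE_MODEL_ID / ADAPTER_ID as above; the prompt and output directory are hypothetical:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)
prompt = "Explain LoRA in one sentence."  # illustrative validation prompt

def gen(model):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Steps 1-3: reload the base in bf16, attach the adapter, merge.
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID, device_map={"": 0}, torch_dtype=torch.bfloat16
)
peft_model = PeftModel.from_pretrained(base, ADAPTER_ID)
adapter_out = gen(peft_model)  # capture before merging
merged = peft_model.merge_and_unload()

# Step 4: validate output parity, then persist the merged model.
assert gen(merged) == adapter_out
merged.save_pretrained("gemma4-merged-bf16")     # hypothetical output dir
tokenizer.save_pretrained("gemma4-merged-bf16")
```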

Related Forum discussion:


6. What I would file / not file

File or keep tracking

A. Primary original split/offload issue

Target:

  • huggingface/transformers

Suggested title:

PeftModel.from_pretrained fails on CPU/GPU-dispatched bitsandbytes 4-bit Gemma4 during adapter load / dispatch hooks

Cross-link:

  • huggingface/peft
  • huggingface/accelerate
  • bitsandbytes-foundation/bitsandbytes

B. Gemma4ClippableLinear

Use existing issue:

Add scanner/preflight evidence there if useful.

C. bnb CPU-offload wrong-device family

Use existing issue / PR:

D. direct 4-bit merge-output parity

Maybe a separate PEFT issue, if maintainers prefer:

Direct merge_and_unload into bitsandbytes 4-bit Linear4bit does not reproduce adapter-loaded output; fresh bf16 merge restores parity

Do not file as-is

I would avoid filing these broad claims:

  • CPU offload is broken
  • PEFT + bnb 4-bit is broken
  • QLoRA adapters are broken
  • Gemma 4 adapters all fail
  • merge_and_unload is generally broken

Those claims are too broad and do not match the evidence.


7. Reference links

Original Forum context

GitHub trackers

Docs / model links


Bottom line

My current interpretation is:

The original thread is still best treated as a split/offload PEFT adapter-loading problem on already-dispatched bnb 4-bit Gemma 4 models.

Some related pieces have improved upstream, especially _is_hf_initialized and Gemma4 device_map support, but that does not fully close the split/offload issue.

Gemma4ClippableLinear remains a separate PEFT target-module issue.

Direct merge into bnb 4-bit is a separate merge-output parity issue: in local v3.8 validation it still diverged from adapter-loaded output, while fresh bf16 base merge restored parity.

Small clarification

To be explicit about the local validation labels used above:

When I wrote “v3.8 local validation,” that is just my local validation-bundle label. It is not an upstream release version, and I do not mean to imply that the exact notebook/package structure is important.

The only evidence I intended to carry forward from that run is the set of high-level classifications:

  • direct bnb 4-bit merge did not reproduce adapter-loaded output;
  • fresh bf16-base merge did reproduce adapter-loaded output;
  • broad Gemma 4 targets still hit Gemma4ClippableLinear and failed as an unsupported PEFT target;
  • the optional split-dispatch lane was not re-run in that validation.

So for the original thread issue, I would still treat the existing CPU/GPU offload traces here as the main evidence. The local merge/parity check is a separate bucket.

If maintainers want a repro, I can extract a minimal script for whichever bucket is most useful:

  1. split/offload adapter-load failure;
  2. Gemma4ClippableLinear target-module failure;
  3. direct 4-bit merge-output parity divergence.