Multi-image edit (3 refs): artifacts at true CFG, fine on Lightning — reference-content dependent

Setup

Qwen-Image-Edit-2511 BF16 via ComfyUI
TextEncodeQwenImageEditPlus with 3 reference images (face close-up + body front + body back)
Output 1024×1536 (non-square 2:3)
Sampler: res_3m + bong_tangent (RES4LYF)

Behavior
With full CFG (2.7, 33 steps): generation reliably breaks on some reference sets — mixed artifacts (identity drift, color/texture corruption, anatomy distortion). Same parameters and seeds produce clean output on other reference sets.
With Lightning 4-step (true_cfg=1): every reference set is clean.

Pattern

1 reference (face only) → always clean, both modes
3 references → clean on some characters, broken on others — content-dependent
All references are Z-Image Turbo outputs, same prompt structure, identical dimensions
Failing sets tend to contain high-frequency content (curly hair, darker skin texture); working sets tend to be lower-frequency (straight hair, lighter skin). To be clear: this is about the rendered references, not the character identity itself.

What I’ve tried (no fix, or partial only)

I tried a lot of things that seemed plausible (I no longer remember the full list), and none of them fixed it. The workflow is below.

https://codeshare.io/5eLXqK

Question
Is this a known interaction between multi-ref token packing and the true-CFG noise_pred * (cond_norm / noise_norm) rescale path? Specifically:

1. Does Qwen2.5-VL’s 384² downscale produce per-token norm outliers on high-frequency reference content that get amplified across denoising steps once true CFG is active?
2. Is multi-image reference (3+ refs) currently only stable at distilled-CFG / Lightning, or is there a recommended setup for full CFG?

Hmm… complicated…


Short version

I would treat this as a real, repeatable interaction between multi-reference conditioning and full true-CFG, but I would not call it a confirmed public “Qwen2.5-VL 384² token-norm outlier bug.”

The best working diagnosis is:

Some 3-reference packs create an unstable conditional prediction. Full true-CFG repeatedly pushes the denoising trajectory along that unstable cond - neg direction. The norm rescale can keep the prediction magnitude bounded, but it cannot guarantee that the update direction is semantically safe. Lightning with true_cfg=1 avoids that failure path, so the same references can look clean.

In this setup, the failure is probably not one single thing. It is likely the intersection of:

  • 3 image references with different roles: face close-up + front body + back body.
  • High-frequency synthetic reference content, especially dense curls, skin microtexture, fabric/texture detail, or turbo-model sharpening.
  • Qwen-Image-Edit’s dual image-conditioning design, where the input image is routed through both Qwen2.5-VL semantic conditioning and VAE appearance conditioning.
  • Full true-CFG, which uses the conditional/negative prediction difference and then applies a norm-ratio rescale.
  • ComfyUI preprocessing / latent geometry, including possible hidden resize or mismatch between reference conditioning geometry and KSampler latent geometry.
  • Custom sampler/scheduler behavior from res_3m + bong_tangent.

I would not conclude that 3+ image references are “only stable on Lightning.” Full multi-image Qwen edit can work. But this exact corner — 3 refs + high-frequency synthetic refs + full true-CFG + non-square output + custom sampler — is fragile enough that I would not run it as a one-shot full-CFG workflow without reducing ambiguity first.


Why this failure pattern is meaningful

The key pattern is:

1 face reference only:
  clean in full CFG
  clean in Lightning

3 references:
  clean for some character/reference sets
  broken for other character/reference sets

same parameters and seeds:
  clean or broken depending on reference content

Lightning 4-step, true_cfg=1:
  clean for every reference set

That pattern strongly argues against a simple “bad seed” or “bad prompt” explanation.

If it were purely random sampling, you would expect less consistent dependence on the reference set. If it were purely output resolution, BF16, or the model checkpoint, you would expect the breakage to be less dependent on which character/reference pack is used. If it were purely the sampler, one reference should also be more fragile.

Instead, the most useful interpretation is:

reference content
→ unstable multi-image conditioning
→ full true-CFG amplifies it
→ artifact appears over denoising steps

The reference-content dependence is the important clue.


What is known from the model design

Qwen-Image-Edit does not use the input image in only one way. The model card says the input image is fed into Qwen2.5-VL for semantic control and into the VAE encoder for visual appearance control.

Source: Qwen/Qwen-Image-Edit model card

That matters because the artifact can originate in either path:

Channel | What it controls | How it can fail
Qwen2.5-VL semantic path | identity meaning, object roles, face/body interpretation, picture-to-picture binding | identity drift, wrong reference role, subject blending, face/body confusion
VAE / reference-latent path | color, texture, local visual detail, clothing material, skin/hair texture | texture corruption, color bleed, hair/skin over-detail, local anatomy deformation

Your symptoms span both channels:

  • identity drift → semantic/reference-binding instability.
  • color/texture corruption → appearance/reference-latent instability.
  • anatomy distortion → reference-role confusion plus guidance/sampler amplification.

That is why “just improve the prompt” is usually not enough. Prompt clarity helps, but the model is also consuming multiple visual encodings and reference latents.


What is known from Qwen true-CFG

Diffusers’ Qwen docs distinguish normal guidance_scale from real Qwen classifier-free guidance. In the Qwen pipeline, true CFG is enabled with true_cfg_scale plus a negative_prompt; even an empty negative prompt can activate the branch.

Source: Diffusers QwenImage docs

The Qwen edit pipeline source says true CFG is enabled when true_cfg_scale > 1 and a negative prompt is provided. It also says higher guidance links the image more closely to the prompt, usually at the cost of lower image quality.

Source: Diffusers Qwen Image Edit pipeline source

A Qwen-Image-Edit-Plus pipeline copy shows the relevant true-CFG calculation:

Source: Qwen Image Edit Plus pipeline copy

The important part is essentially:

comb_pred = neg_noise_pred + true_cfg_scale * (noise_pred - neg_noise_pred)

cond_norm = torch.norm(noise_pred, dim=-1, keepdim=True)
noise_norm = torch.norm(comb_pred, dim=-1, keepdim=True)

noise_pred = comb_pred * (cond_norm / noise_norm)

The norm rescale is easy to overinterpret. It can keep the combined prediction’s magnitude near the conditional prediction’s magnitude, but it does not prove that the direction is safe.

In plain language:

right-sized vector
does not necessarily mean
right semantic direction

So if the 3-reference conditional prediction is already unstable, true-CFG can repeatedly push the trajectory in a bad direction while the norm rescale still appears mathematically “reasonable.”
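This point can be demonstrated with a toy sketch: random tensors stand in for the real predictions (shapes are arbitrary), and the math mirrors the rescale formula above, not the actual pipeline.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins for the pipeline's conditional and negative predictions.
noise_pred = torch.randn(1, 16, 64)
neg_noise_pred = torch.randn(1, 16, 64)
true_cfg_scale = 2.7

# Same math as the snippet above.
comb_pred = neg_noise_pred + true_cfg_scale * (noise_pred - neg_noise_pred)
cond_norm = torch.norm(noise_pred, dim=-1, keepdim=True)
noise_norm = torch.norm(comb_pred, dim=-1, keepdim=True)
rescaled = comb_pred * (cond_norm / noise_norm)

# Magnitude check: the rescaled prediction matches the conditional norm exactly.
mag_ratio = (torch.norm(rescaled, dim=-1, keepdim=True) / cond_norm).mean()

# Direction check: cosine to the conditional prediction stays well below 1.
direction_cos = torch.nn.functional.cosine_similarity(
    rescaled, noise_pred, dim=-1
).mean()

print(f"magnitude ratio: {mag_ratio.item():.4f}, "
      f"direction cosine: {direction_cos.item():.4f}")
```

The magnitude ratio comes out at 1.0 while the direction cosine does not: a "right-sized" vector that is still pointing somewhere else.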


Why Lightning being clean does not disprove the full-CFG issue

Lightning is not simply “the same full model but fewer steps.” The Lightning card describes step distillation that reduces standard inference to 4 steps and gives a large speedup compared with standard 40-step inference.

Source: lightx2v/Qwen-Image-Edit-2511-Lightning

So this comparison:

Full BF16:
  true_cfg = 2.7
  33 steps
  artifacts on some 3-ref packs

Lightning:
  true_cfg = 1
  4 steps
  clean on all 3-ref packs

should be interpreted as:

long full-guidance trajectory:
  fragile

short distilled / no-true-CFG trajectory:
  robust

It should not be interpreted as:

the reference pack is universally safe

The Lightning result is useful because it says the references contain enough usable information to make a clean image. But it does not prove that full true-CFG can use that same information stably.


Is the 384² Qwen2.5-VL downscale the root cause?

Possible, but not proven.

A more careful statement is:

High-frequency rendered references can plausibly produce unstable visual-token or reference-latent conditioning after resizing/downsampling. That instability can then appear downstream as a larger conditional-vs-negative prediction difference during full true-CFG. But I would not claim that the root cause is specifically Qwen2.5-VL per-token norm outliers from 384² resizing unless tensor logging confirms it.

Why the suspicion is technically reasonable:

Qwen2-VL-style image preprocessing uses smart_resize, with dimensions made divisible by a factor tied to patch/merge behavior. The source shows defaults such as patch_size=14, merge_size=2, and a resize factor of 28.

Source: Qwen2-VL image processor source (smart_resize)

That makes this diagnostic worth testing:

384 / 28 = 13.714...
392 / 28 = 14

So if a node exposes target_vl_size, testing 392 instead of 384 is useful. It does not prove the theory, but it removes one avoidable grid-alignment variable.
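The rounding rule itself is trivial to reproduce. A minimal sketch, assuming smart_resize's round-to-nearest-factor behavior with factor = patch_size × merge_size = 28:

```python
def round_to_vl_factor(x: int, factor: int = 28) -> int:
    """Snap a side length to the nearest multiple of the VL patch grid."""
    return max(factor, round(x / factor) * factor)

print(round_to_vl_factor(384))  # 392: 384 is not grid-aligned, so it moves
print(round_to_vl_factor(392))  # 392: already aligned, unchanged
```

So a node configured for 384 is asking the preprocessor to silently resize again, which is exactly the kind of hidden variable worth removing.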

The high-frequency-content hypothesis has three possible locations:

Suspect | Meaning | Test
VL token path | resized/patchified semantic image tokens become unstable for dense curls, skin texture, or sharp synthetic detail | smooth only the VL input; keep VAE/ref input original
VAE/reference-latent path | appearance latents over-inject high-frequency texture | smooth only the VAE/ref input; keep VL input original
CFG path | full true-CFG amplifies an unstable conditional prediction | sweep true_cfg from 1.0 to 2.7

The current evidence proves content-dependent instability. It does not yet prove exactly where inside the stack that instability begins.


Are 3+ references only stable on Lightning?

No, not generally.

Qwen-Image-Edit-2511 is explicitly presented as improving character consistency, and Qwen/Diffusers-style edit pipelines support image-conditioned editing. The issue is not “multi-image references are impossible.” The issue is that your exact setup is a high-risk corner.

Source: Qwen/Qwen-Image-Edit-2511 model card

The fragile combination is:

3 references
face + front body + back body
synthetic high-frequency rendered references
BF16 full model
true_cfg = 2.7
33 denoising steps
1024x1536 output
RES4LYF res_3m + bong_tangent

So the better answer is:

3+ references are not Lightning-only, but this exact 3-ref/full-CFG/custom-sampler setup should be treated as fragile. Use Lightning for first-pass composition, then use lower-CFG full BF16 refinement with fewer or weaker references.


The biggest practical change: stop treating all 3 references equally

The three images have different jobs.

Reference | Correct role | Wrong role to avoid
Face close-up | identity, face structure, hairline, expression, age impression | full outfit geometry, back clothing
Body front | front outfit, body proportions, front silhouette, color placement | face identity
Body back | rear clothing, back silhouette, hair length from behind | face identity, front anatomy, skin texture source

The back reference is especially dangerous because it can contain strong hair/body/clothing cues without a face identity anchor. If it participates fully in the VAE/reference-latent path, it can inject body/texture information that competes with the face and front-body references.

If the node supports separate semantic/reference participation, test:

Ref | VL semantic path | VAE/ref-latent path
Face | on | on
Front body | on | on
Back body | on | off initially

In plain language:

Use the back reference as semantic guidance first.
Do not let it be a full appearance/reference-latent source unless needed.

Only enable the back reference as a full VAE/reference latent if the final output is a back-view image or if rear outfit construction is essential.


Prompt template I would use

The official Qwen-Image-Edit-2511 app prompt guidance says multi-image prompts should clearly specify which image’s element is being modified.

Source: Qwen/Qwen-Image-Edit-2511 app.py prompt guidance

For a front-facing or general full-body portrait, I would use a prompt like this:

Use the references with strict roles.

Picture 1 is the identity reference. Preserve the same face identity, facial structure, age impression, hairline, and overall character identity from Picture 1.

Picture 2 is the front body and outfit reference. Use it for body proportions, front silhouette, clothing shape, front-view outfit details, and color placement.

Picture 3 is the back outfit reference only. Use it only for back-side clothing construction, rear silhouette, and hair length visible from behind. Do not use Picture 3 to change the face, facial identity, skin texture, expression, or front-facing anatomy.

Generate one coherent person in a clean full-body 2:3 portrait. Do not blend identities. Do not average the face across references. Keep natural anatomy, stable skin texture, stable hair texture, and consistent clothing material. Do not copy rear-view anatomy into the front view.

For a back-view output, change the roles:

Use the references with strict roles.

Picture 1 is the identity and hair reference. Preserve the same character identity and overall hair type from Picture 1, but do not invent a visible face because the final image is a back view.

Picture 2 is the front outfit reference. Use it only for consistent clothing design, material, and color placement.

Picture 3 is the back outfit reference. Use it as the primary source for the rear silhouette, back-side clothing construction, hair length from behind, and rear material layout.

Generate a clean full-body back-view 2:3 portrait of one coherent person. Keep the outfit consistent across front and back references. Keep anatomy natural. Do not create extra limbs, duplicate hair masses, face fragments, or mixed front/back body structure.

The point is not literary quality. The point is to reduce reference-role ambiguity.


Reference preprocessing I would apply

Because the failing sets are high-frequency rendered references, I would preprocess the references before changing more sampler/model knobs.

The goal is not to change identity. The goal is to reduce unstable synthetic microtexture.

Face reference

Operation | Strength | Reason
crop to face/head/upper shoulders | strong | remove irrelevant body/background tokens
remove or simplify busy background | strong | reduce unrelated visual tokens
mild denoise | low | remove synthetic turbo grain
mild de-sharpen / reduce local contrast | low | reduce patch-level hair/skin spikes
preserve face identity/color | strict | avoid changing identity

Front body reference

Operation | Strength | Reason
clean full-body crop | strong | keep body/outfit information
simplify background | medium/strong | reduce irrelevant reference detail
mild de-sharpen | low | reduce texture overbinding
preserve clothing color layout | strict | this is the outfit source

Back body reference

Operation | Strength | Reason
clean back-body crop | strong | keep only rear silhouette/outfit
simplify background | strong | reduce irrelevant tokens
mild denoise / de-sharpen | medium | this reference is high-risk
avoid full VAE/ref path initially | strong | prevent appearance over-injection
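A minimal sketch of the "mild denoise / de-sharpen" pass, assuming Pillow; `blur_radius` and `sharpness` are illustrative starting values, not tuned constants.

```python
from PIL import Image, ImageEnhance, ImageFilter

def soften_reference(img: Image.Image, blur_radius: float = 0.6,
                     sharpness: float = 0.85) -> Image.Image:
    """Slight Gaussian blur, then sharpness below 1.0 to damp microtexture."""
    out = img.filter(ImageFilter.GaussianBlur(radius=blur_radius))
    return ImageEnhance.Sharpness(out).enhance(sharpness)

# Demo on a synthetic image; in practice this runs on each reference PNG.
demo = soften_reference(Image.new("RGB", (256, 256), "gray"))
```

The goal is a barely visible smoothing: if the change is obvious at 100% zoom, the settings are too strong and start to threaten identity.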

Avoid prompts that increase microtexture pressure:

ultra detailed skin, sharp curly hair, high texture, 4k, hyper detailed material

Prefer stability wording:

stable natural skin texture, coherent hair texture, clean silhouette, consistent material, natural anatomy

Geometry and latent-size checks

Treat hidden geometry mismatch as a first-class suspect.

A ComfyUI issue says TextEncodeQwenImageEdit targets roughly 1M pixels internally, and warns that if the latent passed to KSampler is not based on that same effective geometry, unintended zooming can occur.

Source: ComfyUI issue #9481: 1MP fixed resizing in TextEncodeQwenImageEdit

That issue is about zoom/drift, but it still matters here. Under strong true-CFG, a geometry/reference-latent mismatch can show up as broader corruption, not just zoom.

Avoid a graph shaped like this:

reference images
→ TextEncode internal resize
→ VAE/reference latents at another size
→ KSampler latent at another size
→ output 1024x1536

Prefer one geometry source of truth:

preprocess/crop/pad references
→ choose final target geometry
→ build or encode latents consistently
→ feed references through controlled semantic/ref-latent paths
→ sample at the intended 1024x1536 geometry
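One way to make that single geometry source of truth explicit in code. This assumes the common 8-pixel VAE stride; the exact stride for this model is an assumption, so treat the default as a parameter to verify.

```python
def snap_geometry(width: int, height: int, stride: int = 8) -> tuple[int, int]:
    """Clamp a target geometry to the latent grid before building any latents."""
    return (width // stride) * stride, (height // stride) * stride

print(snap_geometry(1024, 1536))  # (1024, 1536): the target here is aligned
print(snap_geometry(1000, 1500))  # (1000, 1496): a misaligned request snaps down
```

Every latent, reference encode, and the KSampler input should then be derived from this one snapped pair instead of each node resizing independently.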

A community workflow for Qwen edit zooming reports fixing most zooming by disconnecting the VAE input from TextEncodeQwenImageEditPlus, adding VAE Encode per source, and chaining ReferenceLatent nodes.

Source: Reddit workflow: Qwen-Image-Edit unzooming / reference latent fix

Even though your symptom is more than zooming, I would still test explicit reference latents because it removes a major hidden-variable class.


Try target_vl_size=392 if available

If the node exposes a VL target size, test:

384 → 392

Reason:

Qwen2-VL visual preprocessing uses a 28-pixel factor.
384 is not divisible by 28.
392 is divisible by 28.

Source: Qwen2-VL image processor source

Interpretation:

Result | Interpretation
392 improves failing refs | VL resize/grid behavior is involved
392 changes nothing | issue is more likely VAE/ref path, CFG, sampler, or reference binding
392 worsens output | revert; the node may already be doing its own correction

This is a diagnostic, not a guaranteed fix.


Do not diagnose with res_3m + bong_tangent first

The custom sampler may be useful for final output, but it is not the right baseline.

Use this order:

1. Latest Diffusers QwenImageEditPlusPipeline, if possible
2. Official/native ComfyUI Qwen-Image-Edit-2511 workflow
3. Native Comfy + same 3 refs
4. Native Comfy + true-CFG sweep
5. Your workflow without RES4LYF
6. Your workflow with RES4LYF

ComfyUI’s official Qwen-Image-Edit-2511 guide is the right baseline for the Comfy side.

Source: ComfyUI Qwen-Image-Edit-2511 guide

If the failure appears only after step 6, the root is not simply “Qwen multi-ref token packing.” It is more likely:

multi-ref conditioning
× true CFG
× custom sampler/scheduler behavior

Recommended settings

Stable production path

Use the path that already works.

Model: Qwen-Image-Edit-2511 + Lightning
Steps: 4
true_cfg: 1.0
Output: 1024x1536
References: face + front body + back body
Prompt: strict reference-role prompt
Negative prompt: blank/minimal
Sampler: Lightning-compatible/native first

Use this when reliability matters.

Best quality/stability compromise: two-stage workflow

This is my strongest practical recommendation.

Stage 1:
  Model: Qwen-Image-Edit-2511 Lightning
  Refs: face + front body + back body
  Steps: 4
  true_cfg: 1.0
  Output: 1024x1536
  Goal: stable composition and reference binding

Stage 2:
  Model: Qwen-Image-Edit-2511 BF16
  Source: Stage 1 output
  Refs: face only, or face + front body
  Back ref: omit unless generating a back view
  Steps: 25-40
  true_cfg: 1.3-1.7
  negative_prompt: " "
  Sampler: native first
  Goal: detail, identity polish, clothing consistency, texture repair

This works because Stage 1 avoids the long full-CFG failure trajectory, and Stage 2 no longer needs to solve the entire 3-reference binding problem.

One-pass full-BF16 attempt

If you want one-pass full BF16, I would start here:

Model: Qwen-Image-Edit-2511 BF16
Pipeline/workflow: native Diffusers or native Comfy first
Output: 1024x1536
Steps: 33-40
true_cfg_scale: 1.4-1.6
negative_prompt: " "
Sampler: native first
References:
  Picture 1: face close-up, identity source
  Picture 2: body front, body/outfit source
  Picture 3: body back, semantic-only if possible
target_vl_size: try 392 if available

Do not start at true_cfg=2.7 for failing packs. Treat 2.7 as a stress-test value.

Likely CFG ranges:

true CFG | Expected behavior
1.0 | no true-CFG pressure; baseline
1.2 | very safe
1.4-1.6 | best starting range
1.8 | possibly usable
2.1 | likely starts exposing fragile refs
2.4-2.7 | likely artifact zone for failing packs
3.0+ | not useful until everything else is controlled

Test matrix I would run

Phase A: find the CFG cliff

Use one working reference set and one failing reference set. Keep seed, prompt, output size, model, dtype, and workflow fixed.

Test | true CFG
A | 1.0
B | 1.2
C | 1.4
D | 1.6
E | 1.8
F | 2.1
G | 2.4
H | 2.7

Interpretation:

Result | Meaning
clean through 1.8, breaks at 2.4-2.7 | classic CFG cliff / over-guidance
breaks at 1.2-1.4 | reference pack or geometry is unstable before CFG pressure
clean native, broken with RES4LYF | sampler interaction
broken even at 1.0 | not true-CFG; likely reference/geometry issue
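Phase A is easy to script. A sketch of the harness, where `generate` is a placeholder for whatever actually renders the image (a Diffusers pipeline call, a ComfyUI API request, and so on) and is not a real API:

```python
# The fixed CFG ladder from the Phase A table.
CFG_LADDER = [1.0, 1.2, 1.4, 1.6, 1.8, 2.1, 2.4, 2.7]

def cfg_sweep(generate, ref_set, seed=12345):
    """Run the fixed-seed true-CFG ladder on one reference set."""
    return {cfg: generate(refs=ref_set, true_cfg=cfg, seed=seed)
            for cfg in CFG_LADDER}

# Usage sketch: run once per reference pack, compare results by eye.
# results = cfg_sweep(my_render_fn, ["face", "front", "back"])
```

Keeping seed, prompt, and geometry inside `generate` fixed is what makes the eight outputs directly comparable.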

Phase B: isolate reference combinations

Run the same seed/settings with:

Test | References
1 | face only
2 | front body only
3 | back body only
4 | face + front
5 | face + back
6 | front + back
7 | face + front + back

Interpretation:

Observation | Likely cause
face + front clean, adding back breaks | back reference over-conditioning
face + back breaks | back reference conflicts with identity
front + back breaks | body geometry / outfit-reference conflict
all pairs clean, 3 refs break | token/reference packing or attention overload
only high-frequency sets break | reference-content sensitivity
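The Phase B combinations can be generated mechanically so that no subset is skipped:

```python
from itertools import combinations

REFS = ["face", "front", "back"]

# 7 subsets: 3 singles, 3 pairs, 1 triple -- matching tests 1-7 above.
subsets = [list(c) for r in range(1, len(REFS) + 1)
           for c in combinations(REFS, r)]
print(subsets)
```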

Phase C: test high-frequency-content hypothesis

Create two versions of each reference:

  • original
  • mildly denoised/de-sharpened/background-simplified

Then test:

Test | VL input | VAE/ref input
A | original | original
B | smoothed | original
C | original | smoothed
D | smoothed | smoothed

Interpretation:

Result | Meaning
B fixes it | Qwen2.5-VL semantic-token path likely involved
C fixes it | VAE/reference-latent path likely involved
D fixes it | both paths contribute
none fix it | CFG/sampler/reference-role issue is more likely

If you can instrument the pipeline

If you can patch the Python pipeline or node implementation, log the true-CFG internals after computing noise_pred, neg_noise_pred, and comb_pred, before the scheduler step.

import torch

# Per-step true-CFG internals, computed right before the scheduler step.
delta = noise_pred - neg_noise_pred
comb_pred = neg_noise_pred + true_cfg_scale * delta

# Cast to fp32 so BF16 rounding does not pollute the statistics.
cond_norm = torch.norm(noise_pred.float(), dim=-1)
neg_norm = torch.norm(neg_noise_pred.float(), dim=-1)
delta_norm = torch.norm(delta.float(), dim=-1)
comb_norm = torch.norm(comb_pred.float(), dim=-1)

# How hard the norm rescale has to fight to restore the conditional magnitude.
scale_ratio = cond_norm / (comb_norm + 1e-8)

# How aligned the conditional and negative predictions are per token.
cos = torch.nn.functional.cosine_similarity(
    noise_pred.float(),
    neg_noise_pred.float(),
    dim=-1,
)

Log per step:

cond_norm p50 / p95 / p99
delta_norm p50 / p95 / p99
scale_ratio p95 / p99 / max
cosine p01 / p50
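A small helper for those percentile lines, computed per logged step from the norm tensors above (assumes a flattened fp32 copy fed to `torch.quantile`):

```python
import torch

def norm_percentiles(norms: torch.Tensor) -> dict:
    """p50 / p95 / p99 of a per-token norm tensor, one call per logged step."""
    q = torch.quantile(norms.float().flatten(),
                       torch.tensor([0.50, 0.95, 0.99]))
    return {"p50": q[0].item(), "p95": q[1].item(), "p99": q[2].item()}
```

Logging percentiles instead of means matters here because the hypothesis is specifically about outlier tokens, which a mean would wash out.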

Compare:

working refs vs failing refs
1 ref vs 3 refs
true_cfg 1.5 vs 2.7
original refs vs smoothed refs
back ref included vs omitted
back ref full vs semantic-only

Strong evidence for the CFG hypothesis would be:

failing 3-ref packs show larger delta_norm
failing 3-ref packs show lower cond/negative cosine similarity
failing 3-ref packs show scale_ratio spikes
smoothing VL or VAE refs reduces those spikes
removing or weakening the back ref reduces those spikes

This is how to move from “plausible explanation” to actual evidence.


Related public cases / resources

Core Qwen / Diffusers / Comfy references

Lightning / distilled path

Geometry / zoom / latent mismatch cases

Multi-reference / binding context

CFG / guidance context


What I would avoid for now

Avoid this combination while diagnosing:

true_cfg = 2.7
heavy negative prompt
all 3 refs as full VAE/reference latents
back ref treated as identity
RES4LYF during diagnosis
uncontrolled hidden 1MP resize
unprocessed high-frequency synthetic refs
generic “ultra detailed” prompt terms

That combination is almost exactly the unstable corner.


Final recommendation

My best practical workflow would be:

Use Lightning for composition.
Use BF16 full model only for low-CFG refinement.

More specifically:

1. Generate stable composition:
   Qwen-Image-Edit-2511 Lightning
   3 refs
   true_cfg = 1
   4 steps
   1024x1536

2. Refine:
   Qwen-Image-Edit-2511 BF16
   source = Lightning output
   refs = face only or face + front body
   true_cfg = 1.4-1.6
   negative_prompt = " "
   25-40 steps
   native sampler first

If a one-pass full-BF16 run is required, use:

true_cfg = 1.4-1.6
negative_prompt = " "
face = identity source
front body = outfit/body source
back body = semantic-only first
target_vl_size = 392 if available
native sampler first
explicit/latent-aware reference geometry if possible

The exact “384² token-norm outlier” mechanism is plausible but unproven. The safer conclusion is:

This is content-dependent multi-reference conditioning instability, exposed and amplified by full true-CFG. Lightning avoids the fragile path. Full CFG can still work, but only after reducing CFG pressure, reference-role ambiguity, hidden geometry mismatch, and sampler confounds.

To be honest, I’m going to put this on hold for now, because it works well with almost all the characters I’m trying to create (even the anime character with the fox tail :D). It only had issues with one of them. If I remove the negative prompt and tweak the prompts a bit, it works out, but it’s still very fragile. I’ll set aside more time later to find a proper solution; for now, I’m skipping it.

Thanks anyway for the detailed explanation. I’ve read some of it, and I’ll read the rest when I get back to this problem.

But actually, I don’t think we even need a reference image of the back (unless there’s a tattoo or something), since the front view seems to produce a good result in rear shots as well, based on the waist-to-hip ratio.

But I’d like to add something: in my opinion, Qwen Image Edit 2511 really does produce results of consistent quality. Maybe I haven’t even found the optimal settings yet, but it’s great. Since it generates scenes from scratch, the output keeps that AI-generated look, which is why I’m trying out some LoRAs that work well in text-to-image generation.