Short version
I would treat this as a real, repeatable interaction between multi-reference conditioning and full true-CFG, but I would not call it a confirmed public “Qwen2.5-VL 384² token-norm outlier bug.”
The best working diagnosis is:
Some 3-reference packs create an unstable conditional prediction. Full true-CFG repeatedly pushes the denoising trajectory along that unstable conditional-minus-negative direction. The norm rescale can keep the prediction magnitude bounded, but it cannot guarantee that the update direction is semantically safe. Lightning with true_cfg=1 avoids that failure path, so the same references can look clean.
In this setup, the failure probably is not one single thing. It is likely the intersection of:
- 3 image references with different roles: face close-up + front body + back body.
- High-frequency synthetic reference content, especially dense curls, skin microtexture, fabric/texture detail, or turbo-model sharpening.
- Qwen-Image-Edit’s dual image-conditioning design, where the input image is routed through both Qwen2.5-VL semantic conditioning and VAE appearance conditioning.
- Full true-CFG, which uses the conditional/negative prediction difference and then applies a norm-ratio rescale.
- ComfyUI preprocessing / latent geometry, including possible hidden resize or mismatch between reference conditioning geometry and KSampler latent geometry.
- Custom sampler/scheduler behavior from res_3m + bong_tangent.
I would not conclude that 3+ image references are “only stable on Lightning.” Full multi-image Qwen edit can work. But this exact corner — 3 refs + high-frequency synthetic refs + full true-CFG + non-square output + custom sampler — is fragile enough that I would not run it as a one-shot full-CFG workflow without reducing ambiguity first.
Why this failure pattern is meaningful
The key pattern is:
- 1 face reference only: clean in full CFG and clean in Lightning.
- 3 references: clean for some character/reference sets, broken for others.
- Same parameters and seeds: clean or broken depending on reference content.
- Lightning 4-step, true_cfg=1: clean for every reference set.
That pattern strongly argues against a simple “bad seed” or “bad prompt” explanation.
If it were purely random sampling, you would expect less consistent dependence on the reference set. If it were purely output resolution, BF16, or the model checkpoint, the breakage would depend less on which character/reference pack is used. If it were purely the sampler, the single-reference case should also be fragile.
Instead, the most useful interpretation is:
reference content
→ unstable multi-image conditioning
→ full true-CFG amplifies it
→ artifact appears over denoising steps
The reference-content dependence is the important clue.
What is known from the model design
Qwen-Image-Edit does not use the input image in only one way. The model card says the input image is fed into Qwen2.5-VL for semantic control and into the VAE encoder for visual appearance control.
Source: Qwen/Qwen-Image-Edit model card
That matters because the artifact can originate in either path:
| Channel | What it controls | How it can fail |
| --- | --- | --- |
| Qwen2.5-VL semantic path | identity meaning, object roles, face/body interpretation, picture-to-picture binding | identity drift, wrong reference role, subject blending, face/body confusion |
| VAE / reference-latent path | color, texture, local visual detail, clothing material, skin/hair texture | texture corruption, color bleed, hair/skin over-detail, local anatomy deformation |
Your symptoms span both channels:
- identity drift → semantic/reference-binding instability.
- color/texture corruption → appearance/reference-latent instability.
- anatomy distortion → reference-role confusion plus guidance/sampler amplification.
That is why “just improve the prompt” is usually not enough. Prompt clarity helps, but the model is also consuming multiple visual encodings and reference latents.
What is known from Qwen true-CFG
Diffusers’ Qwen docs distinguish normal guidance_scale from real Qwen classifier-free guidance. In the Qwen pipeline, true CFG is enabled with true_cfg_scale plus a negative_prompt; even an empty negative prompt can activate the branch.
Source: Diffusers QwenImage docs
The Qwen edit pipeline source says true CFG is enabled when true_cfg_scale > 1 and a negative prompt is provided. It also says higher guidance links the image more closely to the prompt, usually at the cost of lower image quality.
Source: Diffusers Qwen Image Edit pipeline source
A Qwen-Image-Edit-Plus pipeline copy shows the relevant true-CFG calculation:
Source: Qwen Image Edit Plus pipeline copy
The important part is essentially:
```python
comb_pred = neg_noise_pred + true_cfg_scale * (noise_pred - neg_noise_pred)
cond_norm = torch.norm(noise_pred, dim=-1, keepdim=True)
noise_norm = torch.norm(comb_pred, dim=-1, keepdim=True)
noise_pred = comb_pred * (cond_norm / noise_norm)
```
The norm rescale is easy to overinterpret. It can keep the combined prediction’s magnitude near the conditional prediction’s magnitude, but it does not prove that the direction is safe.
In plain language: a right-sized vector does not necessarily mean a right semantic direction.
So if the 3-reference conditional prediction is already unstable, true-CFG can repeatedly push the trajectory in a bad direction while the norm rescale still appears mathematically “reasonable.”
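A toy 2-D example makes this concrete (pure Python, not pipeline code). The rescale restores the conditional magnitude exactly, yet the combined prediction can still point well away from the conditional direction:

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

# Toy stand-ins for the conditional and negative predictions.
noise_pred = [1.0, 0.0]      # conditional prediction
neg_noise_pred = [0.0, 1.0]  # negative prediction, very different direction
true_cfg_scale = 2.7

# Same combination + norm rescale as the pipeline snippet above.
comb = [n + true_cfg_scale * (c - n) for c, n in zip(noise_pred, neg_noise_pred)]
rescaled = [x * (norm(noise_pred) / norm(comb)) for x in comb]

print(round(norm(rescaled), 3))                # 1.0  (magnitude matches conditional)
print(round(cosine(rescaled, noise_pred), 3))  # 0.846 (direction has rotated away)
```

The magnitude check passes while the direction has drifted, which is exactly the failure mode the rescale cannot detect.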
Why Lightning being clean does not disprove the full-CFG issue
Lightning is not simply “the same full model but fewer steps.” The Lightning card describes step distillation that reduces standard inference to 4 steps and gives a large speedup compared with standard 40-step inference.
Source: lightx2v/Qwen-Image-Edit-2511-Lightning
So this comparison:
- Full BF16: true_cfg = 2.7, 33 steps, artifacts on some 3-ref packs
- Lightning: true_cfg = 1, 4 steps, clean on all 3-ref packs
should be interpreted as:
- long full-guidance trajectory: fragile
- short distilled / no-true-CFG trajectory: robust
It should not be interpreted as proof that the reference pack is universally safe.
The Lightning result is useful because it says the references contain enough usable information to make a clean image. But it does not prove that full true-CFG can use that same information stably.
Is the 384² Qwen2.5-VL downscale the root cause?
Possible, but not proven.
A more careful statement is:
High-frequency rendered references can plausibly produce unstable visual-token or reference-latent conditioning after resizing/downsampling. That instability can then appear downstream as a larger conditional-vs-negative prediction difference during full true-CFG. But I would not claim that the root cause is specifically Qwen2.5-VL per-token norm outliers from 384² resizing unless tensor logging confirms it.
Why the suspicion is technically reasonable:
Qwen2-VL-style image preprocessing uses smart_resize, with dimensions made divisible by a factor tied to patch/merge behavior. The source shows defaults such as patch_size=14, merge_size=2, and a resize factor of 28.
Source: Qwen2-VL image processor source
That makes this diagnostic worth testing:
384 / 28 = 13.714...
392 / 28 = 14
So if a node exposes target_vl_size, testing 392 instead of 384 is useful. It does not prove the theory, but it removes one avoidable grid-alignment variable.
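The grid-alignment arithmetic is easy to script. This is a simplified sketch of the rounding, assuming the default factor of 28 (patch_size=14 × merge_size=2); the real smart_resize also enforces pixel budgets and aspect-ratio limits:

```python
def round_to_factor(size: int, factor: int = 28) -> int:
    """Round a target side length to the nearest multiple of `factor`,
    mimicking the grid alignment that smart_resize enforces."""
    return max(factor, round(size / factor) * factor)

print(round_to_factor(384))  # 392 -- 384 is not grid-aligned; nearest multiple is 392
print(round_to_factor(392))  # 392 -- already aligned
```

This is why 392 is the natural alternative to test: it is the multiple of 28 closest to 384.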
The high-frequency-content hypothesis has three possible locations:
| Suspect | Meaning | Test |
| --- | --- | --- |
| VL token path | resized/patchified semantic image tokens become unstable for dense curls, skin texture, or sharp synthetic detail | smooth only the VL input; keep VAE/ref input original |
| VAE/reference-latent path | appearance latents over-inject high-frequency texture | smooth only the VAE/ref input; keep VL input original |
| CFG path | full true-CFG amplifies an unstable conditional prediction | sweep true_cfg from 1.0 to 2.7 |
The current evidence proves content-dependent instability. It does not yet prove exactly where inside the stack that instability begins.
Are 3+ references only stable on Lightning?
No, not generally.
Qwen-Image-Edit-2511 is explicitly presented as improving character consistency, and Qwen/Diffusers-style edit pipelines support image-conditioned editing. The issue is not “multi-image references are impossible.” The issue is that your exact setup is a high-risk corner.
Source: Qwen/Qwen-Image-Edit-2511 model card
The fragile combination is:
- 3 references: face + front body + back body
- synthetic high-frequency rendered references
- BF16 full model
- true_cfg = 2.7
- 33 denoising steps
- 1024x1536 output
- RES4LYF res_3m + bong_tangent
So the better answer is:
3+ references are not Lightning-only, but this exact 3-ref/full-CFG/custom-sampler setup should be treated as fragile. Use Lightning for first-pass composition, then use lower-CFG full BF16 refinement with fewer or weaker references.
The biggest practical change: stop treating all 3 references equally
The three images have different jobs.
| Reference | Correct role | Wrong role to avoid |
| --- | --- | --- |
| Face close-up | identity, face structure, hairline, expression, age impression | full outfit geometry, back clothing |
| Body front | front outfit, body proportions, front silhouette, color placement | face identity |
| Body back | rear clothing, back silhouette, hair length from behind | face identity, front anatomy, skin texture source |
The back reference is especially dangerous because it can contain strong hair/body/clothing cues without a face identity anchor. If it participates fully in the VAE/reference-latent path, it can inject body/texture information that competes with the face and front-body references.
If the node supports separate semantic/reference participation, test:
| Ref | VL semantic path | VAE/ref-latent path |
| --- | --- | --- |
| Face | on | on |
| Front body | on | on |
| Back body | on | off initially |
In plain language:
- Use the back reference as semantic guidance first.
- Do not let it be a full appearance/reference-latent source unless needed.
- Only enable the back reference as a full VAE/reference latent if the final output is a back-view image or if rear outfit construction is essential.
Prompt template I would use
The official Qwen-Image-Edit-2511 app prompt guidance says multi-image prompts should clearly specify which image’s element is being modified.
Source: Qwen/Qwen-Image-Edit-2511 app.py prompt guidance
For a front-facing or general full-body portrait, I would use a prompt like this:
Use the references with strict roles.
Picture 1 is the identity reference. Preserve the same face identity, facial structure, age impression, hairline, and overall character identity from Picture 1.
Picture 2 is the front body and outfit reference. Use it for body proportions, front silhouette, clothing shape, front-view outfit details, and color placement.
Picture 3 is the back outfit reference only. Use it only for back-side clothing construction, rear silhouette, and hair length visible from behind. Do not use Picture 3 to change the face, facial identity, skin texture, expression, or front-facing anatomy.
Generate one coherent person in a clean full-body 2:3 portrait. Do not blend identities. Do not average the face across references. Keep natural anatomy, stable skin texture, stable hair texture, and consistent clothing material. Do not copy rear-view anatomy into the front view.
For a back-view output, change the roles:
Use the references with strict roles.
Picture 1 is the identity and hair reference. Preserve the same character identity and overall hair type from Picture 1, but do not invent a visible face because the final image is a back view.
Picture 2 is the front outfit reference. Use it only for consistent clothing design, material, and color placement.
Picture 3 is the back outfit reference. Use it as the primary source for the rear silhouette, back-side clothing construction, hair length from behind, and rear material layout.
Generate a clean full-body back-view 2:3 portrait of one coherent person. Keep the outfit consistent across front and back references. Keep anatomy natural. Do not create extra limbs, duplicate hair masses, face fragments, or mixed front/back body structure.
The point is not literary quality. The point is to reduce reference-role ambiguity.
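If you regenerate these prompts often, a small helper keeps the role assignments consistent across runs. This is a hypothetical convenience function, not part of any node or API; the role descriptions are condensed from the templates above:

```python
# Condensed role descriptions; edit per character/reference pack.
ROLES = {
    "Picture 1": "identity reference: preserve face identity, facial structure, "
                 "age impression, and hairline",
    "Picture 2": "front body and outfit reference: use for proportions, front "
                 "silhouette, clothing shape, and color placement",
    "Picture 3": "back outfit reference only: use only for back-side clothing, "
                 "rear silhouette, and hair length from behind",
}

def build_role_prompt(roles: dict, scene: str) -> str:
    """Assemble a strict reference-role prompt from a role mapping."""
    lines = ["Use the references with strict roles."]
    lines += [f"{name} is the {desc}." for name, desc in roles.items()]
    lines.append(scene)
    return "\n".join(lines)

prompt = build_role_prompt(
    ROLES,
    "Generate one coherent person in a clean full-body 2:3 portrait. "
    "Do not blend identities.",
)
print(prompt.splitlines()[1])  # first role line
```

Swapping the role mapping is then a one-dictionary change when moving from front-view to back-view outputs.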
Reference preprocessing I would apply
Because the failing sets are high-frequency rendered references, I would preprocess the references before changing more sampler/model knobs.
The goal is not to change identity. The goal is to reduce unstable synthetic microtexture.
Face reference
| Operation | Strength | Reason |
| --- | --- | --- |
| crop to face/head/upper shoulders | strong | remove irrelevant body/background tokens |
| remove or simplify busy background | strong | reduce unrelated visual tokens |
| mild denoise | low | remove synthetic turbo grain |
| mild de-sharpen / reduce local contrast | low | reduce patch-level hair/skin spikes |
| preserve face identity/color | strict | avoid changing identity |
Front body reference
| Operation | Strength | Reason |
| --- | --- | --- |
| clean full-body crop | strong | keep body/outfit information |
| simplify background | medium/strong | reduce irrelevant reference detail |
| mild de-sharpen | low | reduce texture overbinding |
| preserve clothing color layout | strict | this is the outfit source |
Back body reference
| Operation | Strength | Reason |
| --- | --- | --- |
| clean back-body crop | strong | keep only rear silhouette/outfit |
| simplify background | strong | reduce irrelevant tokens |
| mild denoise / de-sharpen | medium | this reference is high-risk |
| avoid full VAE/ref path initially | strong | prevent appearance over-injection |
Avoid prompts that increase microtexture pressure:
ultra detailed skin, sharp curly hair, high texture, 4k, hyper detailed material
Prefer stability wording:
stable natural skin texture, coherent hair texture, clean silhouette, consistent material, natural anatomy
Geometry and latent-size checks
Treat hidden geometry mismatch as a first-class suspect.
A ComfyUI issue says TextEncodeQwenImageEdit targets roughly 1M pixels internally, and warns that if the latent passed to KSampler is not based on that same effective geometry, unintended zooming can occur.
Source: ComfyUI issue #9481: 1MP fixed resizing in TextEncodeQwenImageEdit
That issue is about zoom/drift, but it still matters here. Under strong true-CFG, a geometry/reference-latent mismatch can show up as broader corruption, not just zoom.
Avoid a graph shaped like this:
reference images
→ TextEncode internal resize
→ VAE/reference latents at another size
→ KSampler latent at another size
→ output 1024x1536
Prefer one geometry source of truth:
preprocess/crop/pad references
→ choose final target geometry
→ build or encode latents consistently
→ feed references through controlled semantic/ref-latent paths
→ sample at the intended 1024x1536 geometry
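The size of the mismatch is easy to quantify. A hedged sketch, assuming the ~1,048,576-pixel internal budget described in the ComfyUI issue (`fit_to_pixel_budget` is a hypothetical helper; the real node also snaps dimensions to model-friendly multiples):

```python
import math

def fit_to_pixel_budget(w: int, h: int, budget: int = 1024 * 1024):
    """Scale (w, h) so the area is roughly `budget`, keeping aspect ratio."""
    scale = math.sqrt(budget / (w * h))
    return round(w * scale), round(h * scale)

print(1024 * 1536)                      # 1572864 -- intended sampling area (~1.5 MP)
print(fit_to_pixel_budget(1024, 1536))  # (836, 1254) -- ~1 MP internal geometry
```

So a hidden 1MP resize would put the text-encode path at roughly 836x1254 while the KSampler latent is built for 1024x1536, which is exactly the kind of silent geometry split to rule out.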
A community workflow for Qwen edit zooming reports fixing most zooming by disconnecting the VAE input from TextEncodeQwenImageEditPlus, adding VAE Encode per source, and chaining ReferenceLatent nodes.
Source: Reddit workflow: Qwen-Image-Edit unzooming / reference latent fix
Even though your symptom is more than zooming, I would still test explicit reference latents because it removes a major hidden-variable class.
Try target_vl_size=392 if available
If the node exposes a VL target size, test:
384 → 392
Reason:
Qwen2-VL visual preprocessing uses a 28-pixel factor.
384 is not divisible by 28.
392 is divisible by 28.
Source: Qwen2-VL image processor source
Interpretation:
| Result | Interpretation |
| --- | --- |
| 392 improves failing refs | VL resize/grid behavior is involved |
| 392 changes nothing | issue is more likely VAE/ref path, CFG, sampler, or reference binding |
| 392 worsens output | revert; the node may already be doing its own correction |
This is a diagnostic, not a guaranteed fix.
Do not diagnose with res_3m + bong_tangent first
The custom sampler may be useful for final output, but it is not the right baseline.
Use this order:
1. Latest Diffusers QwenImageEditPlusPipeline, if possible
2. Official/native ComfyUI Qwen-Image-Edit-2511 workflow
3. Native Comfy + same 3 refs
4. Native Comfy + true-CFG sweep
5. Your workflow without RES4LYF
6. Your workflow with RES4LYF
ComfyUI’s official Qwen-Image-Edit-2511 guide is the right baseline for the Comfy side.
Source: ComfyUI Qwen-Image-Edit-2511 guide
If the failure appears only after step 6, the root is not simply “Qwen multi-ref token packing.” It is more likely:
multi-ref conditioning
× true CFG
× custom sampler/scheduler behavior
Recommended settings
Stable production path
Use the path that already works.
Model: Qwen-Image-Edit-2511 + Lightning
Steps: 4
true_cfg: 1.0
Output: 1024x1536
References: face + front body + back body
Prompt: strict reference-role prompt
Negative prompt: blank/minimal
Sampler: Lightning-compatible/native first
Use this when reliability matters.
Best quality/stability compromise: two-stage workflow
This is my strongest practical recommendation.
Stage 1:
Model: Qwen-Image-Edit-2511 Lightning
Refs: face + front body + back body
Steps: 4
true_cfg: 1.0
Output: 1024x1536
Goal: stable composition and reference binding
Stage 2:
Model: Qwen-Image-Edit-2511 BF16
Source: Stage 1 output
Refs: face only, or face + front body
Back ref: omit unless generating a back view
Steps: 25-40
true_cfg: 1.3-1.7
negative_prompt: " "
Sampler: native first
Goal: detail, identity polish, clothing consistency, texture repair
This works because Stage 1 avoids the long full-CFG failure trajectory, and Stage 2 no longer needs to solve the entire 3-reference binding problem.
One-pass full-BF16 attempt
If you want one-pass full BF16, I would start here:
Model: Qwen-Image-Edit-2511 BF16
Pipeline/workflow: native Diffusers or native Comfy first
Output: 1024x1536
Steps: 33-40
true_cfg_scale: 1.4-1.6
negative_prompt: " "
Sampler: native first
References:
Picture 1: face close-up, identity source
Picture 2: body front, body/outfit source
Picture 3: body back, semantic-only if possible
target_vl_size: try 392 if available
Do not start at true_cfg=2.7 for failing packs. Treat 2.7 as a stress-test value.
Likely CFG ranges:
| true CFG | Expected behavior |
| --- | --- |
| 1.0 | no true-CFG pressure; baseline |
| 1.2 | very safe |
| 1.4-1.6 | best starting range |
| 1.8 | possibly usable |
| 2.1 | likely starts exposing fragile refs |
| 2.4-2.7 | likely artifact zone for failing packs |
| 3.0+ | not useful until everything else is controlled |
Test matrix I would run
Phase A: find the CFG cliff
Use one working reference set and one failing reference set. Keep seed, prompt, output size, model, dtype, and workflow fixed.
| Test | true CFG |
| --- | --- |
| A | 1.0 |
| B | 1.2 |
| C | 1.4 |
| D | 1.6 |
| E | 1.8 |
| F | 2.1 |
| G | 2.4 |
| H | 2.7 |
Interpretation:
| Result | Meaning |
| --- | --- |
| clean through 1.8, breaks at 2.4-2.7 | classic CFG cliff / over-guidance |
| breaks at 1.2-1.4 | reference pack or geometry is unstable before CFG pressure |
| clean native, broken with RES4LYF | sampler interaction |
| broken even at 1.0 | not true-CFG; likely reference/geometry issue |
Phase B: isolate reference combinations
Run the same seed/settings with:
| Test | References |
| --- | --- |
| 1 | face only |
| 2 | front body only |
| 3 | back body only |
| 4 | face + front |
| 5 | face + back |
| 6 | front + back |
| 7 | face + front + back |
Interpretation:
| Observation | Likely cause |
| --- | --- |
| face + front clean, adding back breaks | back reference over-conditioning |
| face + back breaks | back reference conflicts with identity |
| front + back breaks | body geometry / outfit-reference conflict |
| all pairs clean, 3 refs break | token/reference packing or attention overload |
| only high-frequency sets break | reference-content sensitivity |
Phase C: test high-frequency-content hypothesis
Create two versions of each reference:
- original
- mildly denoised/de-sharpened/background-simplified
Then test:
| Test | VL input | VAE/ref input |
| --- | --- | --- |
| A | original | original |
| B | smoothed | original |
| C | original | smoothed |
| D | smoothed | smoothed |
Interpretation:
| Result | Meaning |
| --- | --- |
| B fixes it | Qwen2.5-VL semantic-token path likely involved |
| C fixes it | VAE/reference-latent path likely involved |
| D fixes it | both paths contribute |
| none fix it | CFG/sampler/reference-role issue is more likely |
If you can instrument the pipeline
If you can patch the Python pipeline or node implementation, log the true-CFG internals after computing noise_pred, neg_noise_pred, and comb_pred, before the scheduler step.
```python
import torch

# Compute right after noise_pred and neg_noise_pred are available,
# before the scheduler step.
delta = noise_pred - neg_noise_pred
comb_pred = neg_noise_pred + true_cfg_scale * delta

cond_norm = torch.norm(noise_pred.float(), dim=-1)
neg_norm = torch.norm(neg_noise_pred.float(), dim=-1)
delta_norm = torch.norm(delta.float(), dim=-1)
comb_norm = torch.norm(comb_pred.float(), dim=-1)
scale_ratio = cond_norm / (comb_norm + 1e-8)
cos = torch.nn.functional.cosine_similarity(
    noise_pred.float(),
    neg_noise_pred.float(),
    dim=-1,
)
```
Log per step:
cond_norm p50 / p95 / p99
delta_norm p50 / p95 / p99
scale_ratio p95 / p99 / max
cosine p01 / p50
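On a live tensor, `torch.quantile` gives these percentiles directly. If you instead export the per-token values as plain lists for offline comparison, a minimal nearest-rank percentile works too (a sketch with illustrative numbers, not real logged data):

```python
def percentile(values, p):
    """Nearest-rank percentile of a list of floats (p in [0, 100])."""
    s = sorted(values)
    idx = min(len(s) - 1, max(0, round(p / 100 * (len(s) - 1))))
    return s[idx]

# Example: per-token delta norms collected at one denoising step,
# with a single outlier token of the kind the hypothesis predicts.
delta_norms = [0.9, 1.0, 1.1, 1.0, 0.95, 7.5]
print(percentile(delta_norms, 50))  # 1.0 -- the median looks healthy
print(percentile(delta_norms, 99))  # 7.5 -- the tail exposes the outlier
```

This is why the p95/p99 tail matters more than the mean: a handful of outlier tokens can drive the artifact while the median stays normal.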
Compare:
working refs vs failing refs
1 ref vs 3 refs
true_cfg 1.5 vs 2.7
original refs vs smoothed refs
back ref included vs omitted
back ref full vs semantic-only
Strong evidence for the CFG hypothesis would be:
failing 3-ref packs show larger delta_norm
failing 3-ref packs show lower cond/negative cosine similarity
failing 3-ref packs show scale_ratio spikes
smoothing VL or VAE refs reduces those spikes
removing or weakening the back ref reduces those spikes
This is how to move from “plausible explanation” to actual evidence.
What I would avoid for now
Avoid this combination while diagnosing:
true_cfg = 2.7
heavy negative prompt
all 3 refs as full VAE/reference latents
back ref treated as identity
RES4LYF during diagnosis
uncontrolled hidden 1MP resize
unprocessed high-frequency synthetic refs
generic “ultra detailed” prompt terms
That combination is almost exactly the unstable corner.
Final recommendation
My best practical workflow would be:
Use Lightning for composition.
Use BF16 full model only for low-CFG refinement.
More specifically:
1. Generate stable composition:
Qwen-Image-Edit-2511 Lightning
3 refs
true_cfg = 1
4 steps
1024x1536
2. Refine:
Qwen-Image-Edit-2511 BF16
source = Lightning output
refs = face only or face + front body
true_cfg = 1.4-1.6
negative_prompt = " "
25-40 steps
native sampler first
If a one-pass full-BF16 run is required, use:
true_cfg = 1.4-1.6
negative_prompt = " "
face = identity source
front body = outfit/body source
back body = semantic-only first
target_vl_size = 392 if available
native sampler first
explicit/latent-aware reference geometry if possible
The exact “384² token-norm outlier” mechanism is plausible but unproven. The safer conclusion is:
This is content-dependent multi-reference conditioning instability, exposed and amplified by full true-CFG. Lightning avoids the fragile path. Full CFG can still work, but only after reducing CFG pressure, reference-role ambiguity, hidden geometry mismatch, and sampler confounds.