Wan2.2 i2v (clarifications needed regarding settings on low vram system)

hey fellas, how are you all doing?

i’ll get right to the point. im new to i2v but after a few weeks of trial and error im starting to get the hang of things. there is a lot of information available out there but my god is it contradictory. sometimes ive gotten better results by doing the exact opposite of what a lot of people absolutely swear by. it doesnt help that im on a very modest entry level setup (8gb 4060 laptop). a lot of forum posts and articles seem to be written with heftier setups in mind. so i think i have reached the limit of what i can achieve through personal experimentation and following advice not designed for my quantized setup. things like “oh, just increase cfg for more obedience” while i increased gradually one decimal at a time with the same exact seed and prompt till cfg3 and saw absolutely ZERO difference. my render time is way too slow to effectively do micro tweaks and get any real results in an acceptable time frame. (im working on getting a 24gb setup in about 4 months but i dont wanna sit idle in the meantime)

“I’m running a highly optimized Wan I2V setup. It’s a GGUF-based workflow using the wan2.2-rapid-aio weights. quantized Q4_K GGUF. I’ve got SageAttention and BlockSwapping enabled to handle the VRAM load.”

i gotta be honest, ive gotten very good results on occasion, even when it comes to very specific things. my problem is consistency. i will work on one picture for an entire day through trial and error, then as soon as its a slightly different picture i have to start from scratch. im using sa_solver (beta) 4 steps, 1.0cfg, denoise 0.6, sd3 shift 8. i know this sounds ridiculous but i swear of all the things ive tried this is the only one that gets me any results so far (and quite quickly as well). the reason i use ksampler and not ksampler advanced is because when i used wan2.1 the video was exactly like the source picture, but after transitioning to 2.2 the video was way blurrier, less vivid, less sharp, felt less hd and the general hue was much more “reddish”. the denoise option set to 0.6 helps with all that

i just wanna know what should be my settings starting point and what should i be gradually increasing to see improvement. the most important aspect for me is face consistency and obedience to prompts. remember, i have an 8gb card so things start to turn into a deep fried lsd fever dream with cfg past 3. would you suggest any additional nodes to my setup? different sampler? different settings? my goal is to be 100% authentic to the source image. no embellishment from ai. no scifi themes or fantasy or anything like that. i basically wanna make the picture move, thats it. if im in a scenario where the ai isnt obeying prompts what should i do? (other than “try different prompts”. cause im trying to isolate how far i can push the ai before refining prompts. i wanna get a good baseline first)

please give me some pointers for my specific setup and goals, thank you very much in advance

4-step variation might not be suitable for your purpose:


Wan2.2 I2V on 8GB VRAM: practical baseline for source-faithful animation

For your exact goal — make the picture move, keep the same face, keep the same identity, keep the same lighting/background/clothing, and avoid AI embellishment — I would not tune this like a normal high-VRAM Wan2.2 setup.

Your current results are not strange:

  • CFG from 1 to ~3 doing almost nothing is plausible in a 4-step Rapid/Lightning-style workflow.
  • CFG above ~3 turning the output into overcooked chaos is also plausible.
  • Denoise around 0.6 helping sharpness/color/source fidelity is not ridiculous.
  • Different source images needing different settings usually means the workflow has too many interacting variables: GGUF quantization, Rapid/distilled weights, sampler, scheduler, shift, text encoder quality, VAE, offloading, source-image difficulty, and the Wan2.2 High/Low-noise expert split.

The core point:

Do not treat CFG as the main “obedience knob” in your setup.
For 8GB VRAM + GGUF + 4-step Rapid/Lightning-style I2V, CFG is a small final adjustment, not the steering wheel.

The knobs I would tune first are:

  1. source image quality / crop
  2. denoise
  3. motion size
  4. shift
  5. Low-noise step count / Low-noise quantization
  6. sampler branch
  7. text encoder quantization
  8. CFG last



1. Why your current setup is hard to tune

You are not simply running “Wan2.2.” You are running a stacked compromise:

Wan2.2-style I2V
+ Rapid/AIO or distilled behavior
+ GGUF quantization
+ Q4-class compression
+ 4-step sampling
+ SageAttention
+ BlockSwap/offload
+ 8GB laptop VRAM
+ denoise below 1.0
+ SD3 shift
+ image conditioning

That matters because one setting can appear useless when another part of the stack is dominating.

For example, CFG may appear to do nothing because:

  • the model was distilled/merged for CFG 1
  • 4 steps are too few for CFG to gradually steer the output
  • image conditioning dominates the text
  • the negative prompt is weak or mostly inactive at CFG 1
  • quantization reduces sensitivity to small guidance changes
  • the sampler/scheduler/shift combination matters more than CFG
  • the High/Low-noise split is doing more than the text guidance

Some Rapid/AIO model cards explicitly say their models are intended for CFG 1 and 4 steps. See the WAN2.2 Rapid All-in-One model card. Wan2.2-Lightning similarly describes a 4-step distilled path, so it should not be tuned like a normal 20–30 step diffusion workflow. See Wan2.2-Lightning.

So your observation — “CFG 1 to 3 did nothing, then above 3 broke everything” — is consistent with this kind of workflow.


2. The most important Wan2.2 idea: High-noise vs Low-noise experts

Wan2.2 A14B uses a Mixture-of-Experts style denoising structure. The official Wan2.2 repo describes MoE as separating the denoising process across timesteps with specialized expert models. See Wan2.2 official GitHub.

In practical I2V terms:

High-noise expert
  Mostly affects: broad motion, layout, pose, composition, camera direction
  If weak/wrong, you may see: scene drift, pose weirdness, motion chaos, composition changes

Low-noise expert
  Mostly affects: face detail, eyes, mouth, skin, clothing texture, color, final sharpness
  If weak/wrong, you may see: face melting, blur, color shift, unstable eyes/mouth, loss of likeness

For your goal, Low-noise behavior is extremely important.

If the face changes, the first fix is usually not “raise CFG.” More likely fixes are:

  • lower denoise
  • reduce the requested motion
  • add more Low-noise steps
  • use a better Low-noise quant if possible
  • check the VAE
  • crop/use a clearer source face
  • avoid cinematic/camera-heavy prompts
  • avoid LoRAs until the baseline is stable

WanMoeKSampler is relevant if you are using separate High/Low Wan2.2 A14B models. Its README says it is designed for Wan2.2 A14B-style MoE workflows and avoids manually guessing the High-to-Low switch point. See WanMoeKSampler.


3. Best starting point for your actual goal

Your goal is not “maximum cinematic transformation.” Your goal is:

same person
same face
same identity
same clothing
same lighting
same background
small natural movement
static camera
no embellishment

So I would start conservative.

Recommended baseline for your current Rapid/AIO-style setup

Sampler: sa_solver / beta, if that is your current most reliable branch
Steps: 4
CFG: 1.0
Denoise: 0.55–0.60
SD3 shift: 8 as current control, then test 5 and 6
Resolution: 512–640px long side while testing
Frames: 33–49 while testing
FPS: 12–16
Motion: subtle
Camera: static
LoRAs: none during baseline
Upscaling/interpolation: none during baseline
Face restore: none during baseline

This is not meant to be the final “best possible” setup. It is the control setup. You need a repeatable control before changing settings.


4. Do not micro-tweak CFG

On your hardware, micro-tweaking CFG by 0.1 is a bad use of time.

Instead of:

1.0
1.1
1.2
1.3
1.4
...

Use coarse tests:

CFG 1.0
CFG 1.5
CFG 2.0
CFG 2.5
CFG 3.0 only as a limit test

For your setup, I would treat CFG like this:

CFG 1.0: safest Rapid/Lightning-style baseline
CFG 1.5: mild text pressure
CFG 2.0: moderate text pressure
CFG 2.5: upper useful range to test
CFG 3.0: stress-test boundary
CFG >3.0: likely to overcook identity, color, texture, or motion

If CFG 1.5–2.5 gives no meaningful obedience improvement, stop chasing CFG. The bottleneck is probably elsewhere.


5. Denoise is probably more important than CFG for you

For source-faithful I2V, denoise is one of the strongest identity controls.

Denoise 0.40–0.50: most faithful, least motion, may look stiff
Denoise 0.50–0.60: best starting zone for “make the image move”
Denoise 0.60–0.70: more motion, more identity risk
Denoise 0.70+: more transformation, more AI invention

Since you already found 0.6 useful, I would not abandon it. I would test:

Denoise 0.50
Denoise 0.55
Denoise 0.60
Denoise 0.65

Pick the best identity/motion balance.

If the face changes:

lower denoise first
reduce motion second
add Low-noise steps third
only then try CFG changes

If there is no movement:

raise denoise slightly
make the action simpler and more literal
avoid cinematic wording

6. Shift: test coarse values only

Do not test tiny shift increments. Test meaningful jumps.

For your current setup:

Shift 5
Shift 6
Shift 8

The LightX2V Wan2.2 I2V working-guide discussion recommends:

Euler sampler
Simple scheduler
Shift 5
2 High steps
2 Low steps

Source: LightX2V Wan2.2 I2V working guide discussion

That does not automatically mean shift 5 is best for your current Rapid/AIO branch, but it is a strong branch to test.


7. Sampler advice

For your current Rapid/AIO branch

If sa_solver / beta / 4 steps / CFG 1 / denoise 0.6 / shift 8 is the only thing giving you usable results, keep it as the control.

Do not throw it away just because it sounds weird.

Rapid/distilled/merged models can have very specific intended recipes. The model card for the Rapid AIO family says the models are intended for CFG 1 and 4 steps, and different versions list different sampler recommendations. See WAN2.2 Rapid All-in-One.

For a Lightning-style branch

Test this separately:

Sampler: Euler
Scheduler: Simple
Steps: 4
CFG: 1.0
Shift: 5
Denoise: 0.55–0.60

That lines up with public LightX2V/Wan2.2-Lightning guidance. See Wan2.2-Lightning and the LightX2V working-guide discussion.

Compare this branch against your current sa_solver / beta control. Do not mix the two while testing.


8. Low-noise steps may help face consistency more than CFG

If your workflow exposes the High/Low split, test this before pushing CFG:

Test A: 2 High / 2 Low (fastest 4-step baseline)
Test B: 2 High / 4 Low (more face/detail finishing)
Test C: 4 High / 4 Low (balanced reference)
Test D: 4 High / 6 Low (stronger finishing if time allows)
Test E: 6 High / 4 Low (more broad structure/motion)

For your goal, I would test:

2 High / 2 Low
2 High / 4 Low
4 High / 4 Low

If 2/2 is blurry but 2/4 improves face/detail, that tells you the Low-noise stage was underpowered.


9. Quantization: Q4_K_M is not automatically best on 8GB

On paper, higher quantization quality is better. In practice, on an 8GB laptop GPU, a heavier quant can cause more offload pressure, swapping, instability, or unusable render times.

The QuantStack Wan2.2 I2V A14B GGUF repo lists approximate model sizes such as:

Q3_K_S: 6.52 GB
Q3_K_M: 7.18 GB
Q4_K_S: 8.75 GB
Q4_K_M: 9.65 GB
Q5_K_S: 10.1 GB
Q5_K_M: 10.8 GB
Q6_K: 12 GB
Q8_0: 15.4 GB

Source: QuantStack Wan2.2 I2V A14B GGUF

For an 8GB 4060 laptop, I would test:

Test A: Q3_K_M High / Q3_K_M Low (safest low-VRAM baseline)
Test B: Q4_K_S High / Q4_K_S Low (better quality if stable)
Test C: Q3_K_M High / Q4_K_S Low (prioritize face/detail)
Test D: Q4_K_S High / Q3_K_M Low (prioritize structure/motion)
Test E: Q4_K_M High / Q4_K_M Low (only if the above are stable)

For your priority, I would try:

High-noise: Q3_K_M
Low-noise: Q4_K_S

before assuming:

High-noise: Q4_K_M
Low-noise: Q4_K_M

Why: Low-noise has more influence on final face detail, skin, eyes, mouth, color, and sharpness. If you can only “spend” quality somewhere, spend it on Low-noise first.
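
If it helps to sanity-check a quant pair before downloading, here is a rough back-of-envelope sketch. The file sizes are the QuantStack numbers above; the roughly 2 GB overhead figure and the "one expert resident at a time" assumption are mine, and BlockSwap/offload changes the picture, so treat the output as a hint rather than a guarantee.

# Rough VRAM sanity check for a High/Low GGUF pair on an 8 GB card.
# File sizes (GB) are from the QuantStack listing above; the ~2 GB overhead for
# text encoder remnants, VAE, latents, and activations is an assumption, and
# "one expert resident at a time" is how a High/Low split usually behaves.
SIZES_GB = {"Q3_K_S": 6.52, "Q3_K_M": 7.18, "Q4_K_S": 8.75, "Q4_K_M": 9.65}

def overflow(high, low, vram_gb=8.0, overhead_gb=2.0):
    peak = max(SIZES_GB[high], SIZES_GB[low]) + overhead_gb
    return round(peak - vram_gb, 2)   # GB that must be offloaded/swapped

print(overflow("Q3_K_M", "Q3_K_M"))  # ~1.18 GB over -> modest offload
print(overflow("Q4_K_M", "Q4_K_M"))  # ~3.65 GB over -> much heavier offload

The absolute numbers are rough, but the trend is the point: every step up in quant size buys quality at the cost of more offloading, which is why Q4_K_M can end up slower and less stable than Q4_K_S or Q3_K_M on this card.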


10. Text encoder quantization matters for prompt obedience

If prompt obedience feels weak, do not only blame CFG. The text encoder can matter too.

The city96 UMT5 XXL encoder GGUF card recommends Q5_K_M or larger for best results, while noting that smaller models may still be acceptable in resource-constrained situations. It lists Q3_K_M around 3.06GB, Q4_K_M around 3.66GB, and Q5_K_M around 4.15GB. See city96 UMT5 XXL encoder GGUF.

For your system:

UMT5 Q3_K_M: safest
UMT5 Q4_K_M: reasonable baseline
UMT5 Q5_K_M: better prompt understanding if RAM/offload behavior is tolerable

If CFG does not improve obedience, a better text encoder may help more than CFG micro-tweaks.


11. VAE check: important for color and softness

If Wan2.2 looks redder, softer, or less vivid than expected, check the VAE.

The official ComfyUI Wan2.2 guide distinguishes the model components for different workflows. The 14B I2V workflow uses separate High/Low I2V models and a Wan VAE component; the 5B TI2V workflow uses its own 5B model/VAE setup. See ComfyUI official Wan2.2 guide.

A VAE mismatch can show up as:

red/yellow color cast
soft decode
loss of vividness
skin tone shift
general haze
reconstruction blur

If color is your issue, test VAE/workflow correctness before trying to fix it with prompt words like “neutral color” or “no red tint.”


12. Source image quality matters more than people admit

For face consistency, the source image should have:

clear face
visible eyes
visible mouth
not too small in frame
not heavily compressed
not extreme side profile
not harsh shadow over one eye
not heavy motion blur
not strong fisheye distortion
not sunglasses covering identity
not hands blocking the face

A simple rule:

If the source face is small or unclear, the model has to invent face detail during motion.
When it invents face detail, identity changes.

For baseline testing, use a clean portrait or half-body image. You can do fancy shots later.
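
If the only image you have has a smallish face, a quick manual pre-crop can help before the I2V pass. A minimal Pillow sketch, assuming you pick the face box by eye; the file name and coordinates are placeholders:

# Manual pre-crop sketch using Pillow. The face box is chosen by eye and padded
# so the face fills more of the frame before animating.
from PIL import Image

def crop_face_region(path, box, pad=0.4, out_path="face_crop.png"):
    img = Image.open(path)
    l, t, r, b = box                       # manually chosen face box in pixels
    w, h = r - l, b - t
    l = max(0, int(l - pad * w))
    t = max(0, int(t - pad * h))
    r = min(img.width, int(r + pad * w))
    b = min(img.height, int(b + pad * h))
    img.crop((l, t, r, b)).save(out_path)

crop_face_region("source.png", box=(600, 180, 860, 470))  # placeholder values

Keep the crop generous enough that hair, shoulders, and some background survive, otherwise you are changing the composition you are trying to preserve.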


13. Prompt style for source-faithful animation

Use a boring prompt. Do not make it cinematic. Do not add style words. Do not describe a new scene.

Positive prompt baseline

A realistic image-to-video animation of the person in the source image. Preserve the exact same face, identity, hairstyle, clothing, colors, lighting, and background. The person makes only very subtle natural movement: slight breathing, a small blink, and minimal head movement. Static camera. No zoom. No scene change. Natural colors. Sharp facial details.

Negative prompt baseline

different person, face change, identity change, distorted face, warped eyes, asymmetrical eyes, deformed mouth, changing hairstyle, changing clothes, changing background, camera movement, zoom, scene change, fantasy, sci-fi, anime, painting, overexposed, oversaturated, red tint, blurry, low detail, melted face, extra teeth

Important: at CFG 1, the negative prompt may do very little. Judge negative prompting mostly at CFG 1.5–2.5.


14. Prompt obedience testing

Do not test obedience with complex motion first.

Bad obedience tests:

turns around
walks forward
raises both hands
laughs widely
talks
dances
camera orbits around the subject
wind blows hair dramatically

Good obedience tests:

one subtle blink
gentle breathing only
slight smile
very small head tilt
tiny eye movement

A model that cannot obey “one subtle blink” is not ready for “turns head, smiles, and raises hand.”

Better prompt wording

Instead of:

The woman turns her head and smiles at the camera while wind blows through her hair.

Use:

The person makes a very small natural smile while keeping the same face, same pose, same hairstyle, same clothing, same lighting, and same background. Static camera.

The second prompt gives the model less room to invent.


15. What to do when the model does not obey

First classify the failure.

Failure: prompt action ignored
  Likely cause: too few steps, weak text encoder, action too subtle, distilled limitation
  First fix: slightly raise denoise or simplify the action

Failure: face changes
  Likely cause: denoise too high, Low-noise weak, source face unclear, motion too large
  First fix: lower denoise / add Low steps

Failure: red tint
  Likely cause: VAE/model/sampler/shift issue
  First fix: check VAE, test shift/sampler

Failure: blurry face
  Likely cause: Low-noise too weak, too few steps, low quant, low resolution
  First fix: add Low steps / better Low quant

Failure: background changes
  Likely cause: denoise too high, prompt invites scene change
  First fix: lower denoise / static camera prompt

Failure: too much motion
  Likely cause: denoise/CFG/shift too high, Rapid merge exaggeration
  First fix: lower denoise or reduce the action

Failure: no motion
  Likely cause: denoise too low, prompt too static
  First fix: denoise +0.05

The order I would use:

1. Keep CFG at 1.0.
2. Make the action simpler and more literal.
3. Tune denoise: 0.50 / 0.55 / 0.60 / 0.65.
4. Test shift: 5 / 6 / 8.
5. Add Low-noise steps if available.
6. Improve Low-noise quantization if possible.
7. Test CFG 1.5 / 2.0 / 2.5.
8. Stop before CFG 3 if identity starts changing.

16. Recommended experiment matrix

Do not run huge matrices at full resolution. Use short clips first.

Keep these fixed:

same image
same seed
same prompt
same resolution
same frame count
same workflow branch

Matrix A — denoise

CFG: 1.0
Steps: 4
Shift: current value
Sampler: current best

Test:

0.50
0.55
0.60
0.65

Pick the best identity/motion balance.

Matrix B — shift

Use the best denoise from Matrix A.

Shift 5
Shift 6
Shift 8

Pick the best.

Matrix C — CFG

Use best denoise + best shift.

CFG 1.0
CFG 1.5
CFG 2.0
CFG 2.5
CFG 3.0 only as a limit test

Pick the highest CFG that does not alter identity.

Matrix D — High/Low steps

If available:

2 High / 2 Low
2 High / 4 Low
4 High / 4 Low

If face detail improves with more Low steps, you found a better lever than CFG.

Matrix E — quantization

If using separate GGUF High/Low models:

Q3_K_M High / Q3_K_M Low
Q3_K_M High / Q4_K_S Low
Q4_K_S High / Q4_K_S Low

Avoid assuming Q4_K_M is worth the offload cost on 8GB.
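
Because each render is slow, keep a log so the matrices stay comparable instead of living in memory. A minimal sketch; the column names and the 1-5 scoring idea are only suggestions:

# Minimal run-log sketch: append one CSV row per test render.
import csv, os

LOG = "wan_runs.csv"
FIELDS = ["matrix", "seed", "denoise", "shift", "cfg", "steps_high", "steps_low",
          "identity_score", "motion_score", "notes"]

def log_run(**row):
    new = not os.path.exists(LOG)
    with open(LOG, "a", newline="") as f:
        w = csv.DictWriter(f, fieldnames=FIELDS)
        if new:
            w.writeheader()
        w.writerow(row)

log_run(matrix="A", seed=1234, denoise=0.55, shift=8, cfg=1.0,
        steps_high=2, steps_low=2, identity_score=4, motion_score=3,
        notes="slight eye drift around frame 20")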


17. Additional nodes: what I would and would not add

Worth testing later: WanMoeKSampler

Use it if you are working with separate Wan2.2 A14B High/Low models.

Good for:

clean A14B High/Low workflows
reducing manual High/Low split guessing
debugging MoE transition behavior

Not a fix for:

bad source image
bad VAE
too much denoise
bad prompt
4-step model limitations

Source: WanMoeKSampler

Required for GGUF: ComfyUI-GGUF

Use the proper GGUF loader rather than treating GGUF like a normal checkpoint. The ComfyUI-GGUF README says to replace the stock “Load Diffusion Model” with the “Unet Loader (GGUF)” node. See ComfyUI-GGUF.

Probably skip at 4 steps: CacheDiT

CacheDiT is more useful when you have enough steps to amortize the cache/warmup overhead. For Wan2.2 14B, its README says to use the dedicated Wan Cache Optimizer for best results with the MoE High/Low structure. See ComfyUI-CacheDiT.

My practical rule:

4 steps: skip CacheDiT
6–8 steps: probably skip unless testing
12–20 steps: consider CacheDiT

Useful but separate branch: Kijai WanVideoWrapper

Kijai’s wrapper is useful and often gets Wan-specific optimizations quickly. The official Wan2.2 repo lists it as an alternative implementation. See Wan2.2 official GitHub and Kijai ComfyUI-WanVideoWrapper.

But treat it as a separate branch. Do not change wrapper + sampler + LoRAs + resolution all at once.


18. Things I would avoid during baseline testing

Avoid:

720p
81+ frames
large camera movement
large head turns
talking/lip motion
multiple LoRAs
face restore nodes
interpolation while judging motion
upscaling while judging source fidelity
CFG above 3
high denoise
changing sampler + CFG + denoise together
testing tiny CFG increments

Especially avoid this kind of starting point:

8GB VRAM
A14B Q4_K_M
720p
81 frames
4-step Rapid/AIO
SageAttention
BlockSwap
multiple LoRAs
CFG above 3

That can produce occasional good clips, but it is a terrible learning baseline because too many variables are interacting.


19. Practical final recommendation

For your current setup, I would start here:

Sampler: sa_solver / beta if this is your current reliable branch
Steps: 4
CFG: 1.0
Denoise: 0.55–0.60
Shift: 8 as current control; test 5 and 6 separately
Resolution: 512–640px long side
Frames: 33–49
Prompt: subtle realistic motion only
Camera: static
LoRAs: none
Upscale/interpolation/face restore: off during testing

Then tune in this order:

1. Source image/crop
2. Denoise
3. Shift
4. Low-noise steps
5. Low-noise quantization
6. Text encoder quantization
7. CFG 1.5–2.5
8. Sampler branch
9. Extra nodes

My strongest advice:

Stop trying to solve source fidelity with CFG.
For your setup, source fidelity is mostly denoise + Low-noise behavior + source image quality + correct workflow components. CFG is only a small final pressure adjustment.


20. Short summary

  • Your current results are plausible for 8GB VRAM + GGUF + 4-step Rapid/Lightning-style I2V.
  • CFG is not the main control. Do not micro-tune it.
  • Use CFG 1.0 as baseline; test 1.5 / 2.0 / 2.5 only after denoise and shift.
  • Use denoise 0.50–0.60 for source-faithful animation.
  • If faces drift, lower denoise or improve Low-noise behavior before raising CFG.
  • If available, test 2 High / 4 Low and 4 High / 4 Low.
  • For 8GB VRAM, test Q3_K_M and Q4_K_S before assuming Q4_K_M is best.
  • If you can only improve one expert, improve Low-noise first for face/detail.
  • Use short 33–49 frame clips at 512–640px while testing.
  • Avoid 720p, long clips, multiple LoRAs, and post-processing until the baseline is stable.

Thank you for sharing.

before i posted i wanted to message you directly because ive seen you give great advice to so many people on the forum (until i realized you cant dm users here). thank you very much, youre doing gods work! i will try all your advice and take a good look at the links! much appreciated

i gave the low/high noise unets a try. at first it went from normal first 3 frames to complete blur (like 90% blur). then i fiddled with the settings and got it to remove the blur in the center of the image but sides still blurry. but the quality seems very poor. theres a weird pixelation, not digital like squares, its more like crosshatching. and it introduced a weird lighting artifact. strong yellow light flashing in the middle. any recommended base settings to start with? i started with cfg1 on both but it was a no go

To put it simply, my suspicion is that settings intended for a different model have gotten mixed in:


Wan2.2 I2V-A14B High/Low UNets: blur, crosshatching, yellow flash — likely causes and clean baseline

Looking at the workflow screenshot, the problem is probably not mainly the prompt. It looks more like a sampling schedule / High-Low boundary / step count / VAE / distilled-vs-normal workflow mismatch problem.

The suspicious settings in the screenshot are:

HighNoise GGUF -> ModelSamplingSD3 shift 5.00 -> WanMoeKSampler model_high_noise
LowNoise GGUF  -> ModelSamplingSD3 shift 5.00 -> WanMoeKSampler model_low_noise

WanMoeKSampler:
  boundary: 0.750
  add_noise: enable
  steps: 6
  cfg_high_noise: 1.5
  cfg_low_noise: 2.0
  sampler_name: euler
  scheduler: simple
  sigma_shift: 4.00
  return_with_leftover_noise: disable

The short version:

The screenshot looks like a hybrid between a normal Wan2.2 High/Low UNet workflow and a 4-step Lightning/LightX2V-style workflow. That hybrid zone can easily cause heavy blur, side blur, crosshatching texture, and yellow lighting flashes.


1. Biggest issue: boundary = 0.750

For Wan2.2 I2V, boundary = 0.750 is the first thing I would change.

The WanMoeKSampler README says the Wan2.2 boundary is around:

Wan2.2 T2V: 0.875
Wan2.2 I2V: 0.900

It also explains that this boundary is a diffusion timestep, not a denoising step. The actual switch step depends on total steps, sampler, scheduler, and sigma shift.

So for Wan2.2 I2V, reset this:

boundary: 0.750

to this:

boundary: 0.900

Why this matters

Wan2.2 A14B uses separate denoising experts:

High-noise expert: early structure, broad layout, motion, pose, composition
Low-noise expert: later detail, face, eyes, mouth, skin, color, texture, final sharpness

The Wan2.2 I2V-A14B model card describes this High-noise / Low-noise MoE design and the idea that the experts specialize in different denoising stages.

If the boundary is too low, the High-noise model can stay active too long and the Low-noise model may not get enough useful refinement time.

That can look like:

first frames look okay
then the clip turns blurry
center improves but sides remain mushy
fine texture looks scratchy/crosshatched
lighting becomes unstable
faces fail to refine

So the first clean correction is:

boundary: 0.900

2. Second issue: steps = 6 is too low for judging normal High/Low UNets

Six steps is very low for the normal Wan2.2 I2V-A14B High/Low model pair.

It can be useful as a quick smoke test, but it is not a fair quality test unless you are using a proper distilled / Lightning / LightX2V setup.

For the normal High/Low UNets, I would test:

steps: 12

If that is too slow on 8GB VRAM, use this only as a compromise:

steps: 8

But I would not judge the normal High/Low pair from 6 steps. At 6 steps, the Low-noise expert may simply not have enough time to resolve detail.

Symptoms of too few steps:

crosshatching texture
unfinished skin/detail
soft edges
side blur
poor face detail
color flicker
lighting pulses

3. Third issue: you may be applying shift twice

The screenshot shows:

ModelSamplingSD3 shift: 5.00

before both models, plus:

WanMoeKSampler sigma_shift: 4.00

inside the WanMoeKSampler.

While debugging, that is too ambiguous. Use one source of shift only.

Recommended cleanup

For the first stable baseline, remove the two ModelSamplingSD3 nodes:

HighNoise GGUF -> WanMoeKSampler model_high_noise
LowNoise GGUF  -> WanMoeKSampler model_low_noise

Then set this inside WanMoeKSampler:

sigma_shift: 5.0

This gives you one clear place controlling the shift.

Why 5.0? The LightX2V Wan2.2 I2V model card recommends Euler with:

shift: 5.0
guidance_scale: 1.0

for its distilled branch. More importantly, 5.0 is also a sane first test value when cleaning up the graph.

The key point is:

Do not run ModelSamplingSD3 shift 5 plus WanMoeKSampler sigma_shift 4 while trying to diagnose artifacts.

After you get a stable baseline, you can test whether the external ModelSamplingSD3 nodes help. But they should not be part of the first diagnosis pass.
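
For intuition about why the stacked shift matters: if both shifts really do end up applied to the same schedule, the two shift maps compose into one much stronger shift. A quick numeric check, assuming the standard SD3/flow shift formula and multiplicative stacking, which may or may not be exactly what these two nodes do in your graph:

# What stacked shifts would do, assuming both are applied multiplicatively to
# the same schedule via t' = s*t / (1 + (s-1)*t).
def shift(t, s):
    return s * t / (1.0 + (s - 1.0) * t)

t = 0.5
print(shift(shift(t, 5.0), 4.0))  # ~0.952
print(shift(t, 20.0))             # ~0.952 -> shift 5 on top of shift 4 acts like shift 20

An effective shift that large would squeeze almost the entire run into the high-noise region, which fits the "never refines, stays blurry" symptom. That is the reason to keep exactly one source of shift while diagnosing.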


4. Fourth issue: CFG values are in the wrong middle zone

The screenshot uses:

cfg_high_noise: 1.5
cfg_low_noise: 2.0

That is neither a strict Lightning/LightX2V recipe nor a normal High/Low baseline.

You need to decide which branch you are testing.


Branch A — normal Wan2.2 I2V-A14B High/Low UNets

Use this branch if you are loading the normal HighNoise and LowNoise GGUFs without Lightning/LightX2V LoRAs.

In this branch, CFG 1.0 is usually too weak. CFG 1.0 is mostly a Rapid/Lightning/distilled habit, not a universal Wan2.2 setting.

Recommended baseline:

High model:
  Wan2.2 I2V-A14B HighNoise GGUF

Low model:
  Wan2.2 I2V-A14B LowNoise GGUF

Remove:
  ModelSamplingSD3 nodes before WanMoeKSampler

WanMoeKSampler:
  boundary: 0.900
  add_noise: enable
  steps: 12
  cfg_high_noise: 3.0
  cfg_low_noise: 3.0
  sampler_name: euler
  scheduler: simple
  sigma_shift: 5.0
  start_at_step: 0
  end_at_step: 10000
  return_with_leftover_noise: disable

VAE:
  wan_2.1_vae.safetensors

Test size:
  33 frames
  512-640px long side
  fixed seed

Disable during baseline:
  LoRAs
  upscalers
  interpolation
  face restore
  post-sharpening
  color correction

If 12 steps is too slow:

boundary: 0.900
steps: 8
cfg_high_noise: 3.0
cfg_low_noise: 3.0
sampler_name: euler
scheduler: simple
sigma_shift: 5.0

But treat 8 steps as a sanity test, not a final quality test.


Branch B — Lightning / LightX2V / distilled 4-step branch

Use this branch only if you are using matching Lightning/LightX2V I2V LoRAs or a proper distilled LightX2V setup.

The LightX2V Wan2.2 I2V card recommends:

Euler scheduler
shift: 5.0
guidance_scale: 1.0

It describes this as running without CFG. The README also says the distilled model is built for substantially fewer inference steps, specifically 4-step-style use.

Strict distilled baseline:

High model:
  compatible Wan2.2 I2V-A14B HighNoise model

Low model:
  compatible Wan2.2 I2V-A14B LowNoise model

LoRAs:
  matching I2V Lightning/LightX2V High LoRA
  matching I2V Lightning/LightX2V Low LoRA
  strength: 1.0 each

Remove:
  external ModelSamplingSD3 nodes during baseline

WanMoeKSampler:
  boundary: 0.900
  add_noise: enable
  steps: 4
  cfg_high_noise: 1.0
  cfg_low_noise: 1.0
  sampler_name: euler
  scheduler: simple
  sigma_shift: 5.0
  start_at_step: 0
  end_at_step: 10000
  return_with_leftover_noise: disable

VAE:
  wan_2.1_vae.safetensors

Test size:
  33 frames
  512-640px long side
  fixed seed

Do not mix this with the normal branch.

Bad hybrid zone:

normal High/Low GGUFs
+ no matching distilled LoRAs
+ 6 steps
+ CFG around 1-2
+ boundary 0.750
+ external shift 5
+ internal sigma_shift 4

That is exactly the kind of setup that can produce blur, crosshatching, and flashing.


5. VAE check: very important

For Wan2.2 14B I2V, check that you are using:

wan_2.1_vae.safetensors

The ComfyUI Wan2.2 docs and ComfyUI Wan2.2 examples point to wan_2.1_vae.safetensors for the 14B workflows.

A wrong or mismatched VAE can look like:

soft decode
general haze
yellow/red color cast
skin tone shift
center glow
lighting flash
poor reconstruction
blurred details

Do not try to fix a VAE mismatch with prompts like “no yellow light.” Fix the VAE first.


6. Artifact-by-artifact diagnosis

A. “First 3 frames normal, then 90% blur”

Most likely causes:

boundary too low
too few total steps
Low-noise expert starts too late
shift schedule conflict
wrong VAE
normal UNets being run like a distilled 4-step model

Fix order:

1. boundary: 0.900
2. remove external ModelSamplingSD3 nodes
3. sigma_shift: 5.0 inside WanMoeKSampler
4. VAE: wan_2.1_vae.safetensors
5. normal branch: steps 12, CFG 3.0 / 3.0
6. distilled branch: steps 4, CFG 1.0 / 1.0, matching LoRAs only

B. “Center improved but sides are still blurry”

Likely causes:

not enough Low-noise refinement
bad High/Low boundary
low step count
resolution/aspect stress
VAE softness
quantization/offload instability
post-processing or resize issue

Try:

33 frames only
512-640px long side
boundary 0.900
steps 12 if normal branch
correct VAE
no post nodes
no upscaler
no interpolation
no face restore

Also use clean dimensions. Examples:

512x288
576x320
640x360
640x384
384x640 for portrait

Avoid large or odd dimensions while debugging.
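
If you want a quick way to derive debug-friendly dimensions and frame counts from an arbitrary source image, here is a small helper. It assumes width/height divisible by 16 and the 4n+1 frame pattern (33, 49, 81) suit this workflow; adjust if your loader expects something different:

# Hypothetical helper: snap test dimensions and frame counts to safe values.
# Assumes /16-divisible width/height and 4n+1 frame counts.
def snap_size(src_w, src_h, long_side=640, multiple=16):
    scale = long_side / max(src_w, src_h)
    w = max(multiple, round(src_w * scale / multiple) * multiple)
    h = max(multiple, round(src_h * scale / multiple) * multiple)
    return w, h

def snap_frames(n):
    return 4 * round((n - 1) / 4) + 1    # e.g. 48 -> 49, 34 -> 33

print(snap_size(1920, 1080))  # (640, 352)
print(snap_frames(48))        # 49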


C. “Crosshatching texture, not square pixelation”

That usually means incomplete or unstable denoising, not classic video compression.

Most likely causes:

6 steps is too low
boundary is wrong
GGUF quantization is stressed
shift schedule is confused
Low-noise refinement is underpowered
VAE decode is wrong or mismatched

The QuantStack Wan2.2 I2V-A14B GGUF page lists approximate quant sizes such as:

Q2_K:    5.3 GB
Q3_K_S:  6.52 GB
Q3_K_M:  7.18 GB
Q4_K_S:  8.75 GB
Q4_K_M:  9.65 GB
Q5_K_S: 10.1 GB
Q5_K_M: 10.8 GB
Q6_K:   12 GB
Q8_0:   15.4 GB

On an 8GB laptop GPU, Q4_K_M can be theoretically better but practically worse if it causes too much offloading, swapping, or instability.

Low-VRAM quant tests:

Test A:
  High: Q3_K_M
  Low:  Q3_K_M

Test B:
  High: Q3_K_M
  Low:  Q4_K_S

Test C:
  High: Q4_K_S
  Low:  Q4_K_S

For face/detail fidelity, the most interesting test is:

High: Q3_K_M
Low:  Q4_K_S

Reason: the Low-noise model is the detail finisher.


D. “Strong yellow light flashing in the middle”

This is probably not a prompt issue.

Likely causes:

wrong VAE
double shift / schedule conflict
LightX2V LoRA trajectory mismatch
normal High/Low UNets using distilled settings
too few steps
bad High/Low boundary
quantization + low-step instability

Fix order:

1. confirm VAE = wan_2.1_vae.safetensors
2. remove external ModelSamplingSD3 nodes
3. boundary = 0.900
4. sigma_shift = 5.0
5. normal branch: 12 steps, CFG 3.0 / 3.0
6. distilled branch: 4 steps, CFG 1.0 / 1.0, matching LoRAs only
7. disable upscaler/interpolation/face restore
8. test 33 frames at 512-640px long side

A negative prompt can include yellow flash, but if the denoising path or VAE is wrong, the prompt will not reliably fix it.


7. What to check in the console

Check where WanMoeKSampler actually switches from High-noise to Low-noise.

Look for something equivalent to:

switching model at step X

Do not reason from boundary alone. The WanMoeKSampler README explains that diffusion timestep is not the same thing as denoising step.

For a 4-step distilled branch, you generally want something close to:

High: 2 steps
Low:  2 steps

For a normal 12-step branch, you want enough Low-noise steps left to refine detail. If Low-noise only gets a tiny part of the run, blur and poor texture are expected.
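
If the console line is hard to find, you can also estimate where the switch should land. A rough sketch, assuming a uniform timestep grid with the SD3/flow shift applied; the real WanMoeKSampler schedule may differ slightly, so use it only as a sanity check against the console output:

# Rough estimate of the High->Low switch step for a given step count, shift,
# and boundary. Assumes timesteps t' = s*t / (1 + (s-1)*t) on a uniform grid.
def switch_step(steps, shift, boundary):
    for i in range(steps):
        t = 1.0 - i / steps                        # unshifted timestep for step i
        t = shift * t / (1.0 + (shift - 1.0) * t)  # shifted timestep
        if t < boundary:
            return i                               # first step run by the Low expert
    return steps

print(switch_step(12, 5.0, 0.900))  # 5 -> Low expert gets ~7 of 12 steps
print(switch_step(12, 5.0, 0.750))  # 8 -> Low expert gets only ~4 of 12 steps

Under those assumptions, boundary 0.750 hands most of the run to the High-noise expert, which is consistent with the blur and unfinished texture you saw.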


8. Text encoder check

If prompt obedience is weak, do not only raise CFG. Text encoder quantization can matter.

The city96 UMT5 XXL encoder GGUF page says Q5_K_M or larger is recommended for best results, while smaller models can still be acceptable in constrained setups.

Approximate sizes listed there include:

Q3_K_M: about 3.06 GB
Q4_K_M: about 3.66 GB
Q5_K_M: about 4.15 GB
Q8_0:   about 6.04 GB
F16:    about 11.4 GB

For an 8GB GPU setup:

UMT5 Q3_K_M:
  safest memory option

UMT5 Q4_K_M:
  good low-VRAM baseline

UMT5 Q5_K_M:
  better prompt understanding if system RAM/offload behavior allows it

Weak prompt obedience may be text-encoder-related, not just CFG-related.


9. Suggested prompt while debugging

Use a boring source-faithful prompt. Do not use cinematic lighting while debugging a yellow lighting artifact.

Positive

A realistic image-to-video animation of the person in the source image. Preserve the exact same face, identity, hairstyle, clothing, colors, lighting, and background. The person makes only very subtle natural movement: slight breathing, a small blink, and minimal head movement. Static camera. No zoom. No scene change. Natural colors. Sharp facial details.

Negative

different person, face change, identity change, distorted face, warped eyes, asymmetrical eyes, deformed mouth, changing hairstyle, changing clothes, changing background, camera movement, zoom, scene change, fantasy, sci-fi, anime, painting, overexposed, oversaturated, red tint, yellow flash, blurry, low detail, melted face, extra teeth, crosshatching, noisy texture

At CFG 1.0, the negative prompt may have little practical effect. It should matter more in the normal branch at CFG around 3.0.


10. Minimal troubleshooting plan

Run these in order. Change only one branch at a time.

Test 1 — normal High/Low sanity test

Remove:
  both ModelSamplingSD3 nodes

WanMoeKSampler:
  boundary: 0.900
  add_noise: enable
  steps: 12
  cfg_high_noise: 3.0
  cfg_low_noise: 3.0
  sampler_name: euler
  scheduler: simple
  sigma_shift: 5.0
  start_at_step: 0
  end_at_step: 10000
  return_with_leftover_noise: disable

VAE:
  wan_2.1_vae.safetensors

Video:
  33 frames
  512-640px long side
  fixed seed

Disable:
  LoRAs
  upscaler
  interpolation
  face restore
  postprocessing

If this improves blur/crosshatching/yellow flash, the previous issue was probably:

boundary too low
too few steps
CFG too low for normal branch
shift conflict
VAE mismatch

Test 2 — cheaper normal-branch sanity test

If 12 steps is too slow:

Same as Test 1, but:

steps: 8

If 8 looks bad but 12 improves, the issue is mainly under-refinement.


Test 3 — strict Lightning/LightX2V branch

Only use this if you are using matching I2V Lightning/LightX2V LoRAs or a proper distilled LightX2V setup.

Use:
  matching I2V Lightning/LightX2V LoRAs
  LoRA strength: 1.0 each

Remove:
  both ModelSamplingSD3 nodes

WanMoeKSampler:
  boundary: 0.900
  add_noise: enable
  steps: 4
  cfg_high_noise: 1.0
  cfg_low_noise: 1.0
  sampler_name: euler
  scheduler: simple
  sigma_shift: 5.0
  start_at_step: 0
  end_at_step: 10000
  return_with_leftover_noise: disable

VAE:
  wan_2.1_vae.safetensors

Video:
  33 frames
  512-640px long side

If this still has yellow flashing, suspect:

wrong LoRA pair
T2V LoRA used in I2V
High/Low LoRAs mismatched
wrong VAE
wrong model pair
double shift
workflow node mismatch

11. Recommended settings table

Normal High/Low sanity baseline:
  boundary 0.900, steps 12, cfg high 3.0, cfg low 3.0, euler, simple, shift 5.0
  Notes: best next test

Normal low-cost test:
  boundary 0.900, steps 8, cfg high 3.0, cfg low 3.0, euler, simple, shift 5.0
  Notes: debug only

Strict Lightning/LightX2V:
  boundary 0.900, steps 4, cfg high 1.0, cfg low 1.0, euler, simple, shift 5.0
  Notes: only with matching distilled LoRAs/model

Current screenshot:
  boundary 0.750, steps 6, cfg high 1.5, cfg low 2.0, euler, simple, internal shift 4.0 plus external shift 5.0
  Notes: likely unstable hybrid

12. Things I would avoid right now

Avoid this while debugging:

boundary 0.750
steps 6
CFG 1-ish on normal High/Low UNets
external ModelSamplingSD3 shift + WanMoe sigma_shift
720p
49+ frames before baseline works
upscale
interpolation
face restore
multiple LoRAs
cinematic lighting prompts
large camera movement

Especially avoid judging the workflow from this combination:

8GB VRAM
normal High/Low GGUFs
boundary 0.750
steps 6
CFG 1.5 / 2.0
external shift 5
internal sigma_shift 4
large frame count
large resolution

That setup has too many unstable variables.


13. Short answer

Your screenshot has three likely problems:

1. boundary should probably be 0.900, not 0.750, for Wan2.2 I2V
2. six steps is too low for judging the normal High/Low UNets
3. shift is likely being applied ambiguously through both ModelSamplingSD3 and WanMoeKSampler

My recommended reset for the normal High/Low branch:

Remove:
  ModelSamplingSD3 nodes

WanMoeKSampler:
  boundary: 0.900
  steps: 12
  cfg_high_noise: 3.0
  cfg_low_noise: 3.0
  sampler: euler
  scheduler: simple
  sigma_shift: 5.0
  add_noise: enable
  return_with_leftover_noise: disable

VAE:
  wan_2.1_vae.safetensors

Test:
  33 frames
  512-640px long side
  no LoRAs
  no upscale
  no interpolation
  no face restore

If you are actually using Lightning/LightX2V LoRAs, use the strict distilled branch instead:

boundary: 0.900
steps: 4
cfg_high_noise: 1.0
cfg_low_noise: 1.0
sampler: euler
scheduler: simple
sigma_shift: 5.0
LoRA strength: 1.0
VAE: wan_2.1_vae.safetensors

The blur, crosshatching, and yellow flash are most likely from:

wrong boundary
+ too few steps
+ ambiguous/double shift
+ possibly wrong VAE
+ possibly mixing normal and distilled recipes


yup this did it. the two sd3’s were the culprits. you nailed it! its working now, no distortions, no artifacts, has better obedience and face permanence. now if i could only fix the quality part. everything has some kind of low def bloom, airbrushed, blended quality to it. this isnt just this specific setups issue. i noticed when i switched from [rapidWAN22I2VGGUF_q4KMRapidBase.gguf] to [wan2.2-i2v-rapid-aio-v10-nsfw-Q4_K.gguf] a week ago. [rapidWAN22I2VGGUF_q4KMRapidBase.gguf] basically kept true to the source image no matter what it was. even low res screengrabs. it just made whatever i fed it move. [wan2.2-i2v-rapid-aio-v10-nsfw-Q4_K.gguf] and the two low/high unets always gave me this weird dream sequence kind of bloom.

i tried the

Test A: Q3_K_M High / Q3_K_M Low (safest low-VRAM baseline)

its stable, no oom, no hiccups. im gonna move forward and test the B option. any setting changes for option A to tweak in order to squeeze more juice out of it before i move to the next step?

i tried them all, unfortunately results were very poor. very ai slop looking, refused to follow complex prompts, hallucinated, just not feasible for my setup. im just gonna have to go back to this humble but effective setup that worked surprisingly well. it literally takes any source image and it animates it staying faithful to the quality 1:1. everything else i tried was a bust. however obedience to prompts and sometimes face morph are very hit and miss based on seed. one final question and then i promise i stop bothering you. how can i get the best results out of my setup (in the screenshot)? since im sticking with this till i get a better gpu i wanna at least squeeze the most out of it. im 100% satisfied with the image quality, its literally like the picture came to life. i just need more obedience and adherence to prompts, and ensuring the face stays the same (thats the biggest issue sometimes, it loses face permanence). which ksampler advanced settings to tweak to get the best result? and finally, is there a free website or some other resource for prompt restructuring? i cant use ollama etc cause it takes too big a bite of vram inside comfy. is there anything you would suggest me to add to my setup? tyvm for all your help

(btw the full name of the unet is rapidWAN22I2VGGUF_q4KMRapidBase.gguf, cant see it fully in the screenshot)

I’m glad the correct answer was included. :laughing:

Hmm… I think I’ve got a pretty good grasp of the situation now. By the way, distilled models like Lightning tend to struggle with accurately reflecting prompt details, especially negative prompts, but there’s still room for improvement. Their responsiveness to positive prompts is actually quite good. Also, if you’re looking for highly complex prompt responses, I think it’s worth considering other variations (if they exist).

Distilled models are typically trained to reproduce a larger or slower teacher in far fewer steps, sometimes combined with pruning, and in that process, capabilities that matter for your specific purpose can get lost. Well, I guess it can’t be helped if the goal is to save VRAM… But in any case, this means you also have to consider the performance of the model itself, or rather, the inherent characteristics of the distilled model.

By the way, if you’ll use an LLM for prompt refinement, I think using the Gemini or ChatGPT API is the easiest way, but if you want to do it entirely locally, an OSS LLM might be better. For this purpose, a smaller model from a high-quality OSS model family is perfectly sufficient. The models provided by Liquid (which include 1.2B and even 350M variants) run just fine locally on a CPU. Other SOTA models like Qwen 3.5 and Gemma 4 in the 4B class or smaller can also run on CPU alone. A 4B model is a bit heavy for a CPU, but at least these don’t consume VRAM; they run in RAM. Of course, they’d be much faster with VRAM!
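
If you go the local route, a minimal CPU-only sketch with llama-cpp-python is below. The GGUF path and the system instruction are placeholders; any small instruct-tuned GGUF should work, and n_gpu_layers=0 keeps it off the GPU so it does not compete with ComfyUI for VRAM:

# Minimal CPU-only prompt-restructuring sketch with llama-cpp-python.
# The model path and instruction text are placeholders.
from llama_cpp import Llama

llm = Llama(model_path="small-instruct-model.gguf", n_ctx=2048, n_gpu_layers=0)

def restructure_prompt(rough_idea: str) -> str:
    msgs = [
        {"role": "system", "content":
            "Rewrite the user's idea as a short Wan I2V prompt: one small action, "
            "same face, identity, clothing, lighting, and background, static camera."},
        {"role": "user", "content": rough_idea},
    ]
    out = llm.create_chat_completion(messages=msgs, max_tokens=120, temperature=0.3)
    return out["choices"][0]["message"]["content"].strip()

print(restructure_prompt("woman smiles a little and blinks, keep her exactly the same"))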


Wan2.2 RapidBase I2V on 8GB VRAM: getting more prompt obedience without losing source-image fidelity

At this point I would stop chasing the normal High/Low UNet route for this GPU and use rapidWAN22I2VGGUF_q4KMRapidBase.gguf as the main workflow.

That is not a downgrade. For the actual goal here — make the source image look like it came to life while preserving the same face, same lighting, same color, same texture, same source quality, and no AI-looking bloom — this model is doing the right kind of thing. The normal High/Low route may be more flexible in theory, but on an 8GB card it is costing too much source fidelity.

The new goal should be:

Keep RapidBase.
Keep source-image fidelity.
Add only mild prompt pressure.
Reduce face morphing.
Avoid turning the workflow into a repainting/generative workflow.


1. Why RapidBase is the right baseline for this specific goal

The High/Low UNet experiment was still useful because it proved one thing: the duplicated SD3 shift setup really was causing artifacts. Removing those conflicting shift nodes fixed distortion and improved obedience/face permanence. But the second lesson is more important:

A technically cleaner High/Low workflow still did not give the desired look.

The preferred model, rapidWAN22I2VGGUF_q4KMRapidBase.gguf, behaves more like a source-preserving animator than a full generative video model. That is exactly why it works well for this use case.

It is good at:

keeping the source image quality
keeping low-res screengrabs looking like themselves
preserving lighting and colors
preserving background
avoiding the airbrushed Wan2.2 dream-sequence look
making the original picture move

It is weaker at:

complex multi-action prompts
large head turns
speaking / mouth motion
hand gestures
strong semantic obedience
large expression changes
camera moves

That tradeoff is expected. A workflow that preserves the source image 1:1 is not going to be as willing to invent new actions. More obedience usually requires more invention; more invention means more risk of face drift.

So the right strategy is not:

force the model to obey huge prompts

The right strategy is:

ask for one small action
add only mild prompt pressure
use seed batching
choose outputs by face permanence first

2. Current control setup

From the screenshot, the current effective workflow is roughly:

Model:
  rapidWAN22I2VGGUF_q4KMRapidBase.gguf

VAE:
  wan_2.1_vae.safetensors

Text encoder:
  umt5-xxl-encoder-Q8_0.gguf

KSampler Advanced:
  add_noise: enable
  steps: 10
  cfg: 1.0
  sampler_name: sa_solver
  scheduler: beta
  start_at_step: 1
  end_at_step: 10000
  return_with_leftover_noise: enable

Save this as the control workflow.

Do not overwrite it. Duplicate it before experiments.

Testing rule:

same image
same prompt
same seed
same frame count
same resolution
change one setting only

If you change CFG, steps, start step, sampler, and prompt at the same time, the result becomes impossible to interpret.


3. Why CFG should stay low

The Rapid/AIO family is explicitly described as a fast all-in-one merge designed around few steps and CFG 1. One README snapshot recommends:

4 steps
1 cfg
sa_solver sampler
beta scheduler

Source: Phr00t Rapid AIO README snapshot

That does not mean the exact best value for your workflow must be exactly 4 steps. Your screenshot already works at 10 steps. But it does mean this model should be tuned like a few-step distilled / rapid model, not like a normal 20-30 step diffusion workflow.

Do not jump to:

cfg: 3.0
cfg: 4.0
cfg: 5.0

That is likely to cause:

face drift
new skin texture
bloom
over-smoothing
changed lighting
new expression
hallucinated details

Use a micro-range instead.


4. CFG test range

Current baseline:

cfg: 1.0

Recommended test values:

1.00
1.15
1.25
1.35
1.50

Interpretation:

CFG 1.00: maximum source fidelity, weakest negative-prompt effect
CFG 1.15: tiny prompt pressure
CFG 1.25: likely first useful obedience bump
CFG 1.35: upper mild test
CFG 1.50: stress test for face drift
CFG 2.00+: probably too much if face permanence matters

The likely useful zone is:

cfg: 1.15-1.35

Rule:

Use the highest CFG that does not change the face.

Test like this:

Run A:
  cfg: 1.00

Run B:
  cfg: 1.15

Run C:
  cfg: 1.25

Run D:
  cfg: 1.35

Run E:
  cfg: 1.50

Keep everything else identical.

Judge in this order:

1. same face / same identity
2. same source-image quality
3. no morphing
4. no artifacts
5. prompt obedience
6. natural motion

Prompt obedience is not the first priority. A clip that obeys perfectly but changes the face is a failed clip for this workflow.


5. Negative prompts are weak at CFG 1

A common trap is adding a giant negative prompt and expecting it to control the output. In many few-step Wan/Rapid/Lightning-style workflows, CFG 1 means negative prompts are weak or mostly inactive.

The Wan prompting guide explains this directly: in standard diffusion, CFG above 1 gives the model a stronger positive-vs-negative comparison, but in few-step CFG 1 workflows, negative prompts often do little. See How to get the most out of prompts for WAN models.
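
To see why, here is the standard CFG combine written out as a generic sketch (not the exact Wan/ComfyUI code):

# Standard classifier-free guidance combine:
#   out = neg + cfg * (pos - neg)
# At cfg = 1.0 this collapses to out = pos, so the negative branch contributes
# nothing; many CFG-1 workflows skip the negative pass entirely to save time.
def cfg_combine(pos, neg, cfg):
    return neg + cfg * (pos - neg)

print(cfg_combine(pos=2.0, neg=5.0, cfg=1.0))   # 2.0  -> only the positive matters
print(cfg_combine(pos=2.0, neg=5.0, cfg=1.25))  # 1.25 -> mild push away from the negative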

Practical consequence:

Do not rely on a huge negative prompt.
Put the important preservation rules in the positive prompt.

Positive prompt should explicitly say:

same face
same identity
same hairstyle
same clothing
same lighting
same colors
same camera angle
same background
static camera
no zoom
no scene change
only subtle motion

A short negative prompt is still fine, but it is secondary.


6. start_at_step: test 1 vs 0

Current screenshot:

start_at_step: 1

This may be helping source fidelity. Starting at step 1 can skip a tiny early part of the denoising path, which may reduce repainting.

Test only:

start_at_step: 1
start_at_step: 0

Expected tradeoff:

start_at_step 1
  Likely benefit: better source fidelity and face permanence
  Risk: weaker motion / weaker prompt response

start_at_step 0
  Likely benefit: more motion and prompt response
  Risk: more face drift / more repainting

Suggested test:

Run A:
  cfg: 1.25
  start_at_step: 1
  steps: 10

Run B:
  cfg: 1.25
  start_at_step: 0
  steps: 10

Possible decisions:

If 0 improves obedience and the face stays stable: keep start_at_step: 0
If 0 gives more motion but the face changes: keep start_at_step: 1
If there is no meaningful difference: keep start_at_step: 1
If 0 adds bloom/repainting: keep start_at_step: 1

My expectation: start_at_step: 1 may remain the safest default.


7. Steps: test 8 / 10 / 12

Current setting:

steps: 10

This may already be close to the sweet spot.

Few-step distilled models do not always improve with more steps. Sometimes extra steps create more smoothing, blending, or repainting.

Test only:

steps: 8
steps: 10
steps: 12

Expected behavior:

Steps 8: faster, possibly more source-faithful, possibly weaker obedience
Steps 10: current working baseline
Steps 12: may improve smoothness/obedience, but may add bloom or airbrushing
Steps 16+: not recommended for this model unless intentionally stress-testing

Suggested test:

Run A:
  steps: 8

Run B:
  steps: 10

Run C:
  steps: 12

Keep the best balance. If 12 adds the “dream sequence” look, go back to 10.


8. return_with_leftover_noise: test once

Current screenshot:

return_with_leftover_noise: enable
end_at_step: 10000

Since end_at_step is far beyond the actual step count, the sampler is probably completing its pass. This setting may not matter much, but test it once.

Run A:
  return_with_leftover_noise: enable

Run B:
  return_with_leftover_noise: disable

Keep whichever preserves the “picture came to life” look.

Do not spend a whole day on this. It is unlikely to be the main obedience or face-permanence control.


9. add_noise: keep enabled

Keep:

add_noise: enable

For image-to-video, the model needs noise to create motion. If you disable it, you may get a more frozen output or odd behavior depending on the rest of the graph.

Only test add_noise: disable if diagnosing a very specific problem:

every seed changes the face
motion is always too aggressive
the image is being repainted too much

Even then, treat it as a diagnostic test, not the likely final setting.


10. Sampler and scheduler: keep sa_solver / beta

Your current best branch uses:

sampler_name: sa_solver
scheduler: beta

Keep that as the main branch.

The Rapid/AIO README snapshot specifically recommends sa_solver and beta for that family. Source: Rapid AIO README snapshot.

If you want to test alternatives, do it only after the CFG/start/steps tests, and keep them as separate branches:

Branch A:
  sa_solver / beta

Branch B:
  euler / beta

Branch C:
  euler_a / beta

Branch D:
  euler / simple

Expected behavior:

sa_solver / beta: best current source-fidelity branch
euler / beta: may obey differently, possibly less faithful
euler_a / beta: more variation/motion, higher face-drift risk
euler / simple: more relevant to Lightning/LightX2V-style workflows

I would not change sampler/scheduler unless the smaller tests fail.


11. Seed batching is now one of the strongest tools

You already noticed face morphing is seed-dependent. That is real.

In video generation, the seed affects:

eye behavior
mouth behavior
micro-expression
small head motion
whether face identity drifts
whether the source texture holds

Use two phases.

Phase A — setting tests

Use one fixed seed:

fixed seed
same image
same prompt
same resolution
same frame count
change one setting only

This tells you what the setting does.

Phase B — production seed search

After choosing settings, run:

8-16 seeds
same image
same prompt
same final settings
short preview first

Pick by this priority:

1. same face / same identity
2. same source-image quality
3. no morphing
4. natural motion
5. prompt obedience
6. no artifacts

For your goal, a seed that keeps the face and obeys 70% is better than a seed that obeys 100% and changes the person.
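
If queueing 8-16 seeds by hand gets tedious, the sweep can be scripted against the local ComfyUI API. A minimal sketch, assuming the workflow has been exported in API format, the server is at the default 127.0.0.1:8188, and node id "3" is the KSampler Advanced node in that export; check your own JSON for the real node id and adjust:

# Minimal seed-sweep sketch against a local ComfyUI instance.
# Assumes an API-format workflow export; the node id "3" is a placeholder.
import json, copy, urllib.request

with open("workflow_api.json") as f:
    base = json.load(f)

def queue(workflow):
    data = json.dumps({"prompt": workflow}).encode()
    req = urllib.request.Request("http://127.0.0.1:8188/prompt", data=data,
                                 headers={"Content-Type": "application/json"})
    return json.loads(urllib.request.urlopen(req).read())

for seed in range(1000, 1012):              # 12 seeds, everything else identical
    wf = copy.deepcopy(base)
    wf["3"]["inputs"]["noise_seed"] = seed  # KSampler Advanced exposes "noise_seed"
    print(seed, queue(wf))

Queue them at preview resolution first, note the seeds that keep the face, then rerun only those seeds at your normal settings.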


12. Exact tuning plan

Matrix 0 — save control

Model:
  rapidWAN22I2VGGUF_q4KMRapidBase.gguf

VAE:
  wan_2.1_vae.safetensors

Text encoder:
  umt5-xxl-encoder-Q8_0.gguf

KSampler Advanced:
  add_noise: enable
  steps: 10
  cfg: 1.0
  sampler_name: sa_solver
  scheduler: beta
  start_at_step: 1
  end_at_step: 10000
  return_with_leftover_noise: enable

Save this output as the reference.

Matrix 1 — CFG

cfg: 1.00
cfg: 1.15
cfg: 1.25
cfg: 1.35
cfg: 1.50

Pick the highest CFG that does not alter identity.

Matrix 2 — start step

Use the best CFG.

start_at_step: 1
start_at_step: 0

Keep 1 unless 0 clearly improves obedience without face drift.

Matrix 3 — steps

Use best CFG and best start step.

steps: 8
steps: 10
steps: 12

Keep the one with the least bloom/airbrushing and best face permanence.

Matrix 4 — leftover noise

Use best CFG/start/steps.

return_with_leftover_noise: enable
return_with_leftover_noise: disable

Keep the more source-faithful result.

Matrix 5 — seed batch

Use final settings.

8-16 seeds
short preview
same prompt
same image

Pick the seed by face permanence first.


13. Recommended presets

Preset A — safest source fidelity

Use when the face must stay the same.

Model:
  rapidWAN22I2VGGUF_q4KMRapidBase.gguf

VAE:
  wan_2.1_vae.safetensors

Text encoder:
  umt5-xxl-encoder-Q8_0.gguf

KSampler Advanced:
  add_noise: enable
  steps: 10
  cfg: 1.0
  sampler_name: sa_solver
  scheduler: beta
  start_at_step: 1
  end_at_step: 10000
  return_with_leftover_noise: enable

Use for:

portraits
faces
low-res screengrabs
source-quality preservation
subtle motion

Preset B — slightly more obedient

Same as Preset A, except:

cfg: 1.15

Then test:

cfg: 1.25

Stop if the face changes.

Preset C — stronger motion test

Same as Preset A, except:

start_at_step: 0
cfg: 1.15

If the face changes, return to:

start_at_step: 1

Preset D — smoothness test

Same as Preset A, except:

steps: 12

If it adds bloom or airbrushing, return to:

steps: 10

Preset E — faster seed scouting

Same as Preset A, except:

steps: 8
shorter frame count
lower test resolution

Use this only for finding seeds quickly, then rerun good seeds at normal settings.


14. Prompt strategy: one action only

This workflow needs simple prompts.

Bad prompt:

The person turns their head, smiles, raises their hand, looks into the camera, hair moves in the wind, camera slowly zooms in, cinematic lighting.

Why this is bad:

too many actions
requires new expression
requires new pose
requires new hair behavior
requires camera motion
invites lighting changes
increases face drift

Better prompt:

The same person from the source image gently blinks once. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No scene change.

Best rule:

one generation = one small action

Safe actions:

one subtle blink
gentle breathing
tiny natural smile
slight eye movement
very small head tilt

Risky actions:

speaking
laughing widely
turning head far
walking
dancing
raising hands
hair blowing strongly
camera zoom
camera orbit
lighting change

For this workflow, obedience improves when the requested action is simple enough that the model does not need to repaint the person.


15. Positive prompt templates

Since negative prompts are weak at CFG 1, put preservation constraints in the positive prompt.

Safe source-faithful template

The same person from the source image gently blinks once. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No pan. No scene change. Natural subtle motion. Sharp face.

Slightly more expressive template

The same person from the source image makes a tiny natural smile while gently breathing. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No scene change.

Minimal template

Same person, same face, same identity, same lighting and background. One subtle blink. Static camera.

Face permanence template

The same person keeps the exact same face and identity throughout the video. Only subtle natural breathing and one small blink. Same hairstyle, clothing, lighting, colors, camera angle, and background. Static camera.

The repetition of “same face” and “same identity” is not elegant, but it is useful conditioning.


16. Negative prompt template

Keep it short.

different person, face change, identity change, warped face, distorted eyes, changing hairstyle, changing clothes, changing background, camera movement, zoom, scene change, blurry face

Optional additions:

extra teeth, melted face, asymmetrical eyes, over-smoothed skin, airbrushed, bloom

Do not spend all your effort on negative prompting. At CFG 1, it may do very little. At CFG 1.15-1.35, it may help slightly, but positive prompt structure and seed selection matter more.

Reference: Wan prompting guide on CFG 1 / negative prompts


17. Handling complex prompts

The model refuses or hallucinates complex prompts because they ask for too many inventions at once.

A complex prompt often includes:

subject action
facial expression
body motion
camera motion
lighting change
background interpretation
style direction

That is too much for a source-faithful RapidBase workflow.

Instead of:

She turns to the camera, smiles, raises her hand, and the camera slowly zooms in.

Use separate clips:

Clip 1:
  same person gently blinks once

Clip 2:
  same person makes a tiny natural smile

Clip 3:
  same person slightly raises one hand, only if the hand is already visible

Do not ask for a hand raise if the hand is not clearly visible in the source image. If the model must invent a hand, it may also invent a new body or face.


18. Face permanence rules

Face permanence is mostly controlled by:

source image clarity
motion size
CFG
start_at_step
seed
prompt complexity
frame count
camera motion

Do:

use clear face images
keep motion small
use static camera
use one action only
keep CFG low
batch seeds
choose face permanence first

Avoid:

large head turns
speaking
wide smiles
looking away then back
hands crossing the face
camera movement
dramatic emotion
lighting changes
long clips before seed selection

The model is most likely to morph the face when asked for mouth/teeth motion, big expression changes, or head rotation. Blinks and breathing are much safer.


19. Should you add nodes?

Main recommendation:

Add almost nothing.

Your current workflow’s value is that it does not repaint too much. Extra nodes can easily destroy that.

Avoid adding during optimization:

face restore
style LoRAs
multiple LoRAs
high-strength LoRAs
upscalers before judging motion
interpolation before judging motion
color correction before judging model behavior

Upscale/interpolation should happen only after you choose:

prompt
seed
settings
motion
face permanence

20. Optional node: NAG

NAG is the one optional control idea that fits the problem.

Why it may help:

the model runs near CFG 1
negative prompts are weak
raising CFG can morph the face
NAG may add negative-prompt-like control without pushing CFG too hard

The ComfyUI-NAG README says NAG restores effective negative prompting in few-step diffusion models and can complement CFG. The NAG project page similarly describes NAG as a method for restoring negative prompting in few-step sampling.

How to test:

copy the workflow
add NAG only in the copy
keep CFG low
use the same seed and prompt
compare against the saved control

Remove it if it causes:

bloom
airbrushing
texture changes
face drift
loss of source quality

Do not make NAG part of the main workflow until it beats the control.


21. LoRAs: only one, only low strength

The Phr00t Rapid/AIO model card notes Wan 2.1 LoRA compatibility and low-noise Wan 2.2 LoRA compatibility, but warns against high-noise Wan 2.2 LoRAs for that family. See Phr00t WAN2.2 Rapid All-in-One.

If testing LoRAs:

one LoRA only
strength 0.15
strength 0.25
strength 0.35

Avoid:

1.0 strength
multiple LoRAs
style LoRAs
high-noise Wan2.2 LoRAs
character LoRAs unless necessary

For this workflow, LoRAs are more likely to hurt source fidelity than help, unless very targeted.


22. Free prompt restructuring resources

Do not run Ollama or a local LLM on the same GPU while using ComfyUI. On an 8GB card, that competes directly with Wan.

Use web tools or CPU-only local tools.

Free web options

Good enough:

ChatGPT Free
Google AI Studio / Gemini

Use one batched request rather than many small requests.


23. Prompt rewriter request template

Paste this into ChatGPT, Gemini, or a local helper.

Rewrite this as a short Wan2.2 image-to-video prompt for a low-VRAM RapidBase workflow.

Rules:
- one small action only
- preserve exact face and identity
- preserve hairstyle, clothing, lighting, colors, camera angle, and background
- static camera
- no zoom
- no pan
- no scene change
- avoid cinematic embellishment
- avoid new details not visible in the source image
- keep it literal and short
- output exactly 3 versions:
  1. safest source-faithful version
  2. slightly more expressive version
  3. shortest version

Original idea:
<put idea here>

Replace <put idea here> with your rough idea before sending the request; keep the rest of the template unchanged.

This is better than asking “make the prompt better,” because “better” usually means more cinematic, more detailed, and more inventive — exactly what you do not want.


24. CPU-only local prompt helper

A local helper is optional.

Goal:

rewrite prompts
do not use GPU VRAM
do not compete with ComfyUI

A good tiny local option is LFM2.5-1.2B-Instruct-GGUF. LiquidAI’s docs explain that LFM models are available in GGUF format for llama.cpp-style use: LiquidAI llama.cpp deployment guide.

Example CPU-only server command:

llama-server \
  -hf LiquidAI/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M \
  -c 2048 \
  -ngl 0 \
  --host 127.0.0.1 \
  --port 8080

Important part:

-ngl 0

The llama.cpp server README documents the GPU-layer offload option (-ngl / --n-gpu-layers). Setting it to zero keeps every layer on the CPU, so the prompt helper never competes with ComfyUI for VRAM. See llama.cpp server README.

Recommended order:

1. ChatGPT Free or Gemini
2. LFM2.5-1.2B Q4_K_M CPU-only
3. Qwen 2B-4B CPU-only if you want smarter rewriting
4. larger local models only if you have spare CPU/RAM

25. Prompt helper system prompt

Use this as the system prompt in ChatGPT, Gemini, LFM, Qwen, or any prompt helper.

You are a prompt rewriting assistant for Wan2.2 image-to-video.

Rewrite the user's idea into a short, literal, source-faithful I2V prompt.

Rules:
- Use one small action only.
- Preserve the exact same face and identity.
- Preserve hairstyle, clothing, lighting, colors, camera angle, and background.
- Keep the camera static.
- No zoom.
- No pan.
- No scene change.
- No cinematic embellishment.
- No new objects.
- Avoid talking, dancing, walking, large head turns, and large expression changes.
- Prefer subtle motion: blink, gentle breathing, tiny smile, very small eye movement.

Output exactly:
1. Safest:
2. Slightly more expressive:
3. Shortest:

Do not explain.

Then give it:

Rewrite this idea for Wan2.2 I2V:

<your idea>

Example input:

make her look at the camera and smile a bit, maybe some hair movement

Expected output style:

1. Safest:
The same person from the source image gently blinks once and makes a tiny natural smile. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No scene change.

2. Slightly more expressive:
The same person from the source image looks naturally toward the camera and makes a very small smile. Preserve the same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Only subtle natural motion. Static camera.

3. Shortest:
Same person, same face and identity. One subtle blink and tiny smile. Static camera. Same lighting and background.
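
If you later run the CPU-only llama-server from section 24, recent llama.cpp builds expose an OpenAI-compatible /v1/chat/completions endpoint, so the system prompt above can be driven from a short script instead of a chat window. A rough sketch, assuming the server is running on 127.0.0.1:8080 as shown earlier and that the requests package is installed (the abbreviated system prompt in the script should be replaced with the full version above):

# rewrite_prompt.py - send an idea to the local llama-server and print the rewrites
import requests

SYSTEM_PROMPT = (
    "You are a prompt rewriting assistant for Wan2.2 image-to-video. "
    "Rewrite the user's idea into a short, literal, source-faithful I2V prompt. "
    "Use one small action only. Preserve the exact same face, identity, hairstyle, "
    "clothing, lighting, colors, camera angle, and background. Keep the camera static. "
    "Output exactly: 1. Safest: 2. Slightly more expressive: 3. Shortest: Do not explain."
)

idea = "make her look at the camera and smile a bit, maybe some hair movement"

response = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Rewrite this idea for Wan2.2 I2V:\n\n{idea}"},
        ],
        "temperature": 0.3,   # keep the rewrites literal rather than creative
        "max_tokens": 300,
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])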

26. What I would do next

  1. Keep rapidWAN22I2VGGUF_q4KMRapidBase.gguf as the main branch.
  2. Save the current workflow as the control.
  3. Test CFG 1.00 / 1.15 / 1.25 / 1.35 / 1.50.
  4. Test start_at_step: 1 vs 0.
  5. Test steps: 8 / 10 / 12.
  6. Test return_with_leftover_noise: enable vs disable once.
  7. Use seed batches after choosing settings.
  8. Use one-action prompts.
  9. Put preservation constraints in the positive prompt.
  10. Try NAG only in a duplicate workflow if negative prompting remains weak.
  11. Use ChatGPT/Gemini or CPU-only LFM2.5 for prompt rewriting, not a GPU LLM inside ComfyUI.

Short summary

  • Keep rapidWAN22I2VGGUF_q4KMRapidBase.gguf; it matches the source-fidelity goal.
  • Keep sa_solver / beta as the main branch.
  • Do not chase CFG 3+.
  • Test CFG only in a tiny range: 1.00 / 1.15 / 1.25 / 1.35 / 1.50.
  • Test start_at_step: 1 versus 0.
  • Test steps: 8 / 10 / 12.
  • Use seed batches; face permanence is seed-sensitive.
  • At CFG 1, negative prompts are weak. Put identity/background/camera constraints in the positive prompt.
  • Use one small action per prompt.
  • Add almost nothing to the workflow. NAG is the only optional control node worth testing, and only in a copy.
  • For prompt rewriting, use ChatGPT Free, Gemini/AI Studio, or a CPU-only tiny model like LFM2.5.

thanks man, i been experimenting little by little as you said. i achieved close to perfect face permanence and obedience and nsfw without loras. its still trial and error and things start to break when i push with too many specific requests and details in one scene, its a balancing act but im getting there little by little. question: is there any way to formulate the prompts that would imply: do action A, then after action A is done do action B? does it understand some kind of sequential instructions like these? also, what are your thoughts on (((weighted prompts:1.9))) for this? any effect whatsoever?

is there any way to formulate the prompts that would imply: do action A, then after action A is done do action B?

There seem to be several methods available, but some of them are difficult to use in an 8GB VRAM environment:


Wan2.2 RapidBase I2V: sequential actions, prompt weights, and continuity-safe A → B workflows

Short answer:

Yes, you can write prompts like “do action A, then after A is done, do action B.” The model can understand that language. But in a normal single-prompt I2V workflow, that instruction is usually a soft temporal suggestion, not a reliable frame-accurate command.

For the current RapidBase workflow, the safest ranking is:

1. One-clip two-beat prompt
2. Two clips with a handoff frame
3. Neutral overlap + short crossfade
4. FLF2V bridge clip
5. Prompt Relay
6. Prompt Schedule / FizzNodes

For prompt weights:

Avoid:
  (((action:1.9)))

Prefer:
  (same face and identity:1.10)
  (preserve exact face:1.10)
  (static camera:1.10)
  (tiny natural smile:1.05)

The key rule is:

Weight preservation more than action.

The current workflow is working because it preserves the source image. Anything that pushes too hard toward complex action can also push the model into repainting, hallucination, or face drift.


1. Does the model understand “A, then B”?

It can understand the wording, but it does not necessarily execute it as an exact timeline.

A prompt like this is understandable:

The same person first blinks once, then after a brief pause makes a tiny natural smile.

But in a normal I2V generation, the text prompt conditions the whole clip. It is not automatically split into exact frame ranges like:

frames 0-16:
  action A

frames 17-33:
  action B

So the model may interpret “first A, then B” loosely.

Possible outcomes:

Prompt                       Possible model behavior
blink once, then smile       blink and smile happen in the right order
blink once, then smile       smile starts before the blink finishes
blink once, then smile       only the smile happens
look down, then look back    gaze drifts vaguely instead of following exact order
A then B then C              one action is skipped or the face starts drifting

This is normal for a single-prompt video model. The model sees the whole instruction, but it is not a strict animation timeline unless you use timeline-control tools.


2. Best first method: one-clip two-beat prompting

For the current RapidBase workflow, this is the best first method.

It does not add nodes, LoRAs, bridge models, scheduling tools, or extra VRAM load. It also protects the main thing the current setup is good at:

same source image
same face
same lighting
same texture
same background
low hallucination

The limitation is that A → B order is only approximate.

Good use cases

Use one-clip two-beat prompts for small actions:

blink once -> tiny smile
gentle breathing -> blink once
look slightly downward -> return eyes to camera
tiny smile -> neutral expression
eyes shift slightly left -> eyes return to camera
neutral expression -> tiny smile

Bad use cases

Avoid large or multi-stage sequences:

turn head -> talk -> raise hand
walk forward -> gesture -> camera zooms in
look away -> laugh -> turn back
large smile -> speaking -> hair blowing
pose change -> lighting change -> background reaction

Each extra action increases the chance of:

face drift
changed mouth shape
changed eye shape
new lighting
new camera angle
background mutation
AI-looking repainting

3. Good wording for “after A is done, do B”

Use completion language, not just a loose list.

Weak:

blink and smile

Better:

first blinks once, then after the blink is complete, slowly forms a tiny natural smile

Good sequencing phrases:

first <action A>, then after a brief pause <action B>
after <action A> is complete, <action B>
begins still, then <action A>, then settles into <action B>
first holds a neutral expression, then gradually <action B>
after returning to neutral, <action B>

Avoid vague or overloaded phrasing:

blink and smile naturally
perform a sequence of expressions
react emotionally
do a cute expression
move seductively
act naturally

Vague words invite the model to improvise. Improvisation is where identity drift usually starts.


4. Practical one-clip formula

Use this structure:

[identity lock] + [starting state] + [action A] + [pause/settle] + [action B] + [camera lock] + [scene lock]

Example:

The same person from the source image keeps the exact same face and identity. The video begins with a calm neutral expression. First, the person gently blinks once. After the blink is complete, the person slowly forms a tiny natural smile. Preserve the same hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No pan. No scene change.

This is better than:

she blinks then smiles

because it tells the model:

who must remain the same
what state to start from
what action comes first
what happens after
what must not change
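
The structure is mechanical enough that you can assemble it with a throwaway helper and only ever change the two actions. A minimal sketch, with the lock strings taken from the templates in this guide (the function and file names are made up):

# two_beat_prompt.py - assemble [identity lock] + [start] + [A] + [pause] + [B] + [locks]
IDENTITY_LOCK = ("The same person from the source image keeps the exact same face "
                 "and identity.")
SCENE_LOCK = ("Preserve the same hairstyle, clothing, lighting, colors, camera angle, "
              "and background. Static camera. No zoom. No pan. No scene change.")

def two_beat_prompt(action_a, action_b, starting_state="a calm neutral expression"):
    return " ".join([
        IDENTITY_LOCK,
        f"The video begins with {starting_state}.",
        f"First, the person {action_a}.",
        f"After that is complete, the person {action_b}.",
        SCENE_LOCK,
    ])

print(two_beat_prompt("gently blinks once", "slowly forms a tiny natural smile"))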

5. One-clip two-beat prompt templates

Safest A → B template

The same person from the source image first blinks once, then after a brief pause makes a tiny natural smile. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No pan. No scene change. Subtle natural motion only.

More explicit timing template

The video begins with the same person holding still. First, the person gently blinks once. After the blink is complete, the person slowly forms a tiny natural smile. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No scene change.

Face-first template

Preserve the exact same face and identity throughout the video. The same person first blinks once, then after a brief pause makes a tiny natural smile. Same hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No scene change.

Short template

Same person, same face and identity. First one subtle blink, then a tiny natural smile. Static camera. Same lighting and background.

Very safe template

The same person keeps the exact same face and identity throughout the video. First, one small blink. Then, a tiny natural smile. Same hairstyle, clothing, lighting, colors, camera angle, and background. Static camera.

6. When a single prompt is not enough

If exact order matters, use two clips.

Do not do this:

Clip 1:
  original source image -> action A

Clip 2:
  original source image -> action B

That creates two independent clips from the same original starting point. Clip 2 does not know where Clip 1 ended.

Better:

Clip 1:
  original source image -> action A only

Handoff frame:
  clean stable frame near the end of Clip 1

Clip 2:
  handoff frame -> action B only

This is the most practical way to get reliable A → B ordering without adding complex nodes.


7. Handoff-frame workflow

Process

1. Generate Clip 1 with action A only.
2. Inspect the last 3-10 frames (see the extraction sketch after this list).
3. Do not blindly use the final frame.
4. Pick the cleanest stable frame:
   - best face
   - least blur
   - stable lighting
   - stable background
   - expression suitable for the next action
5. Save that frame as PNG.
6. Use it as the source image for Clip 2.
7. Prompt Clip 2 for action B only.
8. Keep settings consistent:
   - same resolution
   - same FPS
   - same VAE
   - same text encoder
   - same sampler
   - same scheduler
   - same CFG
   - same steps
   - same prompt style
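
A quick way to get those candidate frames is to ask ffmpeg for the tail of Clip 1. A minimal sketch using Python's subprocess (file names are placeholders; -sseof seeks relative to the end of the input, so this dumps roughly the last half second as PNGs):

# dump_handoff_frames.py - export the last ~0.5 s of Clip 1 as PNGs for inspection
import subprocess
from pathlib import Path

CLIP = "clip1.mp4"                # placeholder name for the Clip 1 render
OUT_DIR = Path("handoff_frames")
OUT_DIR.mkdir(exist_ok=True)

subprocess.run(
    [
        "ffmpeg", "-y",
        "-sseof", "-0.5",          # start half a second before the end of the clip
        "-i", CLIP,
        str(OUT_DIR / "frame_%03d.png"),
    ],
    check=True,
)
# pick the cleanest, most neutral frame from handoff_frames/ and use it as the
# source image for Clip 2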

Clip 1 example

The same person gently blinks once, then returns to a calm neutral expression. Preserve the same face, identity, lighting, clothing, colors, camera angle, and background. Static camera.

Clip 2 example

The same person begins from a calm neutral expression, then slowly forms a tiny natural smile. Preserve the same face, identity, lighting, clothing, colors, camera angle, and background. Static camera.

This is more reliable than trying to force a complex sequence into one prompt.


8. Neutral overlap and crossfade

If using two clips, make the join happen during a neutral moment.

Bad join:

Clip 1 ends during a blink.
Clip 2 starts with a smile.

Better join:

Clip 1 ends after returning to neutral.
Clip 2 starts from the neutral handoff frame.

If the join is slightly visible, use a short crossfade.

Typical overlap:

4-8 frames
same FPS
same resolution
same color settings
same encoding settings

FFmpeg example:

ffmpeg \
  -i clip1.mp4 \
  -i clip2.mp4 \
  -filter_complex "xfade=transition=fade:duration=0.25:offset=2.75" \
  -c:v libx264 -crf 18 -preset slow \
  output.mp4

offset is the point in Clip 1 (in seconds) where the fade begins, so it must match the length of Clip 1 and the desired transition point; the sketch below computes it from the clip duration.
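
A minimal sketch for that calculation, reading Clip 1's duration with ffprobe and subtracting the fade length (file names are placeholders; ffmpeg and ffprobe must be on PATH):

# crossfade.py - compute the xfade offset from Clip 1's duration, then run ffmpeg
import subprocess

CLIP1, CLIP2, OUT = "clip1.mp4", "clip2.mp4", "output.mp4"
FADE = 0.25  # crossfade length in seconds

# ffprobe prints the container duration of Clip 1 in seconds
duration = float(subprocess.check_output([
    "ffprobe", "-v", "error",
    "-show_entries", "format=duration",
    "-of", "default=noprint_wrappers=1:nokey=1",
    CLIP1,
]).strip())

offset = duration - FADE  # start the fade so it ends exactly at the end of Clip 1

subprocess.run([
    "ffmpeg", "-y",
    "-i", CLIP1, "-i", CLIP2,
    "-filter_complex", f"xfade=transition=fade:duration={FADE}:offset={offset:.2f}",
    "-c:v", "libx264", "-crf", "18", "-preset", "slow",
    OUT,
], check=True)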

Important:

A crossfade can hide a small seam. It cannot fix a true face, lighting, or background mismatch.

If the face is different between the clips, a crossfade may create ghosting or a double-face dissolve.


9. FLF2V bridge clip

FLF2V means First-Last Frame to Video.

Instead of simply crossfading Clip A into Clip B, you provide:

first frame = stable end frame of Clip A
last frame  = stable start frame of Clip B

Then the model generates the transition between them.

Concept:

Clip A:
  source -> action A

Clip B:
  handoff/source -> action B

Bridge:
  first frame = stable end frame of Clip A
  last frame  = stable start frame of Clip B
  prompt      = smooth subtle transition, same face, same lighting, static camera

Why it can help:

more natural transition than crossfade
can reduce a sudden jump between two clips
uses actual visual endpoints

Why it may not be ideal for the current 8GB RapidBase workflow:

separate workflow family
may be heavier
may not preserve the same RapidBase look
may introduce bloom or airbrushed style
may require more setup and testing

Use FLF2V only if:

Clip A is good.
Clip B is good.
The join is visibly bad.
A simple crossfade is not good enough.
The bridge can be short.


10. Prompt Relay

Prompt Relay is closer to the real solution for “A happens in one segment, B happens in another segment.”

Instead of relying on a single prompt, Prompt Relay routes different prompts through different temporal segments.

Concept:

Global prompt:
  same person, same face, same identity, same lighting, same background, static camera

Segment 1:
  blink once

Segment 2:
  tiny natural smile

Why it is attractive:

A and B happen inside one timeline
less independent-clip continuity drift
global identity/camera constraints can stay active
different segments can receive different action prompts

Why it should be treated carefully:

changes the workflow structure
may not plug cleanly into the current RapidBase GGUF workflow
may increase complexity
8GB behavior is uncertain
could break the source-fidelity look

Do not add it to the working workflow directly. Test only in a duplicate workflow.


11. Prompt Schedule / FizzNodes

Prompt scheduling is the general concept of changing prompt conditioning over time.

Concept:

Frames 0-16:
  same person gently blinks once

Frames 17-33:
  same person slowly forms a tiny smile

Why it may help:

more explicit temporal control
frame/segment-based prompt changes
better than hoping a single prompt follows order

Why it is not the first recommendation here:

not guaranteed to fit the current RapidBase GGUF workflow
can change conditioning behavior
may break the current source-fidelity look
adds complexity


12. Recommended method ranking

1. One-clip two-beat prompt
   A → B reliability: medium | source fidelity: high | 8GB friendliness: high
   Recommendation: try first

2. Two clips + handoff frame
   A → B reliability: high | source fidelity: medium-high | 8GB friendliness: high
   Recommendation: best practical method

3. Two clips + neutral crossfade
   A → B reliability: medium-high | source fidelity: medium-high | 8GB friendliness: high
   Recommendation: good polish

4. FLF2V bridge
   A → B reliability: high for the transition itself | source fidelity: medium | 8GB friendliness: medium-low
   Recommendation: separate experiment

5. Prompt Relay
   A → B reliability: high conceptually | source fidelity: unknown | 8GB friendliness: unknown
   Recommendation: advanced experiment

6. Prompt Schedule / FizzNodes
   A → B reliability: medium-high conceptually | source fidelity: unknown | 8GB friendliness: medium
   Recommendation: experimental

Best practical rule:

simple A -> B:
  use one-clip two-beat prompt

strict A -> B:
  use two clips with a handoff frame

smooth transition:
  use handoff frame + optional crossfade

true timeline control:
  test Prompt Relay or Prompt Schedule only in a duplicate workflow

13. Prompt weights: do they work?

Yes, ComfyUI prompt weights can work.

Common syntax:

(phrase:1.2)

Plain parentheses also increase weight. ComfyUI’s CLIPTextEncode documentation says plain parentheses apply a default weight of 1.1, and the ComfyUI Community Manual says nested weights multiply.

Examples:

(phrase)
  roughly increases emphasis

(phrase:1.2)
  explicit weight

((phrase:1.2):0.5)
  nested weights multiply, so the effective weight here is 1.2 × 0.5 = 0.6


14. Is (((weighted prompts:1.9))) useful here?

Probably not. For this workflow, it is more likely to hurt than help.

Avoid:

(((turns head and smiles:1.9)))

Avoid:

(((first blinks then smiles:1.9)))

Avoid:

(((action A then action B:1.9)))

Why? Because a huge action weight tells the model:

This action matters more than preserving the source image.

That can cause:

face drift
changed facial geometry
changed skin texture
changed lighting
hallucinated details
mouth/teeth weirdness
background changes
overcooked motion
loss of source fidelity

The current RapidBase workflow works because it is conservative. Heavy action weights fight that.


15. Better prompt-weight strategy

Do not heavily weight the action. Lightly weight preservation.

Better:

The same person first blinks once, then after a brief pause makes a tiny natural smile. (Preserve the exact same face and identity:1.15). Same hairstyle, clothing, lighting, colors, camera angle, and background. (Static camera:1.10). No zoom. No scene change.

Risky:

The same person (((first blinks then smiles:1.9))). Same face and background.

The first prompt says:

identity and camera are important
action is small
do not repaint

The second prompt says:

force this action even if the model has to invent

For face permanence, that is the wrong priority.


16. Suggested weight ranges

Weight   Use
1.00     normal baseline
1.05     tiny emphasis
1.10     safe emphasis
1.15     useful emphasis for identity / static camera
1.20     upper normal test
1.25     mild stress test
1.35     risky; use sparingly
1.50+    likely too strong
1.90     avoid for source-faithful I2V

For this workflow, use mostly:

1.05-1.20

Maybe test:

1.25

Avoid:

1.50+
1.90
triple-parentheses action forcing
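
If you want a quick sanity check before queueing a render, a tiny "weight lint" can flag anything above these ranges. This is only a regex sketch over the ComfyUI (phrase:weight) syntax; it ignores the extra multipliers from nested plain parentheses, and the threshold is simply the upper end of the ranges above:

# weight_lint.py - warn about prompt weights above the suggested range
import re

MAX_SAFE_WEIGHT = 1.25

def lint_weights(prompt):
    warnings = []
    for phrase, weight in re.findall(r"\(([^():]+):([0-9.]+)\)", prompt):
        if float(weight) > MAX_SAFE_WEIGHT:
            warnings.append(f"'{phrase.strip()}' is weighted {weight}; "
                            f"consider {MAX_SAFE_WEIGHT} or less")
    return warnings

prompt = ("The same person (((first blinks then smiles:1.9))). "
          "(Preserve the exact same face and identity:1.15). (Static camera:1.10).")

for warning in lint_weights(prompt):
    print(warning)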

17. What to weight

Good things to weight

(same face and identity:1.10)
(preserve exact face:1.10)
(preserve source image:1.10)
(static camera:1.10)
(no scene change:1.05)
(same lighting and background:1.10)
(tiny natural smile:1.05)
(one subtle blink:1.05)

Risky things to weight

(turns head:1.4)
(speaks:1.4)
(laughs widely:1.4)
(raises hand:1.4)
(hair blowing:1.4)
(camera zooms in:1.4)

Very risky

(((wide smile:1.9)))
(((speaking:1.9)))
(((turning head:1.9)))
(((complex action sequence:1.9)))

Large facial motion and mouth motion are exactly where face permanence usually breaks.


18. Weighted A → B examples

Safe weighted A → B prompt

The same person from the source image first blinks once, then after a brief pause makes a tiny natural smile. (Preserve the exact same face and identity:1.15). Same hairstyle, clothing, lighting, colors, camera angle, and background. (Static camera:1.10). No zoom. No pan. No scene change.

Slightly stronger action prompt

The same person begins still, then (gently blinks once:1.05), then slowly forms a (tiny natural smile:1.10). Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No scene change.

Face-first prompt

(Preserve the exact same face and identity:1.15). The same person first blinks once, then slowly forms a tiny natural smile. Same hairstyle, clothing, lighting, colors, camera angle, and background. (Static camera:1.10). No zoom. No scene change.

Minimal weighted prompt

(Same face and identity:1.15). First one subtle blink, then a tiny smile. Same lighting and background. (Static camera:1.10).

19. What not to do

Avoid this:

(((The person first blinks, then smiles, then turns their head, then speaks:1.9)))

That stacks three problems:

too many actions
too much weight
weighting the part that causes identity drift

Also avoid:

First she blinks, then smiles, then speaks, then turns her head, while the camera zooms in and the lighting becomes cinematic.

That asks the model to solve:

facial motion
mouth motion
head rotation
camera motion
lighting change
identity preservation
background stability

That is too much for a source-faithful RapidBase clip.


20. Practical A → B workflow

Step 1 — choose the smallest version of the action

Instead of:

turns head and smiles

use:

tiny eye movement and tiny smile

Instead of:

speaks

use:

subtle mouth movement

Instead of:

laughs

use:

tiny natural smile

Step 2 — write a two-beat prompt

The same person first <action A>, then after a brief pause <action B>. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No scene change.

Step 3 — add light preservation weights

(Preserve the exact same face and identity:1.15)
(Static camera:1.10)

Step 4 — batch seeds

8-16 seeds
same prompt
same settings
short preview

Pick by:

1. face permanence
2. correct order
3. natural motion
4. prompt obedience

Step 5 — split into two clips if order fails

If the model keeps blending A and B, use:

Clip 1:
  action A only

Handoff:
  clean stable frame near the end of Clip 1

Clip 2:
  action B only

21. Best practical recommendation

For the current RapidBase workflow:

Use one-clip two-beat prompts first.
Use "first A, then after a brief pause B."
Keep A and B very small.
Batch seeds.
Weight preservation, not action.
Avoid 1.9 weights.
Use handoff-frame two-clip generation when strict order matters.
Only test Prompt Relay / scheduling in a duplicate workflow.

22. Example final prompts

Blink → smile

The same person from the source image first blinks once, then after a brief pause makes a tiny natural smile. (Preserve the exact same face and identity:1.15). Same hairstyle, clothing, lighting, colors, camera angle, and background. (Static camera:1.10). No zoom. No pan. No scene change.

Look down → return gaze

The same person from the source image first looks slightly downward with only a tiny eye movement, then returns the eyes to the camera. (Preserve the exact same face and identity:1.15). Same hairstyle, clothing, lighting, colors, camera angle, and background. (Static camera:1.10). No zoom. No scene change.

Neutral → tiny smile

The video begins with the same person holding a calm neutral expression. Then the person slowly forms a tiny natural smile. (Preserve the exact same face and identity:1.15). Same hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No scene change.

Breathing → blink

The same person from the source image keeps the exact same face and identity. The person gently breathes with subtle natural motion, then blinks once after a brief pause. Same hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No scene change.
