Wan2.2 i2v (clarifications needed regarding settings on low vram system)

I’m glad the correct answer was included. :laughing:

Hmm… I think I’ve got a pretty good grasp of the situation now. By the way, distilled models like Lightning tend to struggle with accurately reflecting prompt details, especially negative prompts, though there is still room for improvement there. Their responsiveness to positive prompts is actually quite good. If you need highly complex prompt handling, it’s worth considering other variants (if any exist).

Distilled models are often created by retraining a model after drastically pruning it, and the distilled version may have lost exactly the parts your specific use case needs. Well, I guess it can’t be helped if the goal is to save VRAM… but in any case, this means you also have to account for the performance of the model itself, or rather, the inherent characteristics of the distilled model.

By the way, if you want to use an LLM for prompt refinement, I think using the Gemini or ChatGPT API is the easiest way, but if you want to do it entirely locally, an OSS LLM might be better. For this purpose, a smaller model from a high-quality OSS model family is perfectly sufficient. The models provided by Liquid (which include 1.2B and even 350M variants) run just fine locally on a CPU. Other strong open models in the 4B class or smaller, such as Qwen3 and Gemma 3, can also run on a CPU alone. A 4B model is a bit heavy for a CPU, but at least these don’t consume VRAM: they run in RAM. Of course, they’d be much faster with VRAM!


Wan2.2 RapidBase I2V on 8GB VRAM: getting more prompt obedience without losing source-image fidelity

At this point I would stop chasing the normal High/Low UNet route for this GPU and use rapidWAN22I2VGGUF_q4KMRapidBase.gguf as the main workflow.

That is not a downgrade. For the actual goal here — make the source image look like it came to life while preserving the same face, same lighting, same color, same texture, same source quality, and no AI-looking bloom — this model is doing the right kind of thing. The normal High/Low route may be more flexible in theory, but on an 8GB card it is costing too much source fidelity.

The new goal should be:

Keep RapidBase.
Keep source-image fidelity.
Add only mild prompt pressure.
Reduce face morphing.
Avoid turning the workflow into a repainting/generative workflow.


1. Why RapidBase is the right baseline for this specific goal

The High/Low UNet experiment was still useful because it proved one thing: the duplicated SD3 shift setup really was causing artifacts. Removing those conflicting shift nodes fixed distortion and improved obedience/face permanence. But the second lesson is more important:

A technically cleaner High/Low workflow still did not give the desired look.

The preferred model, rapidWAN22I2VGGUF_q4KMRapidBase.gguf, behaves more like a source-preserving animator than a full generative video model. That is exactly why it works well for this use case.

It is good at:

keeping the source image quality
keeping low-res screengrabs looking like themselves
preserving lighting and colors
preserving background
avoiding the airbrushed Wan2.2 dream-sequence look
making the original picture move

It is weaker at:

complex multi-action prompts
large head turns
speaking / mouth motion
hand gestures
strong semantic obedience
large expression changes
camera moves

That tradeoff is expected. A workflow that preserves the source image 1:1 is not going to be as willing to invent new actions. More obedience usually requires more invention; more invention means more risk of face drift.

So the right strategy is not:

force the model to obey huge prompts

The right strategy is:

ask for one small action
add only mild prompt pressure
use seed batching
choose outputs by face permanence first

2. Current control setup

From the screenshot, the current effective workflow is roughly:

Model:
  rapidWAN22I2VGGUF_q4KMRapidBase.gguf

VAE:
  wan_2.1_vae.safetensors

Text encoder:
  umt5-xxl-encoder-Q8_0.gguf

KSampler Advanced:
  add_noise: enable
  steps: 10
  cfg: 1.0
  sampler_name: sa_solver
  scheduler: beta
  start_at_step: 1
  end_at_step: 10000
  return_with_leftover_noise: enable

Save this as the control workflow.

Do not overwrite it. Duplicate it before experiments.

Testing rule:

same image
same prompt
same seed
same frame count
same resolution
change one setting only

If you change CFG, steps, start step, sampler, and prompt at the same time, the result becomes impossible to interpret.
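
The one-setting-at-a-time rule can be sketched as a tiny helper. Everything here is illustrative: the dict mirrors the KSampler Advanced values above, and one_factor_runs is a hypothetical helper, not part of ComfyUI.

```python
# Control settings, copied from the KSampler Advanced node above.
CONTROL = {
    "steps": 10,
    "cfg": 1.0,
    "sampler_name": "sa_solver",
    "scheduler": "beta",
    "start_at_step": 1,
    "return_with_leftover_noise": "enable",
}

def one_factor_runs(control, setting, values):
    """Return one run per value, changing only `setting`; the control stays untouched."""
    runs = []
    for v in values:
        run = dict(control)  # shallow copy is enough for flat scalar settings
        run[setting] = v
        runs.append(run)
    return runs

# Example: the CFG micro-range from section 4, with every other setting fixed.
cfg_runs = one_factor_runs(CONTROL, "cfg", [1.00, 1.15, 1.25, 1.35, 1.50])
```

Each run then differs from the control in exactly one key, so any difference in the output can be attributed to that key.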


3. Why CFG should stay low

The Rapid/AIO family is explicitly described as a fast all-in-one merge designed around few steps and CFG 1. One README snapshot recommends:

4 steps
1 cfg
sa_solver sampler
beta scheduler

Source: Phr00t Rapid AIO README snapshot

That does not mean the exact best value for your workflow must be exactly 4 steps. Your screenshot already works at 10 steps. But it does mean this model should be tuned like a few-step distilled / rapid model, not like a normal 20-30 step diffusion workflow.

Do not jump to:

cfg: 3.0
cfg: 4.0
cfg: 5.0

That is likely to cause:

face drift
new skin texture
bloom
over-smoothing
changed lighting
new expression
hallucinated details

Use a micro-range instead.


4. CFG test range

Current baseline:

cfg: 1.0

Recommended test values:

1.00
1.15
1.25
1.35
1.50

Interpretation:

CFG    Expected behavior
1.00   maximum source fidelity, weakest negative-prompt effect
1.15   tiny prompt pressure
1.25   likely first useful obedience bump
1.35   upper mild test
1.50   stress test for face drift
2.00+  probably too much if face permanence matters

The likely useful zone is:

cfg: 1.15-1.35

Rule:

Use the highest CFG that does not change the face.

Test like this:

Run A:
  cfg: 1.00

Run B:
  cfg: 1.15

Run C:
  cfg: 1.25

Run D:
  cfg: 1.35

Run E:
  cfg: 1.50

Keep everything else identical.

Judge in this order:

1. same face / same identity
2. same source-image quality
3. no morphing
4. no artifacts
5. prompt obedience
6. natural motion

Prompt obedience is not the first priority. A clip that obeys perfectly but changes the face is a failed clip for this workflow.


5. Negative prompts are weak at CFG 1

A common trap is adding a giant negative prompt and expecting it to control the output. In many few-step Wan/Rapid/Lightning-style workflows, CFG 1 means negative prompts are weak or mostly inactive.

The Wan prompting guide explains this directly: in standard diffusion, CFG above 1 gives the model a stronger positive-vs-negative comparison, but in few-step CFG 1 workflows, negative prompts often do little. See How to get the most out of prompts for WAN models.

Practical consequence:

Do not rely on a huge negative prompt.
Put the important preservation rules in the positive prompt.

Positive prompt should explicitly say:

same face
same identity
same hairstyle
same clothing
same lighting
same colors
same camera angle
same background
static camera
no zoom
no scene change
only subtle motion

A short negative prompt is still fine, but it is secondary.
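
Since the preservation clause repeats in every positive prompt, it can help to keep it as a constant and prepend only the action. A minimal sketch; the constant and function names are just illustrative:

```python
# Standing preservation clause: edit once, reuse in every prompt.
PRESERVE = (
    "Preserve the exact same face, identity, hairstyle, clothing, "
    "lighting, colors, camera angle, and background. "
    "Static camera. No zoom. No scene change."
)

def build_positive_prompt(action):
    """Combine one small action with the standing preservation clause."""
    return f"The same person from the source image {action}. {PRESERVE}"

prompt = build_positive_prompt("gently blinks once")
```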


6. start_at_step: test 1 vs 0

Current screenshot:

start_at_step: 1

This may be helping source fidelity. Starting at step 1 can skip a tiny early part of the denoising path, which may reduce repainting.

Test only:

start_at_step: 1
start_at_step: 0

Expected tradeoff:

start_at_step: 1 keeps better source fidelity and face permanence, at the risk of weaker motion and weaker prompt response.
start_at_step: 0 gives more motion and prompt response, at the risk of more face drift and more repainting.

Suggested test:

Run A:
  cfg: 1.25
  start_at_step: 1
  steps: 10

Run B:
  cfg: 1.25
  start_at_step: 0
  steps: 10

Possible decisions:

If 0 improves obedience and the face stays stable, keep start_at_step: 0.
If 0 gives more motion but the face changes, keep start_at_step: 1.
If there is no meaningful difference, keep start_at_step: 1.
If 0 adds bloom/repainting, keep start_at_step: 1.

My expectation: start_at_step: 1 may remain the safest default.


7. Steps: test 8 / 10 / 12

Current setting:

steps: 10

This may already be close to the sweet spot.

Few-step distilled models do not always improve with more steps. Sometimes extra steps create more smoothing, blending, or repainting.

Test only:

steps: 8
steps: 10
steps: 12

Expected behavior:

Steps  Likely behavior
8      faster, possibly more source-faithful, possibly weaker obedience
10     current working baseline
12     may improve smoothness/obedience, but may add bloom or airbrushing
16+    not recommended for this model unless intentionally stress-testing

Suggested test:

Run A:
  steps: 8

Run B:
  steps: 10

Run C:
  steps: 12

Keep the best balance. If 12 adds the “dream sequence” look, go back to 10.


8. return_with_leftover_noise: test once

Current screenshot:

return_with_leftover_noise: enable
end_at_step: 10000

Since end_at_step is far beyond the actual step count, the sampler is probably completing its pass. This setting may not matter much, but test it once.

Run A:
  return_with_leftover_noise: enable

Run B:
  return_with_leftover_noise: disable

Keep whichever preserves the “picture came to life” look.

Do not spend a whole day on this. It is unlikely to be the main obedience or face-permanence control.


9. add_noise: keep enabled

Keep:

add_noise: enable

For image-to-video, the model needs noise to create motion. If you disable it, you may get a more frozen output or odd behavior depending on the rest of the graph.

Only test add_noise: disable if diagnosing a very specific problem:

every seed changes the face
motion is always too aggressive
the image is being repainted too much

Even then, treat it as a diagnostic test, not the likely final setting.


10. Sampler and scheduler: keep sa_solver / beta

Your current best branch uses:

sampler_name: sa_solver
scheduler: beta

Keep that as the main branch.

The Rapid/AIO README snapshot specifically recommends sa_solver and beta for that family. Source: Rapid AIO README snapshot.

If you want to test alternatives, do it only after the CFG/start/steps tests, and keep them as separate branches:

Branch A:
  sa_solver / beta

Branch B:
  euler / beta

Branch C:
  euler_a / beta

Branch D:
  euler / simple

Expected behavior:

Sampler / scheduler  Likely behavior
sa_solver / beta     best current source-fidelity branch
euler / beta         may obey differently, possibly less faithful
euler_a / beta       more variation/motion, higher face-drift risk
euler / simple       more relevant to Lightning/LightX2V-style workflows

I would not change sampler/scheduler unless the smaller tests fail.


11. Seed batching is now one of the strongest tools

You already noticed face morphing is seed-dependent. That is real.

In video generation, the seed affects:

eye behavior
mouth behavior
micro-expression
small head motion
whether face identity drifts
whether the source texture holds

Use two phases.

Phase A — setting tests

Use one fixed seed:

fixed seed
same image
same prompt
same resolution
same frame count
change one setting only

This tells you what the setting does.

Phase B — production seed search

After choosing settings, run:

8-16 seeds
same image
same prompt
same final settings
short preview first

Pick by this priority:

1. same face / same identity
2. same source-image quality
3. no morphing
4. natural motion
5. prompt obedience
6. no artifacts

For your goal, a seed that keeps the face and obeys 70% is better than a seed that obeys 100% and changes the person.
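
For the Phase B seed search, a reproducible seed list avoids re-rolling the same seeds between sessions. A small stdlib-only sketch; the function name is hypothetical:

```python
import random

def seed_batch(n=12, rng_seed=0):
    """Draw n distinct sampler seeds for a Phase B seed search.

    rng_seed fixes the scouting list, so the same batch can be
    regenerated later when re-running good seeds at full settings.
    """
    rng = random.Random(rng_seed)
    return rng.sample(range(2**32), n)  # distinct seeds, no repeats

seeds = seed_batch(12)
```

Run the short previews over this list, note which seeds keep the face, then rerun only those at the production frame count and resolution.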


12. Exact tuning plan

Matrix 0 — save control

Model:
  rapidWAN22I2VGGUF_q4KMRapidBase.gguf

VAE:
  wan_2.1_vae.safetensors

Text encoder:
  umt5-xxl-encoder-Q8_0.gguf

KSampler Advanced:
  add_noise: enable
  steps: 10
  cfg: 1.0
  sampler_name: sa_solver
  scheduler: beta
  start_at_step: 1
  end_at_step: 10000
  return_with_leftover_noise: enable

Save this output as the reference.

Matrix 1 — CFG

cfg: 1.00
cfg: 1.15
cfg: 1.25
cfg: 1.35
cfg: 1.50

Pick the highest CFG that does not alter identity.

Matrix 2 — start step

Use the best CFG.

start_at_step: 1
start_at_step: 0

Keep 1 unless 0 clearly improves obedience without face drift.

Matrix 3 — steps

Use best CFG and best start step.

steps: 8
steps: 10
steps: 12

Keep the one with the least bloom/airbrushing and best face permanence.

Matrix 4 — leftover noise

Use best CFG/start/steps.

return_with_leftover_noise: enable
return_with_leftover_noise: disable

Keep the more source-faithful result.

Matrix 5 — seed batch

Use final settings.

8-16 seeds
short preview
same prompt
same image

Pick the seed by face permanence first.


13. Recommended presets

Preset A — safest source fidelity

Use when the face must stay the same.

Model:
  rapidWAN22I2VGGUF_q4KMRapidBase.gguf

VAE:
  wan_2.1_vae.safetensors

Text encoder:
  umt5-xxl-encoder-Q8_0.gguf

KSampler Advanced:
  add_noise: enable
  steps: 10
  cfg: 1.0
  sampler_name: sa_solver
  scheduler: beta
  start_at_step: 1
  end_at_step: 10000
  return_with_leftover_noise: enable

Use for:

portraits
faces
low-res screengrabs
source-quality preservation
subtle motion

Preset B — slightly more obedient

Same as Preset A, except:

cfg: 1.15

Then test:

cfg: 1.25

Stop if the face changes.

Preset C — stronger motion test

Same as Preset A, except:

start_at_step: 0
cfg: 1.15

If the face changes, return to:

start_at_step: 1

Preset D — smoothness test

Same as Preset A, except:

steps: 12

If it adds bloom or airbrushing, return to:

steps: 10

Preset E — faster seed scouting

Same as Preset A, except:

steps: 8
shorter frame count
lower test resolution

Use this only for finding seeds quickly, then rerun good seeds at normal settings.
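
Presets B-E differ from Preset A by only one or two keys, so they can be generated from a single control dict instead of being maintained as five separate copies. A sketch; the preset names and the diff-patching idea are mine, not part of ComfyUI:

```python
# Preset A sampler values, copied from above; B-E are diffs against it.
PRESET_A = {
    "add_noise": "enable", "steps": 10, "cfg": 1.0,
    "sampler_name": "sa_solver", "scheduler": "beta",
    "start_at_step": 1, "end_at_step": 10000,
    "return_with_leftover_noise": "enable",
}

PRESET_DIFFS = {
    "B_slightly_more_obedient": {"cfg": 1.15},
    "C_stronger_motion": {"cfg": 1.15, "start_at_step": 0},
    "D_smoothness": {"steps": 12},
    "E_seed_scouting": {"steps": 8},
}

def make_preset(name):
    """Return Preset A patched with the named diff; Preset A itself stays untouched."""
    preset = dict(PRESET_A)
    preset.update(PRESET_DIFFS[name])
    return preset
```

Keeping the presets as diffs also makes the comparison explicit: whatever a preset does differently from the control is exactly what its diff says.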


14. Prompt strategy: one action only

This workflow needs simple prompts.

Bad prompt:

The person turns their head, smiles, raises their hand, looks into the camera, hair moves in the wind, camera slowly zooms in, cinematic lighting.

Why this is bad:

too many actions
requires new expression
requires new pose
requires new hair behavior
requires camera motion
invites lighting changes
increases face drift

Better prompt:

The same person from the source image gently blinks once. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No scene change.

Best rule:

one generation = one small action

Safe actions:

one subtle blink
gentle breathing
tiny natural smile
slight eye movement
very small head tilt

Risky actions:

speaking
laughing widely
turning head far
walking
dancing
raising hands
hair blowing strongly
camera zoom
camera orbit
lighting change

For this workflow, obedience improves when the requested action is simple enough that the model does not need to repaint the person.


15. Positive prompt templates

Since negative prompts are weak at CFG 1, put preservation constraints in the positive prompt.

Safe source-faithful template

The same person from the source image gently blinks once. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No pan. No scene change. Natural subtle motion. Sharp face.

Slightly more expressive template

The same person from the source image makes a tiny natural smile while gently breathing. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No scene change.

Minimal template

Same person, same face, same identity, same lighting and background. One subtle blink. Static camera.

Face permanence template

The same person keeps the exact same face and identity throughout the video. Only subtle natural breathing and one small blink. Same hairstyle, clothing, lighting, colors, camera angle, and background. Static camera.

The repetition of “same face” and “same identity” is not elegant, but it is useful conditioning.


16. Negative prompt template

Keep it short.

different person, face change, identity change, warped face, distorted eyes, changing hairstyle, changing clothes, changing background, camera movement, zoom, scene change, blurry face

Optional additions:

extra teeth, melted face, asymmetrical eyes, over-smoothed skin, airbrushed, bloom

Do not spend all your effort on negative prompting. At CFG 1, it may do very little. At CFG 1.15-1.35, it may help slightly, but positive prompt structure and seed selection matter more.

Reference: Wan prompting guide on CFG 1 / negative prompts


17. Handling complex prompts

The model ignores or hallucinates around complex prompts because they ask for too many inventions at once.

A complex prompt often includes:

subject action
facial expression
body motion
camera motion
lighting change
background interpretation
style direction

That is too much for a source-faithful RapidBase workflow.

Instead of:

She turns to the camera, smiles, raises her hand, and the camera slowly zooms in.

Use separate clips:

Clip 1:
  same person gently blinks once

Clip 2:
  same person makes a tiny natural smile

Clip 3:
  same person slightly raises one hand, only if the hand is already visible

Do not ask for a hand raise if the hand is not clearly visible in the source image. If the model must invent a hand, it may also invent a new body or face.


18. Face permanence rules

Face permanence is mostly controlled by:

source image clarity
motion size
CFG
start_at_step
seed
prompt complexity
frame count
camera motion

Do:

use clear face images
keep motion small
use static camera
use one action only
keep CFG low
batch seeds
choose face permanence first

Avoid:

large head turns
speaking
wide smiles
looking away then back
hands crossing the face
camera movement
dramatic emotion
lighting changes
long clips before seed selection

The model is most likely to morph the face when asked for mouth/teeth motion, big expression changes, or head rotation. Blinks and breathing are much safer.


19. Should you add nodes?

Main recommendation:

Add almost nothing.

Your current workflow’s value is that it does not repaint too much. Extra nodes can easily destroy that.

Avoid adding during optimization:

face restore
style LoRAs
multiple LoRAs
high-strength LoRAs
upscalers before judging motion
interpolation before judging motion
color correction before judging model behavior

Upscale/interpolation should happen only after you choose:

prompt
seed
settings
motion
face permanence

20. Optional node: NAG

NAG is the one optional control idea that fits the problem.

Why it may help:

the model runs near CFG 1
negative prompts are weak
raising CFG can morph the face
NAG may add negative-prompt-like control without pushing CFG too hard

The ComfyUI-NAG README says NAG restores effective negative prompting in few-step diffusion models and can complement CFG. The NAG project page similarly describes NAG as a method for restoring negative prompting in few-step sampling.

How to test:

copy the workflow
add NAG only in the copy
keep CFG low
use the same seed and prompt
compare against the saved control

Remove it if it causes:

bloom
airbrushing
texture changes
face drift
loss of source quality

Do not make NAG part of the main workflow until it beats the control.


21. LoRAs: only one, only low strength

The Phr00t Rapid/AIO model card notes Wan 2.1 LoRA compatibility and low-noise Wan 2.2 LoRA compatibility, but warns against high-noise Wan 2.2 LoRAs for that family. See Phr00t WAN2.2 Rapid All-in-One.

If testing LoRAs:

one LoRA only
strength 0.15
strength 0.25
strength 0.35

Avoid:

1.0 strength
multiple LoRAs
style LoRAs
high-noise Wan2.2 LoRAs
character LoRAs unless necessary

For this workflow, LoRAs are more likely to hurt source fidelity than help, unless very targeted.


22. Free prompt restructuring resources

Do not run Ollama or a local LLM on the same GPU while using ComfyUI. On an 8GB card, that competes directly with Wan.

Use web tools or CPU-only local tools.

Free web options

Good enough:

ChatGPT Free
Google AI Studio / Gemini

Use one batched request rather than many small requests.


23. Prompt rewriter request template

Paste this into ChatGPT, Gemini, or a local helper.

Rewrite this as a short Wan2.2 image-to-video prompt for a low-VRAM RapidBase workflow.

Rules:
- one small action only
- preserve exact face and identity
- preserve hairstyle, clothing, lighting, colors, camera angle, and background
- static camera
- no zoom
- no pan
- no scene change
- avoid cinematic embellishment
- avoid new details not visible in the source image
- keep it literal and short
- output exactly 3 versions:
  1. safest source-faithful version
  2. slightly more expressive version
  3. shortest version

Original idea:
<put idea here>

This is better than asking “make the prompt better,” because “better” usually means more cinematic, more detailed, and more inventive — exactly what you do not want.


24. CPU-only local prompt helper

A local helper is optional.

Goal:

rewrite prompts
do not use GPU VRAM
do not compete with ComfyUI

A good tiny local option is LFM2.5-1.2B-Instruct-GGUF. LiquidAI’s docs explain that LFM models are available in GGUF format for llama.cpp-style use: LiquidAI llama.cpp deployment guide.

Example CPU-only server command:

llama-server \
  -hf LiquidAI/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M \
  -c 2048 \
  -ngl 0 \
  --host 127.0.0.1 \
  --port 8080

Important part:

-ngl 0

llama.cpp exposes GPU offload through the -ngl (number of GPU layers) option; setting it to 0 keeps the entire model on the CPU. See the llama.cpp server README.

Recommended order:

1. ChatGPT Free or Gemini
2. LFM2.5-1.2B Q4_K_M CPU-only
3. Qwen 2B-4B CPU-only if you want smarter rewriting
4. larger local models only if you have spare CPU/RAM
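
Once llama-server is running, it serves an OpenAI-compatible /v1/chat/completions endpoint, so a stdlib-only client is enough. A sketch; the helper names and the temperature choice are my assumptions:

```python
import json
import urllib.request

# Shortened stand-in; in practice, paste the full system prompt from
# the "Prompt helper system prompt" section here.
SYSTEM_PROMPT = "You are a prompt rewriting assistant for Wan2.2 image-to-video."

def build_rewrite_request(idea, system_prompt=SYSTEM_PROMPT):
    """Build an OpenAI-style chat payload for the local llama-server."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Rewrite this idea for Wan2.2 I2V:\n\n{idea}"},
        ],
        "temperature": 0.3,  # low: rewrites should stay literal, not creative
    }

def rewrite(idea, url="http://127.0.0.1:8080/v1/chat/completions"):
    """POST the request to the CPU-only server started above."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_rewrite_request(idea)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Keeping the system prompt in the client rather than baked into the server makes it easy to iterate on the rules without restarting llama-server.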

25. Prompt helper system prompt

Use this as the system prompt in ChatGPT, Gemini, LFM, Qwen, or any prompt helper.

You are a prompt rewriting assistant for Wan2.2 image-to-video.

Rewrite the user's idea into a short, literal, source-faithful I2V prompt.

Rules:
- Use one small action only.
- Preserve the exact same face and identity.
- Preserve hairstyle, clothing, lighting, colors, camera angle, and background.
- Keep the camera static.
- No zoom.
- No pan.
- No scene change.
- No cinematic embellishment.
- No new objects.
- Avoid talking, dancing, walking, large head turns, and large expression changes.
- Prefer subtle motion: blink, gentle breathing, tiny smile, very small eye movement.

Output exactly:
1. Safest:
2. Slightly more expressive:
3. Shortest:

Do not explain.

Then give it:

Rewrite this idea for Wan2.2 I2V:

<your idea>

Example input:

make her look at the camera and smile a bit, maybe some hair movement

Expected output style:

1. Safest:
The same person from the source image gently blinks once and makes a tiny natural smile. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No scene change.

2. Slightly more expressive:
The same person from the source image looks naturally toward the camera and makes a very small smile. Preserve the same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Only subtle natural motion. Static camera.

3. Shortest:
Same person, same face and identity. One subtle blink and tiny smile. Static camera. Same lighting and background.

26. What I would do next

  1. Keep rapidWAN22I2VGGUF_q4KMRapidBase.gguf as the main branch.
  2. Save the current workflow as the control.
  3. Test CFG 1.00 / 1.15 / 1.25 / 1.35 / 1.50.
  4. Test start_at_step: 1 vs 0.
  5. Test steps: 8 / 10 / 12.
  6. Test return_with_leftover_noise: enable vs disable once.
  7. Use seed batches after choosing settings.
  8. Use one-action prompts.
  9. Put preservation constraints in the positive prompt.
  10. Try NAG only in a duplicate workflow if negative prompting remains weak.
  11. Use ChatGPT/Gemini or CPU-only LFM2.5 for prompt rewriting, not a GPU LLM inside ComfyUI.

Short summary

  • Keep rapidWAN22I2VGGUF_q4KMRapidBase.gguf; it matches the source-fidelity goal.
  • Keep sa_solver / beta as the main branch.
  • Do not chase CFG 3+.
  • Test CFG only in a tiny range: 1.00 / 1.15 / 1.25 / 1.35 / 1.50.
  • Test start_at_step: 1 versus 0.
  • Test steps: 8 / 10 / 12.
  • Use seed batches; face permanence is seed-sensitive.
  • At CFG 1, negative prompts are weak. Put identity/background/camera constraints in the positive prompt.
  • Use one small action per prompt.
  • Add almost nothing to the workflow. NAG is the only optional control node worth testing, and only in a copy.
  • For prompt rewriting, use ChatGPT Free, Gemini/AI Studio, or a CPU-only tiny model like LFM2.5.