I’m glad the correct answer was included.
Hmm… I think I’ve got a pretty good grasp of the situation now. By the way, distilled models like Lightning tend to struggle with accurately reflecting prompt details, especially negative prompts, although there is still room for improvement there. Their responsiveness to positive prompts is actually quite good. Also, if you’re looking for responses to highly complex prompts, I think it’s worth considering other variants (if they exist).
Distilled models are often created by drastically pruning a model and then retraining it, and in the distilled version, parts that should not have been pruned for your specific purpose may have been removed. Well, I guess it can’t be helped if the goal is to save VRAM… But in any case, this means you also have to consider the performance of the model itself, or rather the inherent characteristics of the distilled model.
By the way, if you’re going to use an LLM for prompt refinement, I think the Gemini or ChatGPT API is the easiest route, but if you want to do it entirely locally, an OSS LLM might be better. For this purpose, a smaller model from a high-quality OSS model family is perfectly sufficient. The models provided by Liquid (which include 1.2B and even 350M variants) run just fine locally on a CPU. Other SOTA models like Qwen 3.5 and Gemma 4 in the 4B class or smaller can also run on a CPU alone. A 4B model is a bit heavy for a CPU, but at least these don’t consume VRAM; they run in RAM. Of course, they’d be much faster with VRAM!
Wan2.2 RapidBase I2V on 8GB VRAM: getting more prompt obedience without losing source-image fidelity
At this point I would stop chasing the normal High/Low UNet route for this GPU and use rapidWAN22I2VGGUF_q4KMRapidBase.gguf as the main workflow.
That is not a downgrade. For the actual goal here — make the source image look like it came to life while preserving the same face, same lighting, same color, same texture, same source quality, and no AI-looking bloom — this model is doing the right kind of thing. The normal High/Low route may be more flexible in theory, but on an 8GB card it is costing too much source fidelity.
The new goal should be:
Keep RapidBase.
Keep source-image fidelity.
Add only mild prompt pressure.
Reduce face morphing.
Avoid turning the workflow into a repainting/generative workflow.
1. Why RapidBase is the right baseline for this specific goal
The High/Low UNet experiment was still useful because it proved one thing: the duplicated SD3 shift setup really was causing artifacts. Removing those conflicting shift nodes fixed distortion and improved obedience/face permanence. But the second lesson is more important:
A technically cleaner High/Low workflow still did not give the desired look.
The preferred model, rapidWAN22I2VGGUF_q4KMRapidBase.gguf, behaves more like a source-preserving animator than a full generative video model. That is exactly why it works well for this use case.
It is good at:
keeping the source image quality
keeping low-res screengrabs looking like themselves
preserving lighting and colors
preserving background
avoiding the airbrushed Wan2.2 dream-sequence look
making the original picture move
It is weaker at:
complex multi-action prompts
large head turns
speaking / mouth motion
hand gestures
strong semantic obedience
large expression changes
camera moves
That tradeoff is expected. A workflow that preserves the source image 1:1 is not going to be as willing to invent new actions. More obedience usually requires more invention; more invention means more risk of face drift.
So the right strategy is not:
force the model to obey huge prompts
The right strategy is:
ask for one small action
add only mild prompt pressure
use seed batching
choose outputs by face permanence first
2. Current control setup
From the screenshot, the current effective workflow is roughly:
Model: rapidWAN22I2VGGUF_q4KMRapidBase.gguf
VAE: wan_2.1_vae.safetensors
Text encoder: umt5-xxl-encoder-Q8_0.gguf
KSampler Advanced:
add_noise: enable
steps: 10
cfg: 1.0
sampler_name: sa_solver
scheduler: beta
start_at_step: 1
end_at_step: 10000
return_with_leftover_noise: enable
Save this as the control workflow.
Do not overwrite it. Duplicate it before experiments.
Testing rule:
same image
same prompt
same seed
same frame count
same resolution
change one setting only
If you change CFG, steps, start step, sampler, and prompt at the same time, the result becomes impossible to interpret.
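The one-variable rule is easy to enforce if you generate the test variants from a single baseline dict instead of editing values ad hoc. Below is a minimal sketch; the setting names mirror the KSampler Advanced fields listed above, and nothing here talks to ComfyUI, it only prints the run list.

```python
# Baseline mirrors the control workflow's KSampler Advanced settings.
BASELINE = {
    "cfg": 1.0,
    "steps": 10,
    "start_at_step": 1,
    "return_with_leftover_noise": "enable",
}

# Each experiment varies exactly one key; everything else stays at baseline.
EXPERIMENTS = {
    "cfg": [1.00, 1.15, 1.25, 1.35, 1.50],
    "start_at_step": [1, 0],
    "steps": [8, 10, 12],
    "return_with_leftover_noise": ["enable", "disable"],
}

def variants(key):
    """Yield (label, settings) pairs that differ from BASELINE in one key only."""
    for value in EXPERIMENTS[key]:
        settings = dict(BASELINE, **{key: value})
        yield f"{key}={value}", settings

for key in EXPERIMENTS:
    for label, settings in variants(key):
        print(label, settings)
```

If a printed variant differs from the baseline in more than one field, the run is not a valid test.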
3. Why CFG should stay low
The Rapid/AIO family is explicitly described as a fast all-in-one merge designed around few steps and CFG 1. One README snapshot recommends:
4 steps
1 cfg
sa_solver sampler
beta scheduler
Source: Phr00t Rapid AIO README snapshot
That does not mean your workflow must use exactly 4 steps; the screenshot already works at 10 steps. But it does mean this model should be tuned like a few-step distilled / rapid model, not like a normal 20-30 step diffusion workflow.
Do not jump to:
cfg: 3.0
cfg: 4.0
cfg: 5.0
That is likely to cause:
face drift
new skin texture
bloom
over-smoothing
changed lighting
new expression
hallucinated details
Use a micro-range instead.
4. CFG test range
Current baseline:
cfg: 1.0
Recommended test values:
1.00
1.15
1.25
1.35
1.50
Interpretation:
| CFG | Expected behavior |
| --- | --- |
| 1.00 | maximum source fidelity, weakest negative-prompt effect |
| 1.15 | tiny prompt pressure |
| 1.25 | likely first useful obedience bump |
| 1.35 | upper mild test |
| 1.50 | stress test for face drift |
| 2.00+ | probably too much if face permanence matters |
The likely useful zone is:
cfg: 1.15-1.35
Rule:
Use the highest CFG that does not change the face.
Test like this:
Run A: cfg 1.00
Run B: cfg 1.15
Run C: cfg 1.25
Run D: cfg 1.35
Run E: cfg 1.50
Keep everything else identical.
Judge in this order:
1. same face / same identity
2. same source-image quality
3. no morphing
4. no artifacts
5. prompt obedience
6. natural motion
Prompt obedience is not the first priority. A clip that obeys perfectly but changes the face is a failed clip for this workflow.
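If you would rather not edit the CFG value five times by hand, the sweep can be queued through ComfyUI's HTTP API. This is a minimal sketch, assuming ComfyUI is running on its default port 8188, the control workflow was exported with "Save (API Format)", and "3" is the id of the KSamplerAdvanced node in that export; the file name and node id here are placeholders, so check your own JSON.

```python
import copy
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188/prompt"      # default ComfyUI API endpoint
WORKFLOW_FILE = "rapidbase_control_api.json"    # hypothetical API-format export of the control workflow
KSAMPLER_NODE_ID = "3"                          # assumption: look up the real node id in your export

CFG_VALUES = [1.00, 1.15, 1.25, 1.35, 1.50]

with open(WORKFLOW_FILE, "r", encoding="utf-8") as f:
    base_workflow = json.load(f)

for cfg in CFG_VALUES:
    wf = copy.deepcopy(base_workflow)
    wf[KSAMPLER_NODE_ID]["inputs"]["cfg"] = cfg  # change only the CFG; seed, steps, prompt stay fixed
    payload = json.dumps({"prompt": wf}).encode("utf-8")
    req = urllib.request.Request(
        COMFY_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        print(f"queued cfg={cfg}: {resp.read().decode('utf-8')}")
```

The script only removes the copy-paste step; judge the five clips in the order listed above.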
5. Negative prompts are weak at CFG 1
A common trap is adding a giant negative prompt and expecting it to control the output. In many few-step Wan/Rapid/Lightning-style workflows, CFG 1 means negative prompts are weak or mostly inactive.
The Wan prompting guide explains this directly: in standard diffusion, CFG above 1 gives the model a stronger positive-vs-negative comparison, but in few-step CFG 1 workflows, negative prompts often do little. See How to get the most out of prompts for WAN models.
Practical consequence:
Do not rely on a huge negative prompt.
Put the important preservation rules in the positive prompt.
Positive prompt should explicitly say:
same face
same identity
same hairstyle
same clothing
same lighting
same colors
same camera angle
same background
static camera
no zoom
no scene change
only subtle motion
A short negative prompt is still fine, but it is secondary.
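Because the preservation rules have to ride along in the positive prompt, it can help to assemble the prompt from one small action plus a fixed preservation clause, so the constraints are never forgotten. A minimal sketch; the clause text is simply the list above joined into sentences.

```python
# Fixed preservation clause; mirrors the constraint list above.
PRESERVE = (
    "Preserve the exact same face, identity, hairstyle, clothing, lighting, "
    "colors, camera angle, and background. Static camera. No zoom. "
    "No scene change. Only subtle natural motion."
)

def build_positive_prompt(action: str) -> str:
    """Combine one small action with the fixed preservation clause."""
    return f"The same person from the source image {action}. {PRESERVE}"

print(build_positive_prompt("gently blinks once"))
print(build_positive_prompt("makes a tiny natural smile while gently breathing"))
```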
6. start_at_step: test 1 vs 0
Current screenshot:
start_at_step: 1
This may be helping source fidelity. Starting at step 1 can skip a tiny early part of the denoising path, which may reduce repainting.
Test only:
start_at_step: 1
start_at_step: 0
Expected tradeoff:
| start_at_step | Likely benefit | Risk |
| --- | --- | --- |
| 1 | better source fidelity and face permanence | weaker motion / weaker prompt response |
| 0 | more motion and prompt response | more face drift / more repainting |
Suggested test:
Run A: cfg 1.25, start_at_step 1, steps 10
Run B: cfg 1.25, start_at_step 0, steps 10
Possible decisions:
| Result | Keep |
| --- | --- |
| 0 improves obedience and face stays stable | start_at_step: 0 |
| 0 gives more motion but face changes | start_at_step: 1 |
| no meaningful difference | start_at_step: 1 |
| 0 adds bloom/repainting | start_at_step: 1 |
My expectation: start_at_step: 1 may remain the safest default.
7. Steps: test 8 / 10 / 12
Current setting:
steps: 10
This may already be close to the sweet spot.
Few-step distilled models do not always improve with more steps. Sometimes extra steps create more smoothing, blending, or repainting.
Test only:
steps: 8
steps: 10
steps: 12
Expected behavior:
| Steps | Likely behavior |
| --- | --- |
| 8 | faster, possibly more source-faithful, possibly weaker obedience |
| 10 | current working baseline |
| 12 | may improve smoothness/obedience, but may add bloom or airbrushing |
| 16+ | not recommended for this model unless intentionally stress-testing |
Suggested test:
Run A: steps 8
Run B: steps 10
Run C: steps 12
Keep the best balance. If 12 adds the “dream sequence” look, go back to 10.
8. return_with_leftover_noise: test once
Current screenshot:
return_with_leftover_noise: enable
end_at_step: 10000
Since end_at_step is far beyond the actual step count, the sampler is probably completing its pass. This setting may not matter much, but test it once.
Run A:
return_with_leftover_noise: enable
Run B:
return_with_leftover_noise: disable
Keep whichever preserves the “picture came to life” look.
Do not spend a whole day on this. It is unlikely to be the main obedience or face-permanence control.
9. add_noise: keep enabled
Keep:
add_noise: enable
For image-to-video, the model needs noise to create motion. If you disable it, you may get a more frozen output or odd behavior depending on the rest of the graph.
Only test add_noise: disable if diagnosing a very specific problem:
every seed changes the face
motion is always too aggressive
the image is being repainted too much
Even then, treat it as a diagnostic test, not the likely final setting.
10. Sampler and scheduler: keep sa_solver / beta
Your current best branch uses:
sampler_name: sa_solver
scheduler: beta
Keep that as the main branch.
The Rapid/AIO README snapshot specifically recommends sa_solver and beta for that family. Source: Rapid AIO README snapshot.
If you want to test alternatives, do it only after the CFG/start/steps tests, and keep them as separate branches:
Branch A: sa_solver / beta
Branch B: euler / beta
Branch C: euler_a / beta
Branch D: euler / simple
Expected behavior:
| Sampler / scheduler | Likely behavior |
| --- | --- |
| sa_solver / beta | best current source-fidelity branch |
| euler / beta | may obey differently, possibly less faithful |
| euler_a / beta | more variation/motion, higher face-drift risk |
| euler / simple | more relevant to Lightning/LightX2V-style workflows |
I would not change sampler/scheduler unless the smaller tests fail.
11. Seed batching is now one of the strongest tools
You already noticed face morphing is seed-dependent. That is real.
In video generation, the seed affects:
eye behavior
mouth behavior
micro-expression
small head motion
whether face identity drifts
whether the source texture holds
Use two phases.
Phase A — setting tests
Use one fixed seed:
fixed seed
same image
same prompt
same resolution
same frame count
change one setting only
This tells you what the setting does.
Phase B — production seed search
After choosing settings, run:
8-16 seeds
same image
same prompt
same final settings
short preview first
Pick by this priority:
1. same face / same identity
2. same source-image quality
3. no morphing
4. natural motion
5. prompt obedience
6. no artifacts
For your goal, a seed that keeps the face and obeys 70% is better than a seed that obeys 100% and changes the person.
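Phase B is mechanical enough to script. Below is a minimal sketch of queuing a seed batch through ComfyUI's HTTP API, under the same assumptions as before: ComfyUI on its default port 8188, an API-format export of the final workflow, and the KSamplerAdvanced node id looked up in that export; noise_seed is the seed field that node uses.

```python
import copy
import json
import random
import urllib.request

COMFY_URL = "http://127.0.0.1:8188/prompt"    # default ComfyUI API endpoint
WORKFLOW_FILE = "rapidbase_final_api.json"    # hypothetical API-format export of the final settings
KSAMPLER_NODE_ID = "3"                        # assumption: check your own export
SEED_COUNT = 12                               # somewhere in the 8-16 range

with open(WORKFLOW_FILE, "r", encoding="utf-8") as f:
    base_workflow = json.load(f)

seeds = [random.randrange(2**32) for _ in range(SEED_COUNT)]
for seed in seeds:
    wf = copy.deepcopy(base_workflow)
    wf[KSAMPLER_NODE_ID]["inputs"]["noise_seed"] = seed  # only the seed changes
    payload = json.dumps({"prompt": wf}).encode("utf-8")
    req = urllib.request.Request(
        COMFY_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()
    print("queued seed", seed)
```

Note the printed seeds so the good ones can be rerun at full length later.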
12. Exact tuning plan
Matrix 0 — save control
Model: rapidWAN22I2VGGUF_q4KMRapidBase.gguf
VAE: wan_2.1_vae.safetensors
Text encoder: umt5-xxl-encoder-Q8_0.gguf
KSampler Advanced:
add_noise: enable
steps: 10
cfg: 1.0
sampler_name: sa_solver
scheduler: beta
start_at_step: 1
end_at_step: 10000
return_with_leftover_noise: enable
Save this output as the reference.
Matrix 1 — CFG
cfg: 1.00
cfg: 1.15
cfg: 1.25
cfg: 1.35
cfg: 1.50
Pick the highest CFG that does not alter identity.
Matrix 2 — start step
Use the best CFG.
start_at_step: 1
start_at_step: 0
Keep 1 unless 0 clearly improves obedience without face drift.
Matrix 3 — steps
Use best CFG and best start step.
steps: 8
steps: 10
steps: 12
Keep the one with the least bloom/airbrushing and best face permanence.
Matrix 4 — leftover noise
Use best CFG/start/steps.
return_with_leftover_noise: enable
return_with_leftover_noise: disable
Keep the more source-faithful result.
Matrix 5 — seed batch
Use final settings.
8-16 seeds
short preview
same prompt
same image
Pick the seed by face permanence first.
13. Recommended presets
Preset A — safest source fidelity
Use when the face must stay the same.
Model: rapidWAN22I2VGGUF_q4KMRapidBase.gguf
VAE: wan_2.1_vae.safetensors
Text encoder: umt5-xxl-encoder-Q8_0.gguf
KSampler Advanced:
add_noise: enable
steps: 10
cfg: 1.0
sampler_name: sa_solver
scheduler: beta
start_at_step: 1
end_at_step: 10000
return_with_leftover_noise: enable
Use for:
portraits
faces
low-res screengrabs
source-quality preservation
subtle motion
Preset B — slightly more obedient
Same as Preset A, except:
cfg: 1.15
Then test:
cfg: 1.25
Stop if the face changes.
Preset C — stronger motion test
Same as Preset A, except:
start_at_step: 0
cfg: 1.15
If the face changes, return to:
start_at_step: 1
Preset D — smoothness test
Same as Preset A, except:
steps: 12
If it adds bloom or airbrushing, return to:
steps: 10
Preset E — faster seed scouting
Same as Preset A, except:
steps: 8
shorter frame count
lower test resolution
Use this only for finding seeds quickly, then rerun good seeds at normal settings.
14. Prompt strategy: one action only
This workflow needs simple prompts.
Bad prompt:
The person turns their head, smiles, raises their hand, looks into the camera, hair moves in the wind, camera slowly zooms in, cinematic lighting.
Why this is bad:
too many actions
requires new expression
requires new pose
requires new hair behavior
requires camera motion
invites lighting changes
increases face drift
Better prompt:
The same person from the source image gently blinks once. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No scene change.
Best rule:
one generation = one small action
Safe actions:
one subtle blink
gentle breathing
tiny natural smile
slight eye movement
very small head tilt
Risky actions:
speaking
laughing widely
turning head far
walking
dancing
raising hands
hair blowing strongly
camera zoom
camera orbit
lighting change
For this workflow, obedience improves when the requested action is simple enough that the model does not need to repaint the person.
15. Positive prompt templates
Since negative prompts are weak at CFG 1, put preservation constraints in the positive prompt.
Safe source-faithful template
The same person from the source image gently blinks once. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No pan. No scene change. Natural subtle motion. Sharp face.
Slightly more expressive template
The same person from the source image makes a tiny natural smile while gently breathing. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No scene change.
Minimal template
Same person, same face, same identity, same lighting and background. One subtle blink. Static camera.
Face permanence template
The same person keeps the exact same face and identity throughout the video. Only subtle natural breathing and one small blink. Same hairstyle, clothing, lighting, colors, camera angle, and background. Static camera.
The repetition of “same face” and “same identity” is not elegant, but it is useful conditioning.
16. Negative prompt template
Keep it short.
different person, face change, identity change, warped face, distorted eyes, changing hairstyle, changing clothes, changing background, camera movement, zoom, scene change, blurry face
Optional additions:
extra teeth, melted face, asymmetrical eyes, over-smoothed skin, airbrushed, bloom
Do not spend all your effort on negative prompting. At CFG 1, it may do very little. At CFG 1.15-1.35, it may help slightly, but positive prompt structure and seed selection matter more.
Reference: Wan prompting guide on CFG 1 / negative prompts
17. Handling complex prompts
The model refuses or hallucinates complex prompts because they ask for too many inventions at once.
A complex prompt often includes:
subject action
facial expression
body motion
camera motion
lighting change
background interpretation
style direction
That is too much for a source-faithful RapidBase workflow.
Instead of:
She turns to the camera, smiles, raises her hand, and the camera slowly zooms in.
Use separate clips:
Clip 1: same person gently blinks once
Clip 2: same person makes a tiny natural smile
Clip 3: same person slightly raises one hand, only if the hand is already visible
Do not ask for a hand raise if the hand is not clearly visible in the source image. If the model must invent a hand, it may also invent a new body or face.
18. Face permanence rules
Face permanence is mostly controlled by:
source image clarity
motion size
CFG
start_at_step
seed
prompt complexity
frame count
camera motion
Do:
use clear face images
keep motion small
use static camera
use one action only
keep CFG low
batch seeds
choose face permanence first
Avoid:
large head turns
speaking
wide smiles
looking away then back
hands crossing the face
camera movement
dramatic emotion
lighting changes
long clips before seed selection
The model is most likely to morph the face when asked for mouth/teeth motion, big expression changes, or head rotation. Blinks and breathing are much safer.
19. Should you add nodes?
Main recommendation:
Add almost nothing.
Your current workflow’s value is that it does not repaint too much. Extra nodes can easily destroy that.
Avoid adding during optimization:
face restore
style LoRAs
multiple LoRAs
high-strength LoRAs
upscalers before judging motion
interpolation before judging motion
color correction before judging model behavior
Upscale/interpolation should happen only after you choose:
prompt
seed
settings
motion
face permanence
20. Optional node: NAG
NAG is the one optional control idea that fits the problem.
Why it may help:
the model runs near CFG 1
negative prompts are weak
raising CFG can morph the face
NAG may add negative-prompt-like control without pushing CFG too hard
The ComfyUI-NAG README says NAG restores effective negative prompting in few-step diffusion models and can complement CFG. The NAG project page similarly describes NAG as a method for restoring negative prompting in few-step sampling.
How to test:
copy the workflow
add NAG only in the copy
keep CFG low
use the same seed and prompt
compare against the saved control
Remove it if it causes:
bloom
airbrushing
texture changes
face drift
loss of source quality
Do not make NAG part of the main workflow until it beats the control.
21. LoRAs: only one, only low strength
The Phr00t Rapid/AIO model card notes Wan 2.1 LoRA compatibility and low-noise Wan 2.2 LoRA compatibility, but warns against high-noise Wan 2.2 LoRAs for that family. See Phr00t WAN2.2 Rapid All-in-One.
If testing LoRAs:
one LoRA only
strength 0.15
strength 0.25
strength 0.35
Avoid:
1.0 strength
multiple LoRAs
style LoRAs
high-noise Wan2.2 LoRAs
character LoRAs unless necessary
For this workflow, LoRAs are more likely to hurt source fidelity than help, unless very targeted.
22. Free prompt restructuring resources
Do not run Ollama or a local LLM on the same GPU while using ComfyUI. On an 8GB card, that competes directly with Wan.
Use web tools or CPU-only local tools.
Free web options
Good enough:
ChatGPT Free
Google AI Studio / Gemini
Use one batched request rather than many small requests.
23. Prompt rewriter request template
Paste this into ChatGPT, Gemini, or a local helper.
Rewrite this as a short Wan2.2 image-to-video prompt for a low-VRAM RapidBase workflow.
Rules:
- one small action only
- preserve exact face and identity
- preserve hairstyle, clothing, lighting, colors, camera angle, and background
- static camera
- no zoom
- no pan
- no scene change
- avoid cinematic embellishment
- avoid new details not visible in the source image
- keep it literal and short
- output exactly 3 versions:
1. safest source-faithful version
2. slightly more expressive version
3. shortest version
Original idea:
<put idea here>
This is better than asking “make the prompt better,” because “better” usually means more cinematic, more detailed, and more inventive — exactly what you do not want.
24. CPU-only local prompt helper
A local helper is optional.
Goal:
rewrite prompts
do not use GPU VRAM
do not compete with ComfyUI
A good tiny local option is LFM2.5-1.2B-Instruct-GGUF. LiquidAI’s docs explain that LFM models are available in GGUF format for llama.cpp-style use: LiquidAI llama.cpp deployment guide.
Example CPU-only server command:
llama-server \
-hf LiquidAI/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M \
-c 2048 \
-ngl 0 \
--host 127.0.0.1 \
--port 8080
Important part:
-ngl 0
The llama.cpp server exposes GPU offload through its GPU-layers option (-ngl); setting it to zero keeps every layer on the CPU, which is the relevant CPU-only principle here. See llama.cpp server README.
Recommended order:
1. ChatGPT Free or Gemini
2. LFM2.5-1.2B Q4_K_M CPU-only
3. Qwen 2B-4B CPU-only if you want smarter rewriting
4. larger local models only if you have spare CPU/RAM
25. Prompt helper system prompt
Use this as the system prompt in ChatGPT, Gemini, LFM, Qwen, or any prompt helper.
You are a prompt rewriting assistant for Wan2.2 image-to-video.
Rewrite the user's idea into a short, literal, source-faithful I2V prompt.
Rules:
- Use one small action only.
- Preserve the exact same face and identity.
- Preserve hairstyle, clothing, lighting, colors, camera angle, and background.
- Keep the camera static.
- No zoom.
- No pan.
- No scene change.
- No cinematic embellishment.
- No new objects.
- Avoid talking, dancing, walking, large head turns, and large expression changes.
- Prefer subtle motion: blink, gentle breathing, tiny smile, very small eye movement.
Output exactly:
1. Safest:
2. Slightly more expressive:
3. Shortest:
Do not explain.
Then give it:
Rewrite this idea for Wan2.2 I2V:
<your idea>
Example input:
make her look at the camera and smile a bit, maybe some hair movement
Expected output style:
1. Safest:
The same person from the source image gently blinks once and makes a tiny natural smile. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No scene change.
2. Slightly more expressive:
The same person from the source image looks naturally toward the camera and makes a very small smile. Preserve the same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Only subtle natural motion. Static camera.
3. Shortest:
Same person, same face and identity. One subtle blink and tiny smile. Static camera. Same lighting and background.
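If the llama-server command from section 24 is running, the same system prompt can be used from a script instead of a chat window. This is a minimal sketch, assuming the server is on 127.0.0.1:8080 and serving its OpenAI-compatible chat endpoint; the SYSTEM_PROMPT string here is a shortened stand-in for the full section 25 text.

```python
import json
import urllib.request

LLAMA_URL = "http://127.0.0.1:8080/v1/chat/completions"  # llama-server's OpenAI-compatible endpoint

# Shortened stand-in for the section 25 system prompt; paste the full text in practice.
SYSTEM_PROMPT = (
    "You are a prompt rewriting assistant for Wan2.2 image-to-video. "
    "Rewrite the user's idea into a short, literal, source-faithful I2V prompt. "
    "One small action only. Preserve face, identity, hairstyle, clothing, lighting, "
    "colors, camera angle, and background. Static camera. "
    "Output exactly: 1. Safest: 2. Slightly more expressive: 3. Shortest: Do not explain."
)

def rewrite(idea: str) -> str:
    """Send one idea to the local llama-server and return the rewritten prompts."""
    payload = json.dumps({
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Rewrite this idea for Wan2.2 I2V:\n{idea}"},
        ],
        "temperature": 0.3,  # keep the rewrite literal rather than creative
    }).encode("utf-8")
    req = urllib.request.Request(
        LLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    return data["choices"][0]["message"]["content"]

print(rewrite("make her look at the camera and smile a bit, maybe some hair movement"))
```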
26. What I would do next
- Keep rapidWAN22I2VGGUF_q4KMRapidBase.gguf as the main branch.
- Save the current workflow as the control.
- Test CFG 1.00 / 1.15 / 1.25 / 1.35 / 1.50.
- Test start_at_step: 1 vs 0.
- Test steps: 8 / 10 / 12.
- Test return_with_leftover_noise: enable vs disable once.
- Use seed batches after choosing settings.
- Use one-action prompts.
- Put preservation constraints in the positive prompt.
- Try NAG only in a duplicate workflow if negative prompting remains weak.
- Use ChatGPT/Gemini or CPU-only LFM2.5 for prompt rewriting, not a GPU LLM inside ComfyUI.
Short summary
- Keep rapidWAN22I2VGGUF_q4KMRapidBase.gguf; it matches the source-fidelity goal.
- Keep sa_solver / beta as the main branch.
- Do not chase CFG 3+.
- Test CFG only in a tiny range: 1.00 / 1.15 / 1.25 / 1.35 / 1.50.
- Test start_at_step: 1 versus 0.
- Test steps: 8 / 10 / 12.
- Use seed batches; face permanence is seed-sensitive.
- At CFG 1, negative prompts are weak. Put identity/background/camera constraints in the positive prompt.
- Use one small action per prompt.
- Add almost nothing to the workflow. NAG is the only optional control node worth testing, and only in a copy.
- For prompt rewriting, use ChatGPT Free, Gemini/AI Studio, or a CPU-only tiny model like LFM2.5.