Wan2.2 i2v (clarifications needed regarding settings on a low-VRAM system)

Is there any way to formulate prompts that imply: do action A, then after action A is done, do action B?

There seem to be several methods available, but some of them are difficult to use in an 8GB VRAM environment:


Wan2.2 RapidBase I2V: sequential actions, prompt weights, and continuity-safe A → B workflows

Short answer:

Yes, you can write prompts like “do action A, then after A is done, do action B.” The model can understand that language. But in a normal single-prompt I2V workflow, that instruction is usually a soft temporal suggestion, not a reliable frame-accurate command.

For the current RapidBase workflow, the safest ranking is:

1. One-clip two-beat prompt
2. Two clips with a handoff frame
3. Neutral overlap + short crossfade
4. FLF2V bridge clip
5. Prompt Relay
6. Prompt Schedule / FizzNodes

For prompt weights:

Avoid:
  (((action:1.9)))

Prefer:
  (same face and identity:1.10)
  (preserve exact face:1.10)
  (static camera:1.10)
  (tiny natural smile:1.05)

The key rule is:

Weight preservation more than action.

The current workflow is working because it preserves the source image. Anything that pushes too hard toward complex action can also push the model into repainting, hallucination, or face drift.


1. Does the model understand “A, then B”?

It can understand the wording, but it does not necessarily execute it as an exact timeline.

A prompt like this is understandable:

The same person first blinks once, then after a brief pause makes a tiny natural smile.

But in a normal I2V generation, the text prompt conditions the whole clip. It is not automatically split into exact frame ranges like:

frames 0-16:
  action A

frames 17-33:
  action B

So the model may interpret “first A, then B” loosely.

Possible outcomes:

"blink once, then smile" -> blink and smile happen in the right order
"blink once, then smile" -> smile starts before the blink finishes
"blink once, then smile" -> only the smile happens
"look down, then look back" -> gaze drifts vaguely instead of following the exact order
"A, then B, then C" -> one action is skipped or the face starts drifting

This is normal for a single-prompt video model. The model sees the whole instruction, but it is not a strict animation timeline unless you use timeline-control tools.



2. Best first method: one-clip two-beat prompting

For the current RapidBase workflow, this is the best first method.

It does not add nodes, LoRAs, bridge models, scheduling tools, or extra VRAM load. It also protects the things the current setup is already good at:

same source image
same face
same lighting
same texture
same background
low hallucination

The limitation is that A → B order is only approximate.

Good use cases

Use one-clip two-beat prompts for small actions:

blink once -> tiny smile
gentle breathing -> blink once
look slightly downward -> return eyes to camera
tiny smile -> neutral expression
eyes shift slightly left -> eyes return to camera
neutral expression -> tiny smile

Bad use cases

Avoid large or multi-stage sequences:

turn head -> talk -> raise hand
walk forward -> gesture -> camera zooms in
look away -> laugh -> turn back
large smile -> speaking -> hair blowing
pose change -> lighting change -> background reaction

Each extra action increases the chance of:

face drift
changed mouth shape
changed eye shape
new lighting
new camera angle
background mutation
AI-looking repainting

3. Good wording for “after A is done, do B”

Use completion language, not just a loose list.

Weak:

blink and smile

Better:

first blinks once, then after the blink is complete, slowly forms a tiny natural smile

Good sequencing phrases:

first <action A>, then after a brief pause <action B>
after <action A> is complete, <action B>
begins still, then <action A>, then settles into <action B>
first holds a neutral expression, then gradually <action B>
after returning to neutral, <action B>

Avoid vague or overloaded phrasing:

blink and smile naturally
perform a sequence of expressions
react emotionally
do a cute expression
move seductively
act naturally

Vague words invite the model to improvise. Improvisation is where identity drift usually starts.


4. Practical one-clip formula

Use this structure:

[identity lock] + [starting state] + [action A] + [pause/settle] + [action B] + [camera lock] + [scene lock]

Example:

The same person from the source image keeps the exact same face and identity. The video begins with a calm neutral expression. First, the person gently blinks once. After the blink is complete, the person slowly forms a tiny natural smile. Preserve the same hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No pan. No scene change.

This is better than:

she blinks then smiles

because it tells the model:

who must remain the same
what state to start from
what action comes first
what happens after
what must not change
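
If you write many prompts in this shape, a small helper that fills the formula's slots keeps the structure consistent. This is only a convenience sketch in Python; the function name and wording are illustrative, not part of any workflow or library.

def two_beat_prompt(action_a: str, action_b: str) -> str:
    """Assemble: identity lock + starting state + action A + pause + action B + camera/scene lock."""
    return (
        "The same person from the source image keeps the exact same face and identity. "
        "The video begins with a calm neutral expression. "
        f"First, the person {action_a}. "
        f"After that is complete, the person {action_b}. "
        "Preserve the same hairstyle, clothing, lighting, colors, camera angle, and background. "
        "Static camera. No zoom. No pan. No scene change."
    )

print(two_beat_prompt("gently blinks once", "slowly forms a tiny natural smile"))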

5. One-clip two-beat prompt templates

Safest A → B template

The same person from the source image first blinks once, then after a brief pause makes a tiny natural smile. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No pan. No scene change. Subtle natural motion only.

More explicit timing template

The video begins with the same person holding still. First, the person gently blinks once. After the blink is complete, the person slowly forms a tiny natural smile. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No scene change.

Face-first template

Preserve the exact same face and identity throughout the video. The same person first blinks once, then after a brief pause makes a tiny natural smile. Same hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No scene change.

Short template

Same person, same face and identity. First one subtle blink, then a tiny natural smile. Static camera. Same lighting and background.

Very safe template

The same person keeps the exact same face and identity throughout the video. First, one small blink. Then, a tiny natural smile. Same hairstyle, clothing, lighting, colors, camera angle, and background. Static camera.

6. When a single prompt is not enough

If exact order matters, use two clips.

Do not do this:

Clip 1:
  original source image -> action A

Clip 2:
  original source image -> action B

That creates two independent clips from the same original starting point. Clip 2 does not know where Clip 1 ended.

Better:

Clip 1:
  original source image -> action A only

Handoff frame:
  clean stable frame near the end of Clip 1

Clip 2:
  handoff frame -> action B only

This is the most practical way to get reliable A → B ordering without adding complex nodes.


7. Handoff-frame workflow

Process

1. Generate Clip 1 with action A only.
2. Inspect the last 3-10 frames.
3. Do not blindly use the final frame.
4. Pick the cleanest stable frame:
   - best face
   - least blur
   - stable lighting
   - stable background
   - expression suitable for the next action
5. Save that frame as a PNG (a small extraction sketch follows this list).
6. Use it as the source image for Clip 2.
7. Prompt Clip 2 for action B only.
8. Keep settings consistent:
   - same resolution
   - same FPS
   - same VAE
   - same text encoder
   - same sampler
   - same scheduler
   - same CFG
   - same steps
   - same prompt style
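
To automate steps 2 to 6, the sketch below dumps the last few frames of Clip 1 as PNGs so the cleanest one can be picked by eye. It assumes ffmpeg and ffprobe are installed and on PATH; the file names and frame count are illustrative.

import subprocess
from pathlib import Path

def dump_last_frames(clip_path: str, out_dir: str, n_frames: int = 10) -> None:
    """Export the last n_frames of a clip as PNGs for manual inspection."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # Count video frames by decoding them (fine for short clips).
    count = int(subprocess.check_output([
        "ffprobe", "-v", "error", "-select_streams", "v:0",
        "-count_frames", "-show_entries", "stream=nb_read_frames",
        "-of", "default=noprint_wrappers=1:nokey=1", clip_path,
    ]).decode().strip())
    start = max(count - n_frames, 0)
    # Keep only frames with index >= start and write each one as a PNG.
    subprocess.run([
        "ffmpeg", "-y", "-i", clip_path,
        "-vf", f"select='gte(n,{start})'", "-vsync", "0",
        f"{out_dir}/handoff_%03d.png",
    ], check=True)

# Inspect the PNGs, then reuse the best one as the source image for Clip 2.
dump_last_frames("clip1.mp4", "handoff_frames", n_frames=10)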

Clip 1 example

The same person gently blinks once, then returns to a calm neutral expression. Preserve the same face, identity, lighting, clothing, colors, camera angle, and background. Static camera.

Clip 2 example

The same person begins from a calm neutral expression, then slowly forms a tiny natural smile. Preserve the same face, identity, lighting, clothing, colors, camera angle, and background. Static camera.

This is more reliable than trying to force a complex sequence into one prompt.


8. Neutral overlap and crossfade

If using two clips, make the join happen during a neutral moment.

Bad join:

Clip 1 ends during a blink.
Clip 2 starts with a smile.

Better join:

Clip 1 ends after returning to neutral.
Clip 2 starts from the neutral handoff frame.

If the join is slightly visible, use a short crossfade.

Typical overlap:

4-8 frames
same FPS
same resolution
same color settings
same encoding settings

FFmpeg example:

ffmpeg \
  -i clip1.mp4 \
  -i clip2.mp4 \
  -filter_complex "[0:v][1:v]xfade=transition=fade:duration=0.25:offset=2.75[v]" \
  -map "[v]" \
  -c:v libx264 -crf 18 -preset slow \
  output.mp4

The offset value is where the crossfade starts on Clip 1's timeline, so it must be adjusted per clip: typically Clip 1's duration minus the fade duration.
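
If the join is scripted, the offset does not need to be guessed; it can be computed from Clip 1's duration. A minimal sketch, assuming ffmpeg and ffprobe are on PATH and using the same illustrative file names as above:

import subprocess

def clip_duration(path: str) -> float:
    """Read a clip's duration in seconds via ffprobe."""
    out = subprocess.check_output([
        "ffprobe", "-v", "error", "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1", path,
    ])
    return float(out)

fade = 0.25                                  # crossfade length in seconds
offset = clip_duration("clip1.mp4") - fade   # start the fade this far into Clip 1

subprocess.run([
    "ffmpeg", "-y", "-i", "clip1.mp4", "-i", "clip2.mp4",
    "-filter_complex",
    f"[0:v][1:v]xfade=transition=fade:duration={fade}:offset={offset:.3f}[v]",
    "-map", "[v]", "-c:v", "libx264", "-crf", "18", "-preset", "slow",
    "output.mp4",
], check=True)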


Important:

A crossfade can hide a small seam. It cannot fix a true face, lighting, or background mismatch.

If the face is different between the clips, a crossfade may create ghosting or a double-face dissolve.


9. FLF2V bridge clip

FLF2V means First-Last Frame to Video.

Instead of simply crossfading Clip A into Clip B, you provide:

first frame = stable end frame of Clip A
last frame  = stable start frame of Clip B

Then the model generates the transition between them.

Concept:

Clip A:
  source -> action A

Clip B:
  handoff/source -> action B

Bridge:
  first frame = stable end frame of Clip A
  last frame  = stable start frame of Clip B
  prompt      = smooth subtle transition, same face, same lighting, static camera

Why it can help:

more natural transition than crossfade
can reduce a sudden jump between two clips
uses actual visual endpoints

Why it may not be ideal for the current 8GB RapidBase workflow:

separate workflow family
may be heavier
may not preserve the same RapidBase look
may introduce bloom or airbrushed style
may require more setup and testing

Use FLF2V only if:

Clip A is good.
Clip B is good.
The join is visibly bad.
A simple crossfade is not good enough.
The bridge can be short.



10. Prompt Relay

Prompt Relay is closer to the real solution for “A happens in one segment, B happens in another segment.”

Instead of relying on a single prompt, Prompt Relay routes different prompts through different temporal segments.

Concept:

Global prompt:
  same person, same face, same identity, same lighting, same background, static camera

Segment 1:
  blink once

Segment 2:
  tiny natural smile

Why it is attractive:

A and B happen inside one timeline
less independent-clip continuity drift
global identity/camera constraints can stay active
different segments can receive different action prompts

Why it should be treated carefully:

changes the workflow structure
may not plug cleanly into the current RapidBase GGUF workflow
may increase complexity
8GB behavior is uncertain
could break the source-fidelity look

Do not add it to the working workflow directly. Test only in a duplicate workflow.



11. Prompt Schedule / FizzNodes

Prompt scheduling is the general concept of changing prompt conditioning over time.

Concept:

Frames 0-16:
  same person gently blinks once

Frames 17-33:
  same person slowly forms a tiny smile

Why it may help:

more explicit temporal control
frame/segment-based prompt changes
better than hoping a single prompt follows order

Why it is not the first recommendation here:

not guaranteed to fit the current RapidBase GGUF workflow
can change conditioning behavior
may break the current source-fidelity look
adds complexity
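
To make the scheduling idea concrete without assuming any node pack's syntax, the generic sketch below maps segment start frames to prompts and looks up which prompt is active at a given frame. It is conceptual only; FizzNodes and similar nodes define their own schedule format, so check their documentation before touching the working workflow.

# Keyframed prompts: the frame index at which each prompt becomes active.
schedule = {
    0:  "same person, same face, static camera, gently blinks once",
    17: "same person, same face, static camera, slowly forms a tiny natural smile",
}

def prompt_at(frame: int) -> str:
    """Return the prompt of the last keyframe at or before `frame`."""
    active = max(k for k in schedule if k <= frame)
    return schedule[active]

for f in (0, 16, 17, 33):
    print(f, "->", prompt_at(f))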



12. Recommended method ranking

1. One-clip two-beat prompt
   A → B reliability: medium. Source fidelity: high. 8GB friendliness: high. Recommendation: try first.
2. Two clips + handoff frame
   A → B reliability: high. Source fidelity: medium-high. 8GB friendliness: high. Recommendation: best practical method.
3. Two clips + neutral crossfade
   A → B reliability: medium-high. Source fidelity: medium-high. 8GB friendliness: high. Recommendation: good polish.
4. FLF2V bridge
   A → B reliability: high for the transition. Source fidelity: medium. 8GB friendliness: medium-low. Recommendation: separate experiment.
5. Prompt Relay
   A → B reliability: high conceptually. Source fidelity: unknown. 8GB friendliness: unknown. Recommendation: advanced experiment.
6. Prompt Schedule / FizzNodes
   A → B reliability: medium-high conceptually. Source fidelity: unknown. 8GB friendliness: medium. Recommendation: experimental.

Best practical rule:

simple A -> B:
  use one-clip two-beat prompt

strict A -> B:
  use two clips with a handoff frame

smooth transition:
  use handoff frame + optional crossfade

true timeline control:
  test Prompt Relay or Prompt Schedule only in a duplicate workflow

13. Prompt weights: do they work?

Yes, ComfyUI prompt weights can work.

Common syntax:

(phrase:1.2)

Plain parentheses also increase weight. ComfyUI’s CLIPTextEncode documentation says plain parentheses apply a default weight of 1.1, and the ComfyUI Community Manual says nested weights multiply.

Examples:

(phrase)
  roughly increases emphasis

(phrase:1.2)
  explicit weight

((phrase:1.2):0.5)
  nested weights multiply
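
As a worked check of the multiplication rule: one plain parenthesis layer acts like a 1.1 factor, and nested explicit weights multiply, so ((phrase:1.2):0.5) lands around 0.6. A tiny arithmetic sketch, illustrative only and not ComfyUI's actual parser:

def effective_weight(*factors: float) -> float:
    """Multiply nested emphasis factors into one effective weight."""
    w = 1.0
    for f in factors:
        w *= f
    return w

print(effective_weight(1.1))       # (phrase)           -> 1.1
print(effective_weight(1.1, 1.1))  # ((phrase))         -> about 1.21
print(effective_weight(1.2, 0.5))  # ((phrase:1.2):0.5) -> 0.6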



14. Is (((weighted prompts:1.9))) useful here?

Probably not. For this workflow, it is more likely to hurt than help.

Avoid:

(((turns head and smiles:1.9)))

Avoid:

(((first blinks then smiles:1.9)))

Avoid:

(((action A then action B:1.9)))

Why? Because a huge action weight tells the model:

This action matters more than preserving the source image.

That can cause:

face drift
changed facial geometry
changed skin texture
changed lighting
hallucinated details
mouth/teeth weirdness
background changes
overcooked motion
loss of source fidelity

The current RapidBase workflow works because it is conservative. Heavy action weights fight that.


15. Better prompt-weight strategy

Do not heavily weight the action. Lightly weight preservation.

Better:

The same person first blinks once, then after a brief pause makes a tiny natural smile. (Preserve the exact same face and identity:1.15). Same hairstyle, clothing, lighting, colors, camera angle, and background. (Static camera:1.10). No zoom. No scene change.

Risky:

The same person (((first blinks then smiles:1.9))). Same face and background.

The first prompt says:

identity and camera are important
action is small
do not repaint

The second prompt says:

force this action even if the model has to invent

For face permanence, that is the wrong priority.


16. Suggested weight ranges

Weight - use:

1.00  - normal baseline
1.05  - tiny emphasis
1.10  - safe emphasis
1.15  - useful emphasis for identity / static camera
1.20  - upper normal test
1.25  - mild stress test
1.35  - risky; use sparingly
1.50+ - likely too strong
1.90  - avoid for source-faithful I2V

For this workflow, use mostly:

1.05-1.20

Maybe test:

1.25

Avoid:

1.50+
1.90
triple-parentheses action forcing

17. What to weight

Good things to weight

(same face and identity:1.10)
(preserve exact face:1.10)
(preserve source image:1.10)
(static camera:1.10)
(no scene change:1.05)
(same lighting and background:1.10)
(tiny natural smile:1.05)
(one subtle blink:1.05)

Risky things to weight

(turns head:1.4)
(speaks:1.4)
(laughs widely:1.4)
(raises hand:1.4)
(hair blowing:1.4)
(camera zooms in:1.4)

Very risky

(((wide smile:1.9)))
(((speaking:1.9)))
(((turning head:1.9)))
(((complex action sequence:1.9)))

Large facial motion and mouth motion are exactly where face permanence usually breaks.


18. Weighted A → B examples

Safe weighted A → B prompt

The same person from the source image first blinks once, then after a brief pause makes a tiny natural smile. (Preserve the exact same face and identity:1.15). Same hairstyle, clothing, lighting, colors, camera angle, and background. (Static camera:1.10). No zoom. No pan. No scene change.

Slightly stronger action prompt

The same person begins still, then (gently blinks once:1.05), then slowly forms a (tiny natural smile:1.10). Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No scene change.

Face-first prompt

(Preserve the exact same face and identity:1.15). The same person first blinks once, then slowly forms a tiny natural smile. Same hairstyle, clothing, lighting, colors, camera angle, and background. (Static camera:1.10). No zoom. No scene change.

Minimal weighted prompt

(Same face and identity:1.15). First one subtle blink, then a tiny smile. Same lighting and background. (Static camera:1.10).

19. What not to do

Avoid this:

(((The person first blinks, then smiles, then turns their head, then speaks:1.9)))

That stacks three problems:

too many actions
too much weight
weighting the part that causes identity drift

Also avoid:

First she blinks, then smiles, then speaks, then turns her head, while the camera zooms in and the lighting becomes cinematic.

That asks the model to solve:

facial motion
mouth motion
head rotation
camera motion
lighting change
identity preservation
background stability

That is too much for a source-faithful RapidBase clip.


20. Practical A → B workflow

Step 1 — choose the smallest version of the action

Instead of:

turns head and smiles

use:

tiny eye movement and tiny smile

Instead of:

speaks

use:

subtle mouth movement

Instead of:

laughs

use:

tiny natural smile

Step 2 — write a two-beat prompt

The same person first <action A>, then after a brief pause <action B>. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No scene change.

Step 3 — add light preservation weights

(Preserve the exact same face and identity:1.15)
(Static camera:1.10)

Step 4 — batch seeds

8-16 seeds
same prompt
same settings
short preview

Pick by:

1. face permanence
2. correct order
3. natural motion
4. prompt obedience
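
One hedged way to run the seed batch hands-off is ComfyUI's local HTTP API: export the workflow in API format, patch the sampler's seed, and POST each variant to the /prompt endpoint. The node id ("3") and the seed input name ("seed") below are placeholders that must be matched to the exported JSON; some sampler nodes use "noise_seed" instead.

import json
import urllib.request

# Load a workflow exported via "Save (API Format)" in ComfyUI.
with open("rapidbase_i2v_api.json", "r", encoding="utf-8") as f:
    workflow = json.load(f)

SAMPLER_NODE = "3"   # placeholder: id of the sampler node in the exported JSON
SEED_INPUT = "seed"  # placeholder: may be "noise_seed" depending on the node

for seed in range(1000, 1016):  # 16 seeds, same prompt, same settings
    workflow[SAMPLER_NODE]["inputs"][SEED_INPUT] = seed
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(
        "http://127.0.0.1:8188/prompt", data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print("queued seed", seed, resp.read().decode("utf-8"))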

Step 5 — split into two clips if order fails

If the model keeps blending A and B, use:

Clip 1:
  action A only

Handoff:
  clean stable frame near the end of Clip 1

Clip 2:
  action B only

21. Best practical recommendation

For the current RapidBase workflow:

Use one-clip two-beat prompts first.
Use "first A, then after a brief pause B."
Keep A and B very small.
Batch seeds.
Weight preservation, not action.
Avoid 1.9 weights.
Use handoff-frame two-clip generation when strict order matters.
Only test Prompt Relay / scheduling in a duplicate workflow.

22. Example final prompts

Blink → smile

The same person from the source image first blinks once, then after a brief pause makes a tiny natural smile. (Preserve the exact same face and identity:1.15). Same hairstyle, clothing, lighting, colors, camera angle, and background. (Static camera:1.10). No zoom. No pan. No scene change.

Look down → return gaze

The same person from the source image first looks slightly downward with only a tiny eye movement, then returns the eyes to the camera. (Preserve the exact same face and identity:1.15). Same hairstyle, clothing, lighting, colors, camera angle, and background. (Static camera:1.10). No zoom. No scene change.

Neutral → tiny smile

The video begins with the same person holding a calm neutral expression. Then the person slowly forms a tiny natural smile. (Preserve the exact same face and identity:1.15). Same hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No scene change.

Breathing → blink

The same person from the source image keeps the exact same face and identity. The person gently breathes with subtle natural motion, then blinks once after a brief pause. Same hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No scene change.
