arxiv:2606.25041

Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

Published on Jun 23 · Submitted by Lianghua Huang on Jun 25

Abstract

Wan-Streamer is a unified, end-to-end multimodal model that enables real-time audio-visual interaction through causal attention mechanisms and integrated processing of visual, audio, and text modalities.

We present Wan-Streamer, a native-streaming, end-to-end interactive foundation model designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. Wan-Streamer models language, audio, and video as both input and output within a single Transformer: the sequence interleaves visual, audio, and text input tokens with visual, audio, and text output tokens, coordinated by block-causal attention for incremental streaming. Unlike cascaded interactive systems, which chain separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer relies on no external language, speech, avatar, or video-generation modules. Perception, reasoning, generation, response timing, turn management, and cross-modal synchronization are learned jointly within one unified model, reducing pipeline latency and error accumulation. To support natural audio-visual responsiveness, we redesign the entire stack around streamability, including causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling, enabling streaming units as short as 160 ms at 25 fps. Wan-Streamer achieves approximately 200 ms model-side response latency and approximately 550 ms total interaction latency when combined with 350 ms bidirectional network latency, supporting sub-second duplex audio-visual communication. These results position Wan-Streamer as a unified, end-to-end, multimodal interactive foundation model for low-latency streaming interaction.
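The abstract names block-causal attention as the mechanism that makes incremental streaming possible but does not spell out the mask. The sketch below shows the generic form of such a mask, assuming that tokens are grouped into consecutive blocks (one block per streaming unit) and that a token may attend to every token in its own block and in all earlier blocks. The function name `block_causal_mask` and the block sizes are hypothetical; the paper's actual token layout and scheduling may differ.

```python
from typing import List


def block_causal_mask(block_sizes: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when query token i may attend to key token j.

    Assumption (not taken from the paper): attention is unrestricted inside a
    block and causal across blocks, so token i sees token j exactly when j's
    block index is less than or equal to i's block index.
    """
    # Map each token position to the index of the block that contains it.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


# Hypothetical layout: three streaming units holding 2, 3, and 2 tokens.
mask = block_causal_mask([2, 3, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each printed row is one query token, with `x` marking the positions it can see. Because no token ever attends to a later block, the outputs for a block can be emitted as soon as that block arrives, which is the property streaming relies on.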

Community

Paper submitter

Wan-Streamer v0.1 is a native-streaming, end-to-end model that listens, sees, thinks, speaks, and responds with video in real time, at 25 fps with ~200 ms model-side latency, all within a single Transformer.
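The timing figures quoted in the abstract can be cross-checked with simple arithmetic. The snippet below is only a sanity check on the reported numbers (25 fps, 160 ms streaming units, roughly 200 ms model-side latency, 350 ms bidirectional network latency); the variable names are illustrative and nothing here is measured.

```python
FPS = 25                  # video frame rate reported in the abstract
UNIT_MS = 160             # shortest streaming unit reported in the abstract
MODEL_LATENCY_MS = 200    # approximate model-side response latency
NETWORK_LATENCY_MS = 350  # bidirectional network latency assumed in the abstract

# Duration of one video frame at 25 fps.
frame_interval_ms = 1000 // FPS

# Number of video frames covered by one 160 ms streaming unit.
frames_per_unit = UNIT_MS // frame_interval_ms

# Total interaction latency is the model-side latency plus the network round trip.
total_latency_ms = MODEL_LATENCY_MS + NETWORK_LATENCY_MS

print(f"frame interval: {frame_interval_ms} ms")
print(f"frames per streaming unit: {frames_per_unit}")
print(f"total interaction latency: ~{total_latency_ms} ms")
```

A 160 ms unit therefore spans four frames, and the roughly 550 ms total matches the abstract's figure.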

[Figure: framework]

Comparison with other interactive models / systems:

[Figure: comparison with other interactive models / systems]

The two groups measure different things. The top group consists of full end-to-end interaction loops: they perceive the user and produce a response, and only Wan-Streamer also outputs video. The bottom group consists of avatar / audio-visual renderers timed at the rendering stage only: their latency excludes the external language model, ASR, and TTS they depend on, so their true user-visible latency is higher than shown. Wan-Streamer is the only end-to-end model that outputs synchronized audio + video, and it does so in under 0.6 s. Numbers are the closest publicly reported figures and mix measurement boundaries; see the paper for exact definitions.

[Figure: capability coverage across representative systems]

Capability coverage across representative systems. Full-duplex means the system keeps perceiving while it generates, so it understands and responds at the same time. Wan-Streamer is the only model that perceives video, outputs synchronized video, runs full-duplex, is end-to-end, and responds within a second; every other system covers only part of this. "~" marks partial support or a figure that is not publicly disclosed.
