arxiv:2606.25041

Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

Published on Jun 23

Submitted by Lianghua Huang on Jun 25

Wan-AI

Abstract

Wan-Streamer is a unified, end-to-end multimodal model that enables real-time audio-visual interaction through causal attention mechanisms and integrated processing of visual, audio, and text modalities.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

We present Wan-Streamer, a native-streaming, end-to-end interactive foundation model designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. Wan-Streamer models language, audio, and video as both input and output within a single Transformer: the sequence interleaves visual, audio, and text input tokens with visual, audio, and text output tokens, coordinated by block-causal attention for incremental streaming. Unlike cascaded interactive systems that chain separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer learns perception, reasoning, generation, response timing, turn management, and cross-modal synchronization jointly within one unified model, reducing pipeline latency and error accumulation. To support natural audio-visual responsiveness, we redesign the entire stack around streamability, including causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling, enabling streaming units as short as 160 ms at 25 fps. Wan-Streamer achieves approximately 200 ms model-side response latency and approximately 550 ms total interaction latency when combined with 350 ms bidirectional network latency, supporting sub-second duplex audio-visual communication. These results position Wan-Streamer as a unified, end-to-end, multimodal interactive foundation model for low-latency streaming interaction.
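The abstract names block-causal attention as the mechanism that makes incremental streaming possible. As a rough illustration of the general technique only (not the authors' implementation), the sketch below builds a block-causal mask in plain Python: tokens are grouped into consecutive blocks, and each token may attend to its own block and to all earlier blocks, but never to later ones. The function name `block_causal_mask` and the block sizes are hypothetical; the paper's actual token layout and block boundaries are not given here.

```python
def block_causal_mask(block_sizes):
    """Return mask[i][j] == True when query token i may attend to key token j.

    Tokens are grouped into consecutive blocks. A token sees every token in
    its own block and in all earlier blocks, but nothing from later blocks.
    """
    # Map each token position to the index of the block that contains it.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Hypothetical layout: three streaming units holding 2, 3, and 2 tokens.
    mask = block_causal_mask([2, 3, 2])
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))
```

Because no block ever looks at a later block, the keys and values of finished blocks can be cached and each new streaming unit processed as it arrives, which is the property a streaming model needs.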

Community

lhhuang

Paper submitter about 9 hours ago

Wan-Streamer v0.1 is a native-streaming, end-to-end model that listens, sees, thinks, speaks, and responds on video in real time — at 25 fps with ~200 ms model-side latency, all within a single Transformer.

lhhuang

Paper submitter about 9 hours ago (edited about 9 hours ago)

Comparison with other interactive models / systems:

The two groups measure different things. The top group consists of full end-to-end interaction loops: they perceive the user and produce a response, and only Wan-Streamer also outputs video. The bottom group consists of avatar / audio-visual renderers timed at the rendering stage only: their latency excludes the external language model, ASR, and TTS they depend on, so their true user-visible latency is higher than shown. Wan-Streamer is the only end-to-end model that outputs synchronized audio + video, and it does so in under 0.6 s. The numbers are the closest publicly reported figures and mix measurement boundaries; see the paper for exact definitions.
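The "under 0.6 s" figure can be checked against the numbers quoted in the abstract (about 200 ms model-side latency, 350 ms bidirectional network latency, 160 ms streaming units at 25 fps). The short calculation below is only a back-of-the-envelope sketch: the helper names are hypothetical, and it assumes the total is a simple sum of the model-side and network components, which is how the abstract presents it.

```python
def frames_per_unit(unit_ms: float, fps: float) -> float:
    """Number of video frames covered by one streaming unit."""
    return unit_ms * fps / 1000.0


def total_latency_ms(model_ms: float, network_ms: float) -> float:
    """Total interaction latency, assuming the two components simply add."""
    return model_ms + network_ms


if __name__ == "__main__":
    # Figures taken from the abstract; all are approximate.
    unit_frames = frames_per_unit(unit_ms=160, fps=25)
    total = total_latency_ms(model_ms=200, network_ms=350)
    print(f"{unit_frames:g} frames per streaming unit")
    print(f"{total:g} ms total interaction latency")
    print("under 0.6 s:", total < 600)
```

With these inputs, one streaming unit spans four frames and the total comes to 550 ms, consistent with the sub-second claim.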

Capability coverage across representative systems. Full-duplex means the system keeps perceiving while it generates, so it understands and responds at the same time. Wan-Streamer is the only model that perceives video, outputs synchronized video, runs full-duplex, is end-to-end, and responds within a second; every other system covers only part of this. "~" marks partial support or a figure that is not publicly disclosed.