Towards Real-Time Diffusion-Based Streaming Video Super-Resolution
Junhao Zhuang1, 2 · Shi Guo2 · Xin Cai3 · Xiaohui Li4 · Yihao Liu2 · Chun Yuan1 · Tianfan Xue2, 3
1Tsinghua University · 2Shanghai Artificial Intelligence Laboratory · 3The Chinese University of Hong Kong · 4Shanghai Jiao Tong University
† Corresponding author    ‡ Project Lead
Abstract
Teaser figure for FlashVSR

Diffusion models have recently advanced video restoration, but applying them to real-world video super-resolution (VSR) remains challenging due to high latency, prohibitive computation, and poor generalization to ultra-high resolutions. Our goal in this work is to make diffusion-based VSR practical by achieving efficiency, scalability, and real-time performance. To this end, we propose FlashVSR, the first diffusion-based one-step streaming framework towards real-time VSR. FlashVSR runs at ∼17 FPS for 768 × 1408 videos on a single A100 GPU by combining three complementary innovations: (i) a train-friendly three-stage distillation pipeline that enables streaming super-resolution, (ii) locality-constrained sparse attention that cuts redundant computation while bridging the train–test resolution gap, and (iii) a tiny conditional decoder that accelerates reconstruction without sacrificing quality. To support large-scale training, we also construct VSR-120K, a new dataset with 120k videos and 180k images. Extensive experiments show that FlashVSR scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to ∼12× speedup over prior one-step diffusion VSR models. We will release code, models, and the dataset to foster future research in efficient diffusion-based VSR.

Method

We distill a full-attention teacher into a sparse-causal, one-step student for streaming VSR. The student runs causally with a KV cache and uses locality-constrained sparse attention to cut redundant computation and close the train–inference resolution gap. A Tiny Conditional Decoder (TC Decoder) conditions on the LR frames together with the latents to reconstruct HR frames efficiently.

Three-stage distillation to sparse-causal one-step VSR
Figure: Three-stage distillation → Sparse-causal one-step VSR with streaming inference.
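As an illustration of the streaming formulation, the sketch below runs a one-step student over LR frames with a short rolling KV cache and decodes each latent together with its LR input. The function name super_resolve_stream, the student and decoder callables, and the cache length are illustrative assumptions, not the released implementation.

import torch

def super_resolve_stream(lr_frames, student, decoder, cache_len=4):
    # lr_frames: (T, C, h, w) low-resolution frames, processed causally one chunk at a time.
    # `student` stands in for the distilled one-step model, `decoder` for the TC Decoder.
    kv_cache = []                                   # rolling cache of past latents (keys/values)
    outputs = []
    for chunk in lr_frames.split(1, dim=0):         # stream frame by frame
        ctx = torch.cat(kv_cache, dim=0) if kv_cache else chunk.new_zeros(0, *chunk.shape[1:])
        latent = student(chunk, ctx)                # single denoising step, attends only to cached past
        outputs.append(decoder(latent, chunk))      # TC Decoder conditions on latent + LR frame
        kv_cache.append(latent)
        if len(kv_cache) > cache_len:               # bound memory so latency stays flat on long videos
            kv_cache.pop(0)
    return torch.cat(outputs, dim=0)

# toy usage with placeholder callables
student = lambda chunk, ctx: chunk                                            # identity stand-in
decoder = lambda latent, lr: torch.nn.functional.interpolate(lr, scale_factor=4)
hr = super_resolve_stream(torch.rand(8, 3, 64, 64), student, decoder)         # -> (8, 3, 256, 256)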

Locality-Constrained Sparse Attention

Models trained on medium-resolution data often fail to generalize to ultra-high resolutions, exhibiting repeated textures and blurring. This stems from the periodicity of positional encodings: when the positional range at inference exceeds the range seen during training, the encodings repeat and textures duplicate across the spatial dimensions. We address this by constraining each query to attend within a local spatial window, aligning the positional range between training and inference. On top of this locality constraint, we apply sparse attention that concentrates computation on the top-k most relevant regions rather than the full feature space, improving both efficiency and visual consistency at high resolutions.

Local window mask with block selection
Figure: Local windows align positional ranges between training and inference to avoid artifacts.
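The sketch below shows the locality constraint as a dense attention mask for clarity: each query only sees keys whose spatial offset stays within a window, so relative positions never exceed the range seen in training. The actual FlashVSR kernel is sparse and block-wise, additionally keeping only the top-k most relevant blocks inside the window; the window size here is an arbitrary example.

import torch

def local_window_mask(h, w, window):
    # (h*w, h*w) boolean mask: True where the key lies inside a `window`-sized
    # neighbourhood of the query on the 2D feature grid.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1)        # (h*w, 2) grid coordinates
    diff = (pos[:, None, :] - pos[None, :, :]).abs()               # pairwise |dy|, |dx|
    return (diff <= window // 2).all(dim=-1)

def local_attention(q, k, v, h, w, window=16):
    # q, k, v: (heads, h*w, d). Keys outside the local window are masked out,
    # which keeps relative positional ranges within what training has seen.
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~local_window_mask(h, w, window), float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

out = local_attention(torch.rand(4, 32 * 32, 64), torch.rand(4, 32 * 32, 64),
                      torch.rand(4, 32 * 32, 64), h=32, w=32)       # -> (4, 1024, 64)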

Tiny Conditional Decoder (TC Decoder)

Decoding dominates runtime in VAE-based pipelines. The TC Decoder conditions on the LR frames in addition to the latents, which simplifies HR reconstruction and yields a ~7× faster decode than the original VAE decoder while maintaining near-indistinguishable visual quality.

TC Decoder training and conditioning overview
Figure: Training pipeline of the TC Decoder.
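A toy version of the conditional decoding idea: upsample the LR frame as a base prediction, fuse it with the spatially upsampled latent, and let a few light convolutions predict the residual detail. Layer widths, the latent channel count, the 4× scale, and the residual formulation are illustrative assumptions rather than the paper's exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyConditionalDecoder(nn.Module):
    def __init__(self, latent_dim=16, width=64, scale=4):
        super().__init__()
        self.scale = scale
        self.fuse = nn.Sequential(                       # small conv stack instead of a full VAE decoder
            nn.Conv2d(latent_dim + 3, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, 3, 3, padding=1),
        )

    def forward(self, latent, lr_frame):
        # latent: (B, latent_dim, h, w) from the one-step student; lr_frame: (B, 3, H/scale, W/scale)
        base = F.interpolate(lr_frame, scale_factor=self.scale, mode="bicubic", align_corners=False)
        lat_up = F.interpolate(latent, size=base.shape[-2:], mode="nearest")
        return base + self.fuse(torch.cat([lat_up, base], dim=1))   # predict residual over upsampled LR

dec = TinyConditionalDecoder()
hr = dec(torch.rand(1, 16, 32, 32), torch.rand(1, 3, 64, 64))        # -> (1, 3, 256, 256)
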
VSR-120K Dataset

We construct VSR-120K, a large-scale dataset designed for high-quality video super-resolution training. The dataset contains around 120k video clips (average >350 frames) and 180k high-resolution images collected from open platforms. To ensure data quality, we filter all samples using LAION-Aesthetic and MUSIQ predictors for visual quality assessment, and apply RAFT to remove segments with insufficient motion. Only videos with resolutions higher than 1080p and adequate temporal dynamics are retained. After multi-stage filtering, we obtain a clean, diverse corpus suitable for large-scale joint image–video super-resolution training.
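For concreteness, here is a sketch of the multi-stage filtering logic applied to precomputed per-clip statistics. The quality and motion thresholds are illustrative placeholders (the exact cutoffs are not stated here); only the >1080p resolution criterion comes from the text.

from dataclasses import dataclass

@dataclass
class ClipMeta:
    height: int        # pixel height of the source clip
    aesthetic: float   # precomputed LAION-Aesthetic score
    musiq: float       # precomputed MUSIQ score
    mean_flow: float   # precomputed mean RAFT optical-flow magnitude

def keep(m: ClipMeta, min_height=1080, min_aes=4.5, min_musiq=40.0, min_flow=1.0) -> bool:
    # quality and motion thresholds are illustrative, not the paper's exact values
    if m.height <= min_height:
        return False                                   # resolution filter: >1080p sources only
    if m.aesthetic < min_aes or m.musiq < min_musiq:
        return False                                   # visual-quality filters
    return m.mean_flow >= min_flow                     # RAFT motion filter: drop near-static clips

clips = [ClipMeta(2160, 5.2, 55.0, 3.1), ClipMeta(720, 6.0, 60.0, 4.0)]
kept = [c for c in clips if keep(c)]                   # keeps only the first clip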

Real-World Video Super-Resolution

Drag the vertical bar: left = LR, right = Ours.

AIGC Video Enhancement
Long Video Super-Resolution
Ours vs. Others

Switch methods via tabs; each page shows Clips 001–004 (left: baseline, right: Ours).

Ablations

Two key ablations: (1) global sparse attention vs. locality-constrained sparse attention; (2) WanVAE decoder vs. TC Decoder (~7× speedup).

BibTeX
@inproceedings{zhuang2025flashvsr,
  title = {FLASHVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution},
  author = {Zhuang, Junhao and Guo, Shi and Cai, Xin and Li, Xiaohui and Liu, Yihao and Yuan, Chun and Xue, Tianfan},
  booktitle = {},
  year = {2026},
  note = {Under review},
  url = {https://example.com/flashvsr}
}