FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution
1Tsinghua University · 2Shanghai Artificial Intelligence Laboratory · 3The Chinese University of Hong Kong · 4Shanghai Jiao Tong University
† Corresponding author
TL;DR — FlashVSR is a streaming, one-step diffusion-based video super-resolution framework with block-sparse attention and a Tiny Conditional Decoder. It reaches ~17 FPS at 768×1408 on a single A100 GPU. A Locality-Constrained Attention design further improves generalization and perceptual quality on ultra-high-resolution videos.
Abstract
Teaser figure for FlashVSR

Diffusion models have recently advanced video restoration, but applying them to real-world video super-resolution (VSR) remains challenging due to high latency, prohibitive computation, and poor generalization to ultra-high resolutions. Our goal in this work is to make diffusion-based VSR practical by achieving efficiency, scalability, and real-time performance. To this end, we propose FlashVSR, the first diffusion-based one-step streaming framework towards real-time VSR. FlashVSR runs at ∼17 FPS for 768 × 1408 videos on a single A100 GPU by combining three complementary innovations: (i) a train-friendly three-stage distillation pipeline that enables streaming super-resolution, (ii) locality-constrained sparse attention that cuts redundant computation while bridging the train–test resolution gap, and (iii) a tiny conditional decoder that accelerates reconstruction without sacrificing quality. To support large-scale training, we also construct VSR-120K, a new dataset with 120k videos and 180k images. Extensive experiments show that FlashVSR scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to ∼12× speedup over prior one-step diffusion VSR models. We will release code, models, and the dataset to foster future research in efficient diffusion-based VSR.

Method

We distill a full-attention teacher into a sparse-causal, one-step student for streaming VSR. The model runs causally with a KV cache and uses locality-constrained sparse attention to cut redundant computation and close the train–inference resolution gap. A Tiny Conditional Decoder (TC Decoder) combines the LR frames with the latents to reconstruct HR frames efficiently.

Three-stage distillation to sparse-causal one-step VSR
Figure: Three-stage distillation → Sparse-causal one-step VSR with streaming inference.
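
For intuition, the sketch below shows how a causal, one-step streaming loop with a rolling KV cache could be organized; the module interfaces (dit, tc_decoder), the chunking scheme, and the cache size are illustrative assumptions, not the released implementation.

import torch

@torch.no_grad()
def stream_vsr(lr_frames, dit, tc_decoder, chunk_size=4, max_cached_chunks=3):
    """Illustrative streaming loop: process LR frames chunk by chunk with a
    rolling KV cache so each chunk only attends to a bounded recent history.
    `dit`, `tc_decoder`, and the cache interface are hypothetical stand-ins."""
    kv_cache = []   # keys/values produced by previous chunks (causal context)
    outputs = []
    for start in range(0, lr_frames.shape[0], chunk_size):
        lr_chunk = lr_frames[start:start + chunk_size]            # (T, C, H, W)
        # One-step generation: a single denoising pass conditioned on the LR
        # chunk and on cached keys/values from earlier chunks.
        latents, new_kv = dit(lr_chunk, kv_cache=kv_cache)
        # Bound the cache so memory stays constant for arbitrarily long videos.
        kv_cache = (kv_cache + [new_kv])[-max_cached_chunks:]
        # Tiny Conditional Decoder: decode latents together with the LR frames.
        outputs.append(tc_decoder(latents, lr_chunk))
    return torch.cat(outputs, dim=0)

Bounding the cache keeps memory flat regardless of video length, which is what makes the streaming setting practical for long inputs.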

Locality-Constrained Sparse Attention

When models trained on medium-resolution data are applied to ultra-high-resolution inputs, they often exhibit repeated textures and blurring. This degradation stems from the periodicity of positional encodings: in RoPE, for instance, when inference spans exceed the positional range encountered during training, the rotational angles wrap around periodically, causing aliasing and pattern repetition along specific spatial dimensions. To mitigate this, we constrain each query to attend only within a local spatial window, ensuring that the positional range during inference remains consistent with that seen in training. Building on this locality constraint, we further adopt sparse attention that concentrates computation on the top-k most relevant regions rather than spreading it across the entire spatial extent, significantly improving computational efficiency and perceptual quality at high resolutions.

Local window mask with block selection
Figure: Local windows align positional ranges between training and inference to avoid artifacts.
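
To make the mechanism concrete, here is a minimal sketch of how a block-level mask combining the locality constraint with top-k block selection could be built; the pooled block features, window radius, and k are illustrative assumptions, not values from the paper.

import torch

def locality_constrained_block_mask(q_blocks, k_blocks, block_coords, window=2, topk=8):
    """Illustrative block-level mask. q_blocks / k_blocks: (N, d) pooled block
    features; block_coords: (N, 2) spatial (row, col) index of each block.
    Returns a boolean (N, N) mask of which key blocks each query block attends to."""
    # Locality constraint: only key blocks within a local spatial window are
    # eligible, so relative positions at inference stay inside the range seen
    # during training (no RoPE wrap-around).
    dr = (block_coords[:, None, 0] - block_coords[None, :, 0]).abs()
    dc = (block_coords[:, None, 1] - block_coords[None, :, 1]).abs()
    local = (dr <= window) & (dc <= window)                       # (N, N)

    # Block-level relevance scores; non-local blocks are excluded before top-k.
    scores = q_blocks @ k_blocks.T                                # (N, N)
    scores = scores.masked_fill(~local, float("-inf"))

    # Keep only the top-k most relevant key blocks for each query block.
    k = min(topk, scores.shape[1])
    idx = scores.topk(k, dim=1).indices
    mask = torch.zeros_like(local)
    mask.scatter_(1, idx, True)
    return mask & local   # drop any -inf picks when fewer than k blocks are local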

Tiny Conditional Decoder (TC Decoder)

Compared with the original WanVAE decoder, the proposed Tiny Conditional (TC) Decoder takes not only the latent representations but also the corresponding low-resolution (LR) frames as additional conditioning inputs. By leveraging these LR signals, the TC Decoder greatly simplifies high-resolution (HR) reconstruction, achieving a 7× speedup in decoding while producing visual quality virtually indistinguishable from that of the original VAE decoder.

TC Decoder training and conditioning overview
Figure: Training pipeline of the TC Decoder.
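
A minimal sketch of the conditioning idea, assuming a plain convolutional decoder: the latent is decoded together with the upsampled LR frame, and the network predicts a residual so it only has to add the missing high-frequency detail. Layer widths, the spatial scale factor, and the fusion scheme are illustrative, not the released TC Decoder architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyConditionalDecoder(nn.Module):
    """Illustrative tiny decoder that fuses the latent with the upsampled LR frame."""
    def __init__(self, latent_dim=16, width=64, out_ch=3, scale=8):
        super().__init__()
        self.scale = scale
        self.fuse = nn.Conv2d(latent_dim + out_ch, width, 3, padding=1)
        self.body = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(width, width, 3, padding=1), nn.SiLU())
            for _ in range(4)
        ])
        self.to_rgb = nn.Conv2d(width, out_ch, 3, padding=1)

    def forward(self, latent, lr_frame):
        # Bring both inputs to HR resolution before fusing them.
        hr_size = (latent.shape[-2] * self.scale, latent.shape[-1] * self.scale)
        z = F.interpolate(latent, size=hr_size, mode="nearest")
        lr_up = F.interpolate(lr_frame, size=hr_size, mode="bilinear", align_corners=False)
        # Predict a residual on top of the upsampled LR frame, so the decoder
        # only has to synthesize the missing high-frequency detail.
        h = self.body(self.fuse(torch.cat([z, lr_up], dim=1)))
        return lr_up + self.to_rgb(h)
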
VSR-120K Dataset

We construct VSR-120K, a large-scale dataset designed for high-quality video super-resolution training. The dataset contains around 120k video clips (average >350 frames) and 180k high-resolution images collected from open platforms. To ensure data quality, we filter all samples using LAION-Aesthetic and MUSIQ predictors for visual quality assessment, and apply RAFT to remove segments with insufficient motion. Only videos with resolutions higher than 1080p and adequate temporal dynamics are retained. After multi-stage filtering, we obtain a clean, diverse corpus suitable for large-scale joint image–video super-resolution training.
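
For readers curious about the curation logic, a hypothetical per-clip filtering predicate is sketched below; the scorer callables and all thresholds are placeholders, not the values actually used for VSR-120K.

def keep_clip(clip, aesthetic_score, musiq_score, raft_mean_flow,
              min_aesthetic=4.5, min_musiq=40.0, min_flow=1.0, min_height=1080):
    """Hypothetical per-clip filter for VSR-120K-style curation.
    All thresholds and scorer callables are illustrative placeholders."""
    if clip.height <= min_height:              # keep only sources above 1080p
        return False
    if aesthetic_score(clip) < min_aesthetic:  # LAION-Aesthetic quality gate
        return False
    if musiq_score(clip) < min_musiq:          # MUSIQ perceptual quality gate
        return False
    if raft_mean_flow(clip) < min_flow:        # drop near-static segments (RAFT optical flow)
        return False
    return True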

Release note: We are currently curating VSR-120K and will open-source the dataset in a future release.

Real-World Video Super-Resolution

Drag the vertical bar: left = LR, right = Ours.

AIGC Video Enhancement
Long Video Super-Resolution
More Results

View detailed comparisons with other methods and ablation studies

🔍 View Comparisons → 🔬 View Ablations →
BibTeX
@misc{zhuang2025flashvsrrealtimediffusionbasedstreaming,
      title={FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution}, 
      author={Junhao Zhuang and Shi Guo and Xin Cai and Xiaohui Li and Yihao Liu and Chun Yuan and Tianfan Xue},
      year={2025},
      eprint={2510.12747},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.12747}, 
}