We distill a full-attention teacher into a sparse-causal, one-step student for streaming VSR. The model runs causally with a KV cache and uses locality-constrained sparse attention to cut redundant computation and close the train–inference resolution gap. A Tiny Conditional Decoder (TC Decoder) leverages LR frames + latents to reconstruct HR frames efficiently.
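A minimal sketch of the resulting streaming loop under these assumptions (the class and method names `SparseCausalStudent`-style `student`, `tc_decoder`, `new_kv_cache`, and `one_step` are illustrative, not the released FlashVSR API):

```python
# Hypothetical sketch of the streaming, one-step inference loop described above.
# Object/method names (student, tc_decoder, new_kv_cache, one_step) are
# illustrative placeholders, not the released API.
import torch

@torch.no_grad()
def stream_vsr(lr_frames, student, tc_decoder, chunk_size=4):
    """Upscale a long LR video causally, one chunk at a time.

    lr_frames:  (T, C, H, W) low-resolution frames.
    student:    one-step diffusion student with causal sparse attention;
                reuses a KV cache so past chunks are never recomputed.
    tc_decoder: tiny conditional decoder mapping (latents, LR frames) -> HR frames.
    """
    kv_cache = student.new_kv_cache()          # holds keys/values of past chunks
    hr_chunks = []
    for start in range(0, lr_frames.shape[0], chunk_size):
        lr_chunk = lr_frames[start:start + chunk_size]
        # Single denoising step conditioned on the LR chunk, attending only to
        # the KV cache (causal) and a local spatial window (sparse).
        latents = student.one_step(lr_chunk, kv_cache=kv_cache)
        # TC Decoder reconstructs HR pixels from latents + LR conditioning.
        hr_chunks.append(tc_decoder(latents, lr_chunk))
    return torch.cat(hr_chunks, dim=0)
```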
Models trained on medium-resolution data often fail to generalize to ultra-high resolutions, exhibiting repeated textures and blurring. This stems from the periodicity of positional encodings: at inference, the positional range exceeds what was observed during training, causing patterns to repeat across spatial dimensions. We address this by constraining each query to attend only within a local spatial window, which keeps the positional range at inference aligned with that seen during training. On top of this locality constraint, we apply sparse attention that concentrates computation on the top-k most relevant regions rather than the full feature map, improving both efficiency and visual consistency at high resolutions.
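A minimal single-head sketch of this locality-constrained top-k attention (the window size, k, and the dense score computation are illustrative; an efficient implementation would use block-sparse kernels rather than materializing the full N×N score matrix):

```python
import torch
import torch.nn.functional as F

def local_topk_attention(q, k, v, coords, window=8, topk=64):
    """Single-head sketch. q, k, v: (N, d) tokens from an H x W feature map;
    coords: (N, 2) integer (row, col) position of each token."""
    d = q.shape[-1]
    scores = (q @ k.t()) / d ** 0.5                            # (N, N)
    # Locality constraint: only keys within a (2*window+1)^2 spatial
    # neighborhood are visible, so relative positions at inference stay
    # inside the range seen during training.
    offset = (coords[:, None, :] - coords[None, :, :]).abs()   # (N, N, 2)
    local = (offset <= window).all(dim=-1)                     # (N, N) bool
    scores = scores.masked_fill(~local, float("-inf"))
    # Sparsity: keep only the top-k highest-scoring local keys per query.
    # Each query always sees itself (offset 0), so at least one score is
    # finite and the softmax never produces NaNs.
    topk = min(topk, scores.shape[-1])
    top_scores, top_idx = scores.topk(topk, dim=-1)            # (N, topk)
    attn = F.softmax(top_scores, dim=-1)                       # -inf entries -> 0
    return torch.einsum("nk,nkd->nd", attn, v[top_idx])        # weighted gather
```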
Decoding dominates runtime in VAE-based pipelines. The TC Decoder conditions on LR frames and latents, which simplifies HR reconstruction and makes decoding ~7× faster than the original VAE decoder while keeping visual quality nearly indistinguishable.
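An illustrative sketch of such a conditional decoder (the channel widths, the 8× latent-to-pixel factor, and the residual-over-upsampled-LR design are assumptions for the example, not the released architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyConditionalDecoder(nn.Module):
    """Maps (latent, LR frame) -> HR frame with a handful of conv layers."""
    def __init__(self, latent_ch=16, lr_ch=3, width=64, up=8):
        super().__init__()
        self.fuse = nn.Conv2d(latent_ch + lr_ch, width, 3, padding=1)
        self.body = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
        )
        self.to_rgb = nn.Conv2d(width, 3 * up * up, 3, padding=1)
        self.shuffle = nn.PixelShuffle(up)   # latent grid -> HR grid

    def forward(self, latent, lr_frame):
        # Fuse the latent with the LR frame resampled onto the latent grid.
        lr_cond = F.interpolate(lr_frame, size=latent.shape[-2:],
                                mode="bilinear", align_corners=False)
        x = self.body(self.fuse(torch.cat([latent, lr_cond], dim=1)))
        detail = self.shuffle(self.to_rgb(x))
        # Predict a residual over the upsampled LR frame; the LR conditioning
        # carries most low-frequency content, which keeps the decoder tiny.
        base = F.interpolate(lr_frame, size=detail.shape[-2:],
                             mode="bicubic", align_corners=False)
        return base + detail
```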
We construct VSR-120K, a large-scale dataset designed for high-quality video super-resolution training. The dataset contains around 120k video clips (average >350 frames) and 180k high-resolution images collected from open platforms. To ensure data quality, we filter all samples using LAION-Aesthetic and MUSIQ predictors for visual quality assessment, and apply RAFT to remove segments with insufficient motion. Only videos with resolutions higher than 1080p and adequate temporal dynamics are retained. After multi-stage filtering, we obtain a clean, diverse corpus suitable for large-scale joint image–video super-resolution training.
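A hedged sketch of what one pass of this filtering could look like (the scorer callables, the `clip` interface, and all thresholds are placeholders, not the exact predictors or cutoffs used to build VSR-120K):

```python
from statistics import mean

# Placeholder sketch of the multi-stage filtering described above.
def keep_clip(clip, aesthetic_fn, musiq_fn, raft_flow_fn,
              min_height=1080, aesthetic_min=5.0, musiq_min=45.0, flow_min=1.0):
    # Resolution gate (threshold is an assumption): drop clips below ~1080p.
    if clip.height < min_height:
        return False
    frames = clip.sample_frames(8)                       # sparse sample for scoring
    # Visual-quality gates: LAION-Aesthetic and MUSIQ predictors.
    if mean(aesthetic_fn(f) for f in frames) < aesthetic_min:
        return False
    if mean(musiq_fn(f) for f in frames) < musiq_min:
        return False
    # Motion gate: mean RAFT optical-flow magnitude; drop near-static segments.
    if raft_flow_fn(frames) < flow_min:
        return False
    return True
```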
Interactive comparisons: a draggable slider contrasts the LR input (left) with our result (right); per-method tabs show Clips 001–004 with the baseline on the left and Ours on the right.
Two key ablations: (1) global sparse attention vs. locality-constrained sparse attention; (2) WanVAE decoder vs. TC Decoder (~7× faster decoding).
@misc{zhuang2025flashvsr,
  title  = {{FlashVSR}: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution},
  author = {Zhuang, Junhao and Guo, Shi and Cai, Xin and Li, Xiaohui and Liu, Yihao and Yuan, Chun and Xue, Tianfan},
  year   = {2026},
  note   = {Under review},
  url    = {https://example.com/flashvsr}
}