ViBT: Vision Bridge Transformer at Scale
Zhenxiong Tan, Zeqing Wang, Xingyi Yang, Songhua Liu, Xinchao Wang
xML Lab, National University of Singapore; The Hong Kong Polytechnic University; Shanghai Jiao Tong University
- Bridge formulation: Data-to-data trajectories between inputs and outputs instead of noise-to-data diffusion.
- Scaled transformers: 20B and 1.3B parameter ViBT variants for image/video translation.
- Stabilized training: Variance-stabilized velocity-matching objective for robust large-model optimization.
- Fast inference: Removing conditional tokens yields up to 4× faster inference than token-heavy baselines.
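The bridge formulation above can be illustrated with a minimal training sketch: a linear data-to-data path from input `x0` to output `x1` with a velocity-matching target `x1 - x0`. This is an assumption-laden toy, not the released training code; `BridgeNet` is a hypothetical stand-in for the ViBT transformer, and the variance-stabilization details are omitted.

```python
import torch
import torch.nn as nn

class BridgeNet(nn.Module):
    """Hypothetical stand-in for the ViBT transformer: predicts velocity."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim)
        )

    def forward(self, x_t, t):
        # Concatenate the scalar time onto each sample.
        return self.net(torch.cat([x_t, t], dim=-1))

def bridge_velocity_loss(model, x0, x1):
    """Velocity matching on a linear bridge between input x0 and output x1."""
    t = torch.rand(x0.shape[0], 1)       # sample a time in (0, 1) per example
    x_t = (1 - t) * x0 + t * x1          # point on the data-to-data trajectory
    v_target = x1 - x0                   # constant velocity of the linear bridge
    v_pred = model(x_t, t)
    return ((v_pred - v_target) ** 2).mean()

# Toy usage on random latents.
model = BridgeNet(dim=8)
x0 = torch.randn(4, 8)   # "input" data (e.g. source image latents)
x1 = torch.randn(4, 8)   # "output" data (e.g. edited image latents)
loss = bridge_velocity_loss(model, x0, x1)
```

Because the trajectory starts at the input data rather than noise, the condition never has to be attached as extra tokens, which is where the inference speedup comes from.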
- Environment:

```shell
conda create -n ViBT python=3.12
conda activate ViBT
```

- Install requirements:

```shell
pip install -e .
```

- Image instruction-based editing and stylization: `examples/image_stylization.ipynb`
- Video stylization: `examples/video_stylization.ipynb`
- Video colorization: `examples/video_colorization.ipynb`
- Video frame interpolation: `examples/video_frame_interpolation.ipynb`
We provide separate models for image and video tasks.
- Image tasks (stylization, editing) are built on Qwen-Image-Editing.
- Video tasks (stylization, depth-to-video, colorization, frame interpolation) are built on Wan2.1 1.3B.
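Since the bridge runs between data points rather than from noise to data, inference amounts to integrating the learned velocity field from the conditioning input at t=0 to the output at t=1. A minimal Euler sketch, assuming the linear bridge `x_t = (1 - t) * x0 + t * x1` and a hypothetical `velocity_model(x_t, t)` callable (not the released inference code):

```python
import torch

def bridge_sample(velocity_model, x0, steps=16):
    """Integrate dx/dt = v(x, t) from the input x0 (t=0) to the output (t=1)."""
    x = x0.clone()
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0], 1), i * dt)
        x = x + dt * velocity_model(x, t)   # explicit Euler step
    return x

# Toy usage: with the exact linear-bridge velocity v = x1 - x0,
# Euler integration recovers x1 regardless of step count.
x0 = torch.zeros(2, 4)
x1 = torch.ones(2, 4)
out = bridge_sample(lambda x, t: x1 - x0, x0)
```

In the real models, `x0` would be the VAE latents of the input image or video and the velocity field would be the trained transformer.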
Training code is under development; we will add full instructions once it is released.
```bibtex
@article{tan2025vision,
  title={Vision Bridge Transformer at Scale},
  author={Tan, Zhenxiong and Wang, Zeqing and Yang, Xingyi and Liu, Songhua and Wang, Xinchao},
  journal={arXiv preprint arXiv:2511.23199},
  year={2025}
}
```
