Yuanshi9815/ViBT

ViBT


Project Page arXiv HuggingFace Demo HuggingFace Model

ViBT: Vision Bridge Transformer at Scale
Zhenxiong Tan, Zeqing Wang, Xingyi Yang, Songhua Liu, Xinchao Wang
xML Lab, National University of Singapore; The Hong Kong Polytechnic University; Shanghai Jiao Tong University

Features

  • Bridge formulation: Data-to-data trajectories between inputs and outputs instead of noise-to-data diffusion.
  • Scaled transformers: 20B and 1.3B parameter ViBT variants for image/video translation.
  • Stabilized training: Variance-stabilized velocity-matching objective for robust large-model optimization.
  • Fast inference: Removal of conditional tokens yields up to 4× faster runs versus token-heavy baselines.
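To make the bridge idea in the first bullet concrete, here is a minimal sketch of a data-to-data trajectory, assuming a standard Brownian bridge between a source sample and a target sample. The function names `bridge_sample` and `bridge_velocity` and the `sigma` parameter are hypothetical and do not come from this repository; the variance-stabilized objective used by ViBT is not reproduced here.

```python
import math
import random


def bridge_sample(x0, x1, t, sigma=1.0, rng=random):
    """Sample x_t on a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    Assumption: a textbook Brownian bridge, not necessarily the exact
    parameterization used by ViBT.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    # Noise vanishes at both endpoints, so the path starts and ends on data.
    std = sigma * math.sqrt(t * (1.0 - t))
    return [
        (1.0 - t) * a + t * b + std * rng.gauss(0.0, 1.0)
        for a, b in zip(x0, x1)
    ]


def bridge_velocity(x_t, x1, t):
    """Drift that carries x_t to the target x1 by t=1: (x1 - x_t) / (1 - t)."""
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1)")
    return [(b - a) / (1.0 - t) for a, b in zip(x_t, x1)]


if __name__ == "__main__":
    source = [0.0, 2.0]   # stands in for an input image/video latent
    target = [1.0, 4.0]   # stands in for the desired output latent
    x_t = bridge_sample(source, target, t=0.3)
    print(x_t, bridge_velocity(x_t, target, t=0.3))
```

In a velocity-matching setup of this kind, a network would be trained to predict the output of `bridge_velocity` from `x_t` and `t`. Because the trajectory starts at the input itself rather than at noise, no separate conditioning tokens are needed in this sketch.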

Quick Start

Setup (Optional)

  1. Environment
conda create -n ViBT python=3.12
conda activate ViBT
  2. Install requirements
pip install -e .

Examples

  • Image instruction-based editing and stylization: examples/image_stylization.ipynb
  • Video stylization: examples/video_stylization.ipynb
  • Video colorization: examples/video_colorization.ipynb
  • Video frame interpolation: examples/video_frame_interpolation.ipynb

Models and Training

Image and video tasks use separate models.

  • Image tasks (stylization, editing) are trained on Qwen-Image-Editing.
  • Video tasks (stylization, depth-to-video, colorization, frame interpolation) are trained on Wan2.1 1.3B.

Training code is under development; we will add full instructions once released.

BibTeX

@article{tan2025vision,
  title={Vision Bridge Transformer at Scale},
  author={Tan, Zhenxiong and Wang, Zeqing and Yang, Xingyi and Liu, Songhua and Wang, Xinchao},
  journal={arXiv preprint arXiv:2511.23199},
  year={2025}
}
