wzzheng/StreamVGGT

Streaming 4D Visual Geometry Transformer

Dong Zhuo*, Wenzhao Zheng*$\dagger$, Jiahe Guo, Yuqi Wu, Jie Zhou, Jiwen Lu

* Equal contribution. $\dagger$ Project leader.

StreamVGGT is a causal transformer architecture for real-time streaming 4D visual geometry perception. It is compatible with LLM-targeted attention mechanisms (e.g., FlashAttention) and delivers both fast inference and high-quality 4D reconstruction.

News

  • [2025/7/18] Demo and checkpoints released on Hugging Face; demo code is available for local launch.
  • [2025/7/15] Paper released on arXiv.
  • [2025/7/14] Code for fine-tuning VGGT released.
  • [2025/7/13] Check out Point3R for another streaming 3D reconstruction work of ours!
  • [2025/7/13] Distillation code for VGGT is released.
  • [2025/7/13] Inference code with FlashAttention-2 is released.
  • [2025/7/13] Training/evaluation code release.

Overview

Offline models must reprocess the entire sequence and reconstruct the entire scene each time a new image arrives. In contrast, given a sequence of images, StreamVGGT employs temporal causal attention and leverages cached memory tokens to support efficient, incremental on-the-fly reconstruction, enabling interactive and real-time online applications.
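The idea of causal attention with a cache can be illustrated with a toy example. The sketch below is not the StreamVGGT implementation: it assumes one token per frame, uses the token itself as query, key, and value (no learned projections), and the `StreamingAttention` class is a hypothetical helper. It only shows that attending over cached past keys/values, one new token at a time, gives the same outputs as recomputing causal attention over the whole sequence.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention of one query over lists of keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def full_causal(tokens):
    """Offline-style reference: recompute causal attention for every position."""
    return [attend(tokens[t], tokens[: t + 1], tokens[: t + 1])
            for t in range(len(tokens))]

class StreamingAttention:
    """Hypothetical helper: caches past keys/values so each token is processed once."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, token):
        # Assumption: identity projections, so the token serves as key and value.
        self.keys.append(token)
        self.values.append(token)
        return attend(token, self.keys, self.values)

tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
stream = StreamingAttention()
streamed = [stream.step(tok) for tok in tokens]
reference = full_causal(tokens)

# Largest element-wise gap between streamed and recomputed outputs.
max_diff = max(abs(a - b)
               for row_s, row_r in zip(streamed, reference)
               for a, b in zip(row_s, row_r))
print("max difference:", max_diff)
```

The streaming path does work proportional to the cache length for each new token, whereas the reference recomputes every position whenever the sequence grows.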

On-the-Fly Online Reconstruction from Streaming Inputs

Installation

  1. Clone StreamVGGT
git clone https://github.com/wzzheng/StreamVGGT.git
cd StreamVGGT
  2. Create conda environment
conda create -n StreamVGGT python=3.11 cmake=3.14.0
conda activate StreamVGGT
  3. Install requirements
pip install -r requirements.txt
conda install 'llvm-openmp<16'
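After installation, a quick sanity check of the active environment can catch a wrong interpreter early. The snippet below is a minimal sketch: the `check_environment` helper is hypothetical, and the assumption that `torch` is among the packages installed by `requirements.txt` is not confirmed by this README.

```python
import importlib.util
import sys

def check_environment(required_python=(3, 11), modules=("torch",)):
    """Report whether the interpreter version matches and the modules are importable."""
    report = {"python_ok": sys.version_info[:2] == tuple(required_python)}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report

# Assumption: torch is expected in the StreamVGGT environment.
print(check_environment())
```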

Download Checkpoints

Please download the pretrained teacher model from here.

The StreamVGGT checkpoint is also available on both Hugging Face and Tsinghua Cloud.

Data Preparation

Training Datasets

Our training data includes 14 datasets. Please download the datasets from their official sources and refer to CUT3R for processing these datasets.

Evaluation Datasets

Please refer to MonST3R and Spann3R to prepare Sintel, Bonn, KITTI, NYU-v2, ScanNet, 7scenes and Neural-RGBD datasets.

Folder Structure

The overall folder structure should be organized as follows:

StreamVGGT
├── ckpt/
│   ├── model.pt
│   └── checkpoints.pth
├── config/
│   ├── ...
├── data/
│   ├── eval/
│   │   ├── 7scenes
│   │   ├── bonn
│   │   ├── kitti
│   │   ├── neural_rgbd
│   │   ├── nyu-v2
│   │   ├── scannetv2
│   │   └── sintel
│   ├── train/
│   │   ├── processed_arkitscenes
│   │   ├── ...
└── src/
    ├── ...
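To verify the layout before training or evaluation, a small script can list which of the expected entries are missing. This is a sketch only: the `missing_paths` helper is hypothetical, and the path list simply mirrors the entries named in the tree above (elided `...` entries are not guessed).

```python
from pathlib import Path

# Relative paths copied from the folder structure above.
EXPECTED = [
    "ckpt/model.pt",
    "ckpt/checkpoints.pth",
    "config",
    "data/eval/7scenes",
    "data/eval/bonn",
    "data/eval/kitti",
    "data/eval/neural_rgbd",
    "data/eval/nyu-v2",
    "data/eval/scannetv2",
    "data/eval/sintel",
    "data/train/processed_arkitscenes",
    "src",
]

def missing_paths(root, expected=EXPECTED):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]

if __name__ == "__main__":
    # Run from the StreamVGGT repository root.
    for rel in missing_paths("."):
        print(f"missing: {rel}")
```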

Finetuning VGGT

We also provide the following commands to fine-tune VGGT (excluding the track head).

cd src/
NCCL_DEBUG=TRACE TORCH_DISTRIBUTED_DEBUG=DETAIL HYDRA_FULL_ERROR=1 accelerate launch --multi_gpu --main_process_port 26902 ./finetune.py --config-name finetune

Training StreamVGGT

We provide the following commands for training.

cd src/
NCCL_DEBUG=TRACE TORCH_DISTRIBUTED_DEBUG=DETAIL HYDRA_FULL_ERROR=1 accelerate launch --multi_gpu --main_process_port 26902 ./train.py --config-name train

Evaluation

The evaluation code follows MonST3R, CUT3R and VGGT.

cd src/

Monodepth

bash eval/monodepth/run.sh 

Results will be saved in eval_results/monodepth/${data}_${model_name}/metric.json.

VideoDepth

bash eval/video_depth/run.sh 

Results will be saved in eval_results/video_depth/${data}_${model_name}/result_scale.json.

Multi-view Reconstruction

bash eval/mv_recon/run.sh 

Results will be saved in eval_results/mv_recon/${model_name}_${ckpt_name}/logs_all.txt.
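Since the monodepth and video-depth runs write JSON files under `eval_results/`, they can be gathered in one pass. The sketch below makes no assumption about the keys inside `metric.json` or `result_scale.json` and just loads whatever is there; the `collect_json_results` helper is hypothetical, and the multi-view `logs_all.txt` files are plain text, so they are not parsed here.

```python
import json
from pathlib import Path

def collect_json_results(root="eval_results"):
    """Map each JSON result file under `root` (relative path) to its parsed content."""
    root = Path(root)
    results = {}
    if not root.is_dir():
        return results
    for path in sorted(root.rglob("*.json")):
        with path.open() as f:
            results[path.relative_to(root).as_posix()] = json.load(f)
    return results

if __name__ == "__main__":
    # Run from src/, where the evaluation scripts write eval_results/.
    for name, content in collect_json_results().items():
        print(name, content)
```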

Camera Pose Estimation

  1. Install the required dependencies:
pip install pycolmap==3.10.0 pyceres==2.3
git clone https://github.com/cvg/LightGlue.git
cd LightGlue
python -m pip install -e .
cd ..
  2. Please refer to VGGT to prepare the co3d dataset.

  3. Run the evaluation code:

python eval/pose_evaluation/test_co3d.py --co3d_dir /YOUR/CO3D/PATH --co3d_anno_dir /YOUR/CO3D/ANNO/PATH --seed 0

Demo

We provide a demo for StreamVGGT, based on the demo code from VGGT. You can follow the instructions below to launch it locally or try it out directly on Hugging Face.

pip install -r requirements_demo.txt
python demo_gradio.py

Note: While StreamVGGT typically reconstructs a scene in under one second, 3D point visualization may take much longer due to slower third-party rendering.

Acknowledgements

Our code is based on the following brilliant repositories:

DUSt3R, MonST3R, Spann3R, CUT3R, VGGT, Point3R

Many thanks to these authors!

Citation

If you find this project helpful, please consider citing the following paper:

@article{streamVGGT,
      title={Streaming 4D Visual Geometry Transformer}, 
      author={Dong Zhuo and Wenzhao Zheng and Jiahe Guo and Yuqi Wu and Jie Zhou and Jiwen Lu},
      journal={arXiv preprint arXiv:2507.11539},
      year={2025}
}
