Streaming 4D Visual Geometry Transformer

Paper | Project Page | Online Demo

Dong Zhuo^*, Wenzhao Zheng^*$\dagger$, Jiahe Guo, Yuqi Wu, Jie Zhou, Jiwen Lu

^* Equal contribution. $\dagger$ Project leader.

StreamVGGT, a causal transformer architecture for real-time streaming 4D visual geometry perception that is compatible with LLM-targeted attention mechanisms (e.g., FlashAttention), delivers both fast inference and high-quality 4D reconstruction.

News

[2025/7/18] Demo and checkpoints released on Hugging Face; demo code is available for local launch.
[2025/7/15] Paper released on arXiv.
[2025/7/14] Code for fine-tuning VGGT released.
[2025/7/13] Check out Point3R for another streaming 3D reconstruction work of ours!
[2025/7/13] Distillation code for VGGT is released.
[2025/7/13] Inference code with FlashAttention-2 is released.
[2025/7/13] Training/evaluation code released.

Overview

Offline models must reprocess the entire image sequence and reconstruct the entire scene each time a new image arrives. In contrast, StreamVGGT employs temporal causal attention and leverages cached memory tokens to support efficient, incremental, on-the-fly reconstruction, enabling interactive and real-time online applications.
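To illustrate why cached tokens make incremental processing possible, here is a toy sketch in plain Python. It is not the StreamVGGT implementation: each frame is reduced to a single token vector, the query/key/value projections are the identity, and the names `attend`, `offline_causal`, and `StreamingAttention` are hypothetical. It only shows that attending over a growing cache gives the same result as recomputing causal attention over the full sequence.

```python
import math


def attend(query, keys, values):
    """Softmax attention of one query vector over lists of key/value vectors."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def offline_causal(tokens):
    """Offline style: recompute causal attention for every position from scratch."""
    return [attend(tokens[t], tokens[: t + 1], tokens[: t + 1]) for t in range(len(tokens))]


class StreamingAttention:
    """Streaming style: keep past keys/values in a cache and process one token per step."""

    def __init__(self):
        self.cached_keys = []
        self.cached_values = []

    def step(self, token):
        # Assumption for this toy: identity projections, so the token is its own key and value.
        self.cached_keys.append(token)
        self.cached_values.append(token)
        # The new token attends only to itself and to earlier, cached tokens.
        return attend(token, self.cached_keys, self.cached_values)
```

In this sketch, each call to `step` costs one attention pass over the cache, whereas calling `offline_causal` again after every new token would redo the work for all earlier positions.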

On-the-Fly Online Reconstruction from Streaming Inputs

Installation

Clone StreamVGGT

git clone https://github.com/wzzheng/StreamVGGT.git
cd StreamVGGT

Create conda environment

conda create -n StreamVGGT python=3.11 cmake=3.14.0
conda activate StreamVGGT

Install requirements

pip install -r requirements.txt
conda install 'llvm-openmp<16'

Download Checkpoints

Please download the pretrained teacher model from here.

The StreamVGGT checkpoint is also available on both Hugging Face and Tsinghua Cloud.

Data Preparation

Training Datasets

Our training data includes 14 datasets. Please download the datasets from their official sources and refer to CUT3R for processing these datasets.

Evaluation Datasets

Please refer to MonST3R and Spann3R to prepare Sintel, Bonn, KITTI, NYU-v2, ScanNet, 7scenes and Neural-RGBD datasets.

Folder Structure

The overall folder structure should be organized as follows:

StreamVGGT
├── ckpt/
│   ├── model.pt
│   └── checkpoints.pth
├── config/
│   └── ...
├── data/
│   ├── eval/
│   │   ├── 7scenes
│   │   ├── bonn
│   │   ├── kitti
│   │   ├── neural_rgbd
│   │   ├── nyu-v2
│   │   ├── scannetv2
│   │   └── sintel
│   └── train/
│       ├── processed_arkitscenes
│       └── ...
└── src/
    └── ...
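As a convenience, the layout above can be checked with a short script before training or evaluation. This is a minimal sketch, not part of the repository: `EXPECTED_PATHS` and `missing_paths` are hypothetical names, and the list contains only the entries that the tree above spells out (elided `...` entries are not guessed).

```python
from pathlib import Path

# Relative paths copied from the folder structure above.
EXPECTED_PATHS = [
    "ckpt/model.pt",
    "ckpt/checkpoints.pth",
    "config",
    "data/eval/7scenes",
    "data/eval/bonn",
    "data/eval/kitti",
    "data/eval/neural_rgbd",
    "data/eval/nyu-v2",
    "data/eval/scannetv2",
    "data/eval/sintel",
    "data/train/processed_arkitscenes",
    "src",
]


def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # Run from the StreamVGGT repository root.
    for rel in missing_paths("."):
        print(f"missing: {rel}")
```

If you only plan to run a subset of the benchmarks, pass a shorter list as `expected`.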

Finetuning VGGT

We also provide the following commands for fine-tuning VGGT (excluding the track head).

cd src/
NCCL_DEBUG=TRACE TORCH_DISTRIBUTED_DEBUG=DETAIL HYDRA_FULL_ERROR=1 accelerate launch --multi_gpu --main_process_port 26902 ./finetune.py --config-name finetune

Training StreamVGGT

We provide the following commands for training.

cd src/
NCCL_DEBUG=TRACE TORCH_DISTRIBUTED_DEBUG=DETAIL HYDRA_FULL_ERROR=1 accelerate launch --multi_gpu --main_process_port 26902 ./train.py --config-name train
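The fine-tuning and training commands above share the same debug environment variables and `accelerate` flags and differ only in the script and config name. If you prefer to launch them from Python (for example, to change the port when 26902 is already in use), a minimal sketch could look like this. `DEBUG_ENV` and `build_launch` are hypothetical helper names, not part of the repository; the flags and values are copied from the commands above.

```python
import os
import shlex

# Debug environment variables used by the launch commands above.
DEBUG_ENV = {
    "NCCL_DEBUG": "TRACE",
    "TORCH_DISTRIBUTED_DEBUG": "DETAIL",
    "HYDRA_FULL_ERROR": "1",
}


def build_launch(script, config_name, port=26902):
    """Return (argv, env) for an `accelerate launch` call like the ones above."""
    argv = [
        "accelerate", "launch",
        "--multi_gpu",
        "--main_process_port", str(port),
        script,
        "--config-name", config_name,
    ]
    # Layer the debug variables on top of the current environment.
    env = {**os.environ, **DEBUG_ENV}
    return argv, env


if __name__ == "__main__":
    argv, _ = build_launch("./train.py", "train")
    print(shlex.join(argv))
```

To actually start a run, pass the result to `subprocess.run(argv, env=env, cwd="src", check=True)` from the repository root.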

Evaluation

The evaluation code follows MonST3R, CUT3R and VGGT.

cd src/

Monodepth

bash eval/monodepth/run.sh

Results will be saved in eval_results/monodepth/${data}_${model_name}/metric.json.

VideoDepth

bash eval/video_depth/run.sh

Results will be saved in eval_results/video_depth/${data}_${model_name}/result_scale.json.

Multi-view Reconstruction

bash eval/mv_recon/run.sh

Results will be saved in eval_results/mv_recon/${model_name}_${ckpt_name}/logs_all.txt.
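The three result locations above follow fixed patterns, so they can be built programmatically when collecting numbers across runs. The sketch below is a hypothetical helper (`RESULT_TEMPLATES` and `result_path` are not part of the repository); the patterns are copied from the text above, and the paths are presumably relative to the directory the scripts are run from. The example values `sintel` and `StreamVGGT` are placeholders, since the exact names used by the scripts are not stated here.

```python
from pathlib import Path

# Output locations as described above, with ${...} rewritten as {...} placeholders.
RESULT_TEMPLATES = {
    "monodepth": "eval_results/monodepth/{data}_{model_name}/metric.json",
    "video_depth": "eval_results/video_depth/{data}_{model_name}/result_scale.json",
    "mv_recon": "eval_results/mv_recon/{model_name}_{ckpt_name}/logs_all.txt",
}


def result_path(task, **names):
    """Fill in the template for `task`; raises KeyError if a placeholder is missing."""
    return Path(RESULT_TEMPLATES[task].format(**names))


if __name__ == "__main__":
    # Placeholder values; substitute the names your run actually used.
    print(result_path("monodepth", data="sintel", model_name="StreamVGGT"))
```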

Camera Pose Estimation

Install the required dependencies:

pip install pycolmap==3.10.0 pyceres==2.3
git clone https://github.com/cvg/LightGlue.git
cd LightGlue
python -m pip install -e .
cd ..

Please refer to VGGT to prepare the CO3D dataset.
Run the evaluation code:

python eval/pose_evaluation/test_co3d.py --co3d_dir /YOUR/CO3D/PATH --co3d_anno_dir /YOUR/CO3D/ANNO/PATH --seed 0

Demo

We provide a demo for StreamVGGT, based on the demo code from VGGT. You can follow the instructions below to launch it locally or try it out directly on Hugging Face.

pip install -r requirements_demo.txt
python demo_gradio.py

Note: While StreamVGGT typically reconstructs a scene in under one second, 3D point visualization may take much longer due to slower third-party rendering.

Acknowledgements

Our code is based on the following brilliant repositories:

DUSt3R MonST3R Spann3R CUT3R VGGT Point3R

Many thanks to these authors!

Citation

If you find this project helpful, please consider citing the following paper:

@article{streamVGGT,
      title={Streaming 4D Visual Geometry Transformer}, 
      author={Dong Zhuo and Wenzhao Zheng and Jiahe Guo and Yuqi Wu and Jie Zhou and Jiwen Lu},
      journal={arXiv preprint arXiv:2507.11539},
      year={2025}
}
