Streaming 4D Visual Geometry Transformer
Dong Zhuo*, Wenzhao Zheng*$\dagger$, Jiahe Guo, Yuqi Wu, Jie Zhou, Jiwen Lu

\* Equal contribution.
StreamVGGT is a causal transformer architecture for real-time streaming 4D visual geometry perception. It is compatible with LLM-targeted attention mechanisms (e.g., FlashAttention) and delivers both fast inference and high-quality 4D reconstruction.
- [2025/7/18] Demo and checkpoints released on Hugging Face; demo code is available for local launch.
- [2025/7/15] Paper released on arXiv.
- [2025/7/14] Released the code for fine-tuning VGGT.
- [2025/7/13] Check out Point3R for another streaming 3D reconstruction work of ours!
- [2025/7/13] Distillation code for VGGT is released.
- [2025/7/13] Inference code with FlashAttention-2 is released.
- [2025/7/13] Training and evaluation code is released.
Given a sequence of images, offline models must reprocess the entire sequence and reconstruct the whole scene each time a new image arrives. In contrast, StreamVGGT employs temporal causal attention and caches memory tokens to support efficient incremental on-the-fly reconstruction, enabling interactive, real-time online applications.
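The cached-memory-token idea can be illustrated with a toy, framework-free sketch (plain Python, single attention head; the names `attend` and `CausalCache` are ours for illustration, not the StreamVGGT API): each incoming frame appends its key/value to a cache and attends over the accumulated history, so past frames never need to be re-encoded.

```python
import math

def attend(q, keys, values):
    """Scaled dot-product attention for one query vector over cached keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of cached value vectors.
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

class CausalCache:
    """Toy memory-token cache: each new frame attends only to itself and the past."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, k, v, q):
        # Append the new frame's key/value, then attend over the full history.
        # Per-frame cost is O(t) instead of reprocessing all t frames from scratch.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)
```

Because the cache only ever grows, the incremental output at step *t* matches what a full recomputation over the first *t* frames would produce, which is what makes streaming inference equivalent to the offline causal pass.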
- Clone StreamVGGT:

```bash
git clone https://github.com/wzzheng/StreamVGGT.git
cd StreamVGGT
```

- Create the conda environment:

```bash
conda create -n StreamVGGT python=3.11 cmake=3.14.0
conda activate StreamVGGT
```

- Install the requirements:

```bash
pip install -r requirements.txt
conda install 'llvm-openmp<16'
```

Please download the pretrained teacher model from here.
The StreamVGGT checkpoint is also available on both Hugging Face and Tsinghua Cloud.
Our training data includes 14 datasets. Please download the datasets from their official sources and refer to CUT3R for processing these datasets.
- ARKitScenes
- BlendedMVS
- CO3Dv2
- MegaDepth
- MVS-Synth
- ScanNet++
- ScanNet
- Spring
- Hypersim
- WildRGB-D
- Waymo Open Dataset
- Virtual KITTI 2
- OmniObject3D
- PointOdyssey
Please refer to MonST3R and Spann3R to prepare Sintel, Bonn, KITTI, NYU-v2, ScanNet, 7scenes and Neural-RGBD datasets.
The overall folder structure should be organized as follows:
```
StreamVGGT
├── ckpt/
│   ├── model.pt
│   └── checkpoints.pth
├── config/
│   └── ...
├── data/
│   ├── eval/
│   │   ├── 7scenes
│   │   ├── bonn
│   │   ├── kitti
│   │   ├── neural_rgbd
│   │   ├── nyu-v2
│   │   ├── scannetv2
│   │   └── sintel
│   └── train/
│       ├── processed_arkitscenes
│       └── ...
└── src/
    └── ...
```
We also provide the following commands to fine-tune VGGT (excluding the track head):

```bash
cd src/
NCCL_DEBUG=TRACE TORCH_DISTRIBUTED_DEBUG=DETAIL HYDRA_FULL_ERROR=1 accelerate launch --multi_gpu --main_process_port 26902 ./finetune.py --config-name finetune
```

We provide the following commands for training:

```bash
cd src/
NCCL_DEBUG=TRACE TORCH_DISTRIBUTED_DEBUG=DETAIL HYDRA_FULL_ERROR=1 accelerate launch --multi_gpu --main_process_port 26902 ./train.py --config-name train
```

The evaluation code follows MonST3R, CUT3R, and VGGT.
```bash
cd src/
bash eval/monodepth/run.sh
```

Results will be saved in `eval_results/monodepth/${data}_${model_name}/metric.json`.

```bash
bash eval/video_depth/run.sh
```

Results will be saved in `eval_results/video_depth/${data}_${model_name}/result_scale.json`.

```bash
bash eval/mv_recon/run.sh
```

Results will be saved in `eval_results/mv_recon/${model_name}_${ckpt_name}/logs_all.txt`.
- Install the required dependencies:

```bash
pip install pycolmap==3.10.0 pyceres==2.3
git clone https://github.com/cvg/LightGlue.git
cd LightGlue
python -m pip install -e .
cd ..
```

- Please refer to VGGT to prepare the CO3D dataset.
- Run the evaluation code:

```bash
python eval/pose_evaluation/test_co3d.py --co3d_dir /YOUR/CO3D/PATH --co3d_anno_dir /YOUR/CO3D/ANNO/PATH --seed 0
```

We provide a demo for StreamVGGT, based on the demo code from VGGT. You can follow the instructions below to launch it locally or try it out directly on Hugging Face.
```bash
pip install -r requirements_demo.txt
python demo_gradio.py
```

Note: While StreamVGGT typically reconstructs a scene in under one second, 3D point visualization may take much longer due to slower third-party rendering.
Our code is based on the following brilliant repositories:
DUSt3R MonST3R Spann3R CUT3R VGGT Point3R
Many thanks to these authors!
If you find this project helpful, please consider citing the following paper:
```bibtex
@article{streamVGGT,
  title={Streaming 4D Visual Geometry Transformer},
  author={Dong Zhuo and Wenzhao Zheng and Jiahe Guo and Yuqi Wu and Jie Zhou and Jiwen Lu},
  journal={arXiv preprint arXiv:2507.11539},
  year={2025}
}
```

