Training a Vision-Language-Action (VLA) model for GUI and computer-use tasks by watching online tutorials. Fully open-sourced dataset, model, and training pipeline. A cost-efficient solution for GUI task data generation.
A VLA model and agent framework designed for GUI tasks.
📑 Paper | 🤗 HuggingFace Collections (Models & Datasets) | 🤖 ModelScope Collections (Models & Datasets) | 🤗 Spaces Demo | 🌐 Webpage
TongUI: Internet-Scale Trajectories from Multimodal Web Tutorials for Generalized GUI Agents
Bofei Zhang*, Zirui Shang*, Zhi Gao*, Wang Zhang, Rui Xie, Xiaojian Ma, Yuan Tao, Xinxiao Wu, Song-Chun Zhu, Qing Li✉
- Release all experiment and evaluation scripts [WIP].
- [2025.11.14] Accepted by AAAI 2026👑👑👑
- [2025.10.23] Release TongUI-Absolute, a dataset annotated with absolute coordinate labels.
- [2025.07.10] Release crawler code and intermediate crawler data🤗. Feel free to build your own SFT dataset from them!
- [2025.06.16] Submitted evaluation results to UI-Vision! Check out the results here and how to reproduce them here.
- [2025.05.27] Release TongUI-32B model and Training Details.
- [2025.05.06] Release TongUI-7B model and GUI-Net-1M dataset.
- [2025.04.21] Release 🔧 Training pipeline.
- [2025.04.17] Release TongUI-3B model.
Key findings
- Training with this cost-efficient dataset gives SOTA👑 performance on multiple GUI benchmarks!
- Training with the 1M version of the dataset makes performance scale up🚀!
Results on ScreenSpot; † means the results are reproduced by us. We report results on the six ScreenSpot splits and the average score. The best method is marked in bold. (1M) means the model is trained on the 1M version of the dataset.
| Model | Data Num | Data Size | Desktop Icon | Desktop Text | Mobile Icon | Mobile Text | Web Icon | Web Text | Average |
|---|---|---|---|---|---|---|---|---|---|
| SeeClick-9.6B | 364K | - | 30.0 | 72.2 | 52.0 | 78.0 | 32.5 | 55.7 | 53.4 |
| UGround-7B | 1.3M | - | 63.6 | 82.5 | 60.3 | 82.8 | 80.4 | 73.3 | 70.4 |
| OmniParser-GPT-4V | - | - | 63.6 | 91.3 | 57.0 | 93.9 | 51.0 | 81.3 | 73.0 |
| ShowUI-2B | 256K | 0.72B | 61.1 | 76.3 | 75.5 | 92.3 | 63.6 | 81.7 | 75.1 |
| Qwen2.5-VL-3B † | - | - | 7.8 | 22.2 | 5.2 | 8.4 | 1.7 | 2.4 | 8.0 |
| Qwen2.5-VL-7B † | - | - | 16.4 | 26.8 | 5.2 | 6.6 | 7.3 | 13.0 | 12.6 |
| TongUI-3B | 399K | 1.24B | 68.5 | 86.5 | 76.0 | 90.5 | 68.4 | 87.4 | 79.6 |
| TongUI-7B | 399K | 1.24B | 75.0 | 91.2 | 79.9 | 93.0 | 72.3 | 88.7 | 83.4 |
| TongUI-3B(1M) | 1.3M | - | 77.1 | 92.3 | 77.7 | 92.6 | 74.8 | 87.8 | 83.6 |
| TongUI-7B(1M) | 1.3M | - | 80.0 | 93.8 | 79.5 | 91.9 | 81.6 | 89.1 | 86.0 |
| TongUI-32B(1M) | 1.3M | - | 80.0 | 94.8 | 84.3 | 96.3 | 84.5 | 91.3 | 88.5 |
Results on Mind2Web. We report results on three settings: cross-task, cross-website, and cross-domain. Elem. Acc measures whether the element is selected correctly, OP. F1 is the F1 score of the predicted action, and Step SR is the step success rate. (1M) means the model is trained on the 1M version of the dataset.
| Method | Cross-Task Elem. Acc | Cross-Task OP. F1 | Cross-Task Step SR | Cross-Website Elem. Acc | Cross-Website OP. F1 | Cross-Website Step SR | Cross-Domain Elem. Acc | Cross-Domain OP. F1 | Cross-Domain Step SR |
|---|---|---|---|---|---|---|---|---|---|
| CogAgent | 22.4 | 53.0 | 17.6 | 18.4 | 42.4 | 13.4 | 20.6 | 42.0 | 15.5 |
| MindAct | 55.1 | 75.7 | 52.0 | 42.0 | 65.2 | 38.9 | 42.1 | 66.5 | 39.6 |
| OmniParser | 42.4 | 87.6 | 39.4 | 41.0 | 84.8 | 36.5 | 45.5 | 85.7 | 42.0 |
| ShowUI-2B | 39.9 | 88.6 | 37.2 | 41.6 | 83.5 | 35.1 | 39.4 | 86.8 | 35.2 |
| SeeClick-9.6B | 28.3 | 87.0 | 25.5 | 21.4 | 80.6 | 16.4 | 23.2 | 84.8 | 20.8 |
| Qwen2.5-VL-3B † | 2.5 | 14.5 | 0.4 | 2.7 | 12.6 | 1.0 | 3.3 | 24.2 | 1.7 |
| Qwen2.5-VL-7B † | 6.2 | 72.8 | 5.0 | 6.3 | 68.2 | 4.5 | 8.4 | 73.6 | 7.2 |
| Qwen2.5-VL-3B-ShowUI | 43.2 | 88.7 | 39.7 | 41.3 | 86.7 | 35.5 | 45.1 | 86.1 | 40.7 |
| TongUI-3B | 48.0 | 88.4 | 44.2 | 48.9 | 85.4 | 42.6 | 50.0 | 87.7 | 46.0 |
| TongUI-7B | 51.1 | 88.7 | 46.9 | 50.4 | 87.5 | 43.7 | 53.9 | 88.6 | 49.1 |
| TongUI-3B(1M) | 53.4 | 89.0 | 48.8 | 54.2 | 86.4 | 48.1 | 53.8 | 88.2 | 49.5 |
| TongUI-7B(1M) | 58.1 | 88.7 | 53.4 | 55.6 | 87.2 | 49.0 | 57.6 | 88.7 | 52.9 |
| TongUI-32B(1M) | 57.2 | 88.1 | 52.4 | 57.4 | 85.8 | 50.6 | 59.2 | 87.8 | 54.1 |
For other experiments, please refer to our paper.
We use uv to manage the dependencies.
```bash
uv sync --all-groups
```
To install the dependencies with conda and pip instead:
```bash
conda create -n tongui python=3.12
conda activate tongui
pip install -e .
```
To execute any script with uv, use the following command.
```bash
uv run <script_name>.py
```
Just replace uv with python if you installed the dependencies with conda and pip.
```bash
python <script_name>.py
```
We host an online Gradio demo on Hugging Face Spaces. Please feel free to try it. We also open-source the code for this demo, so feel free to run it locally.
```bash
git clone https://huggingface.co/spaces/Bofeee5675/TongUI
cd TongUI
uv run app.py
```
You can programmatically call the TongUI API with the following code.
```bash
uv run examples/api.py
```
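If you want to call the hosted Spaces demo from your own scripts, a minimal sketch with gradio_client is shown below. The Space ID comes from the demo above, but its endpoint names and argument order are not documented here, so inspect them with view_api() first; examples/api.py remains the authoritative reference.
```python
# Sketch: query the hosted Gradio demo programmatically with gradio_client.
# Endpoint names and arguments are not listed in this README, so list them
# with view_api() before calling client.predict().
from gradio_client import Client

client = Client("Bofeee5675/TongUI")  # Hugging Face Space hosting the demo
client.view_api()  # prints the available endpoints and their expected inputs/outputs
```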
You can serve the model with vLLM:
```bash
uv run vllm serve Bofeee5675/TongUI-3B --port 8000 --served-model-name tongui-3b --limit-mm-per-prompt image=3
```
Then you can call the model through an OpenAI-compatible API. Check out examples/call_vllm.py for more details.
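As a rough sketch (not necessarily the exact contents of examples/call_vllm.py), a request with the openai Python client against the server above could look like the following; the screenshot path and instruction text are placeholders.
```python
# Sketch: query the vLLM server above through its OpenAI-compatible API.
# "screenshot.png" and the instruction text are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="tongui-3b",  # must match --served-model-name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "Click the search box."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```
Or run the provided example script: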
```bash
uv run examples/call_vllm.py
```
Check out examples/inference.py for local inference.
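As a rough sketch of what local inference looks like (assuming the TongUI checkpoints keep the Qwen2.5-VL interface; examples/inference.py is the authoritative version), you can load the model with Hugging Face transformers:
```python
# Sketch: local inference with transformers, assuming a Qwen2.5-VL-style checkpoint.
# "screenshot.png" and the instruction text are placeholders.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "Bofeee5675/TongUI-3B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "screenshot.png"},
            {"type": "text", "text": "Click the search box."},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```
Or run the example script directly: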
```bash
uv run examples/inference.py
```
For detailed information about model training, including hyperparameters, data preprocessing, and training configurations, please refer to our Training Documentation.
For comprehensive experimental results, ablation studies, and evaluation details, please check our Experiments Documentation.
We thank the following projects for their wonderful work.
- We adopt the experiment setup and data preprocessing pipeline from ShowUI.
- We train our models with LLaMA-Factory.
- Thanks to the Qwen2.5-VL model series and UI-TARS for their great work.
If you find this work useful in your research, please consider citing:
```bibtex
@article{zhang2025tongui,
  title={TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials},
  author={Zhang, Bofei and Shang, Zirui and Gao, Zhi and Zhang, Wang and Xie, Rui and Ma, Xiaojian and Yuan, Tao and Wu, Xinxiao and Zhu, Song-Chun and Li, Qing},
  journal={arXiv preprint arXiv:2504.12679},
  year={2025}
}
```

