[🌐Project Page] [📄Paper] [🤗Hugging Face Space] [💻Model Zoo] [📖Introduction] [🎥Video]
- [2024/10/29] We released the code for the local demo.
- [2024/10/29] Vision Search Assistant was released on arXiv.
- Clone this repository and navigate to the VSA folder.
```shell
git clone https://github.com/cnzzx/VSA.git
cd VSA
```
- Create a conda environment.
```shell
conda create -n vsa python=3.10
conda activate vsa
```
- Install LLaVA.
```shell
cd models/LLaVA
pip install -e .
```
- Install other requirements.
```shell
pip install -r requirements.txt
```
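Optionally, a quick import check can confirm that the editable LLaVA install and the main dependencies resolved correctly. This is a hypothetical snippet, assuming `torch`, `transformers`, and `gradio` are pulled in by `requirements.txt` (the LLaVA repository installs as the `llava` package):

```python
# Hypothetical sanity check (not part of the repo): run inside the "vsa" env.
import llava  # installed via "pip install -e ." in models/LLaVA
import gradio
import torch
import transformers

print("torch", torch.__version__, "| transformers", transformers.__version__)
```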
The local demo is based on Gradio; you can run it with:
```shell
python app.py
```
- In the "Run" UI, you can upload one image in the "Input Image" panel, and type in your question in the "Input Text Prompt" panel. Then, click submit and wait for model inference.
- You can also customize the object classes for detection in the "Ground Classes" panel. Separate classes with a comma followed by a space, for example "handbag, backpack, suitcase".
- On the right are intermediate outputs: "Query Output" shows the queries generated for searching, and "Search Output" shows the web knowledge retrieved for each object.
We provide some samples to get you started. In the "Samples" UI, select one in the "Samples" panel and click "Select This Sample"; the sample input will then be filled into the "Run" UI.
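For orientation, here is a minimal sketch of how a Gradio Blocks interface with the panels described above could be wired up. It is not the repository's `app.py`: the panel names follow the description above, and `run_vsa` is a hypothetical placeholder for the actual detection, web search, and answering pipeline. Note how the "Ground Classes" string is split on commas.

```python
# A hypothetical sketch of a Gradio Blocks layout with the panels described
# above. It is NOT the repository's app.py; run_vsa is a placeholder for the
# actual detection -> web search -> answering pipeline.
import gradio as gr


def run_vsa(image, prompt, classes):
    # The "Ground Classes" string is assumed to be comma-separated,
    # e.g. "handbag, backpack, suitcase".
    class_list = [c.strip() for c in classes.split(",") if c.strip()]
    # Placeholder outputs; the real app would return generated queries,
    # retrieved web knowledge, and the final answer.
    return (
        f"(queries for classes: {class_list})",
        "(web knowledge per object)",
        f"(answer to: {prompt})",
    )


with gr.Blocks() as demo:
    with gr.Row():
        with gr.Column():
            image = gr.Image(label="Input Image", type="pil")
            prompt = gr.Textbox(label="Input Text Prompt")
            classes = gr.Textbox(
                label="Ground Classes", placeholder="handbag, backpack, suitcase"
            )
            submit = gr.Button("Submit")
        with gr.Column():
            query_out = gr.Textbox(label="Query Output")
            search_out = gr.Textbox(label="Search Output")
            answer_out = gr.Textbox(label="Answer")
    submit.click(
        run_vsa,
        inputs=[image, prompt, classes],
        outputs=[query_out, search_out, answer_out],
    )

demo.launch()
```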
You can also chat with our Vision Search Assistant in the terminal by running:
```shell
python cli.py \
    --vlm-model "liuhaotian/llava-v1.6-vicuna-7b" \
    --ground-model "IDEA-Research/grounding-dino-base" \
    --search-model "internlm/internlm2_5-7b-chat" \
    --vlm-load-4bit
```
Then, select an image and type your question.
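As a rough illustration of what the `--ground-model` flag points at, the sketch below runs open-vocabulary detection with Grounding DINO through the Hugging Face `transformers` API. This is an assumption-level example, not the repository's `cli.py`; the image path is hypothetical, and argument names of the post-processing call may vary slightly across `transformers` versions.

```python
# Assumption-level sketch of the open-vocabulary grounding step that
# --ground-model refers to, using the Hugging Face transformers API for
# Grounding DINO. It is not the repository's cli.py.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-base"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("example.jpg")  # hypothetical input image
# Unlike the demo's comma-separated "Ground Classes" field, the Grounding DINO
# processor expects lowercase class names, each terminated by a period.
text = "handbag. backpack. suitcase."

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep only confident box/phrase matches and rescale boxes to the image size.
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)
print(results[0]["boxes"], results[0]["labels"])
```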
This project is released under the Apache 2.0 license.
Vision Search Assistant is greatly inspired by the following outstanding contributions to the open-source community: GroundingDINO, LLaVA, MindSearch.
If you find this project useful in your research, please consider citing:
```bibtex
@article{zhang2024visionsearchassistantempower,
  title={Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines},
  author={Zhang, Zhixin and Zhang, Yiyuan and Ding, Xiaohan and Yue, Xiangyu},
  journal={arXiv preprint arXiv:2410.21220},
  year={2024}
}
```