Large language models (LLMs) exhibit a variety of promising capabilities in robotics, including long-horizon planning and commonsense reasoning. However, their potential for place recognition remains underexplored. In this work, we introduce multimodal LLMs (MLLMs) to visual place recognition (VPR), where a robot must localize itself using visual observations. Our key idea is to use vision-based retrieval to propose several candidates and then apply language-based reasoning to carefully inspect each candidate for a final decision. Specifically, we leverage the robust visual features produced by off-the-shelf vision foundation models (VFMs) to obtain several candidate locations. We then prompt an MLLM to describe the differences between the current observation and each candidate in a pairwise manner, and reason about the best candidate based on these descriptions. Our method is termed LLM-VPR. Results on three datasets demonstrate that integrating the general-purpose visual features from VFMs with the reasoning capabilities of MLLMs already provides an effective place recognition solution, without any VPR-specific supervised training. We believe LLM-VPR can inspire new possibilities for applying and designing foundation models, i.e., VFMs, LLMs, and MLLMs, to enhance the localization and navigation of mobile robots.
We first use a vision foundation model (VFM) to extract visual features from the query image and the database images. We then compute the cosine similarity between the query feature and each database feature, and select the top-k database images with the highest similarity scores as the candidate locations.
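The sketch below illustrates this retrieval stage. It assumes DINOv2 as the VFM backbone and uses an illustrative image size and top-k value; these choices are for demonstration and may differ from the exact configuration in the paper and code repository.

```python
# Retrieval stage: VFM features + cosine-similarity top-k (illustrative sketch).
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

# Load an off-the-shelf vision foundation model; DINOv2 is one possible choice.
vfm = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
vfm.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_feature(image_path: str) -> torch.Tensor:
    """Return an L2-normalized global descriptor for one image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    feat = vfm(img)                       # [1, D] global (CLS) feature
    return F.normalize(feat, dim=-1)

def retrieve_candidates(query_path: str, db_paths: list[str], k: int = 3) -> list[str]:
    """Rank database images by cosine similarity to the query and return the top-k paths."""
    q = extract_feature(query_path)                             # [1, D]
    db = torch.cat([extract_feature(p) for p in db_paths])      # [N, D]
    sims = (db @ q.T).squeeze(-1)                               # cosine similarity (features are normalized)
    topk = torch.topk(sims, k=min(k, len(db_paths))).indices
    return [db_paths[i] for i in topk.tolist()]
```

In practice, database features would be extracted once and cached rather than recomputed per query.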
We then use a multimodal large language model (MLLM) to select the best candidate location. We prompt the MLLM to describe the differences between the query image and each candidate image in a pairwise manner, and then to reason over these descriptions to decide which candidate matches the query.
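A minimal sketch of this reasoning stage is shown below, assuming an OpenAI-hosted MLLM (here `gpt-4o`) accessed through the `openai` Python package. The prompts are abbreviated illustrations, not the exact ones used in the paper; see the paper and code repository for the full prompts.

```python
# Reasoning stage: pairwise difference descriptions, then a final decision (illustrative sketch).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def describe_difference(query_path: str, candidate_path: str) -> str:
    """Ask the MLLM to describe differences between the query and one candidate."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Describe the key visual differences between these two street-view images, "
                    "focusing on buildings, road layout, signs, and vegetation."
                )},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(query_path)}"}},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(candidate_path)}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def select_best_candidate(query_path: str, candidate_paths: list[str]) -> str:
    """Aggregate the pairwise descriptions and ask the MLLM for a final decision."""
    diffs = [describe_difference(query_path, p) for p in candidate_paths]
    summary = "\n\n".join(f"Candidate {i + 1}: {d}" for i, d in enumerate(diffs))
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "A query image was compared against several candidate images of places. "
                "Based on the pairwise difference descriptions below, decide which candidate most "
                "likely shows the same place as the query, and answer with the candidate number.\n\n"
                + summary
            ),
        }],
    )
    return resp.choices[0].message.content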
We evaluate LLM-VPR on three datasets. Quantitative and qualitative results indicate that our method outperforms vision-only solutions and performs comparably to supervised methods, without any training overhead. Evaluation results are listed in the table below. The best performances are in bold and the second best are underlined. Please refer to our paper for more detailed results and discussions.
Below are some qualitative results, including candidate images, abbreviated versions of the prompts, and the generated responses. For complete prompts and responses, please refer to the paper and the code repository.
@misc{lyu2024tellaremultimodalllms,
      title={Tell Me Where You Are: Multimodal LLMs Meet Place Recognition},
      author={Zonglin Lyu and Juexiao Zhang and Mingxuan Lu and Yiming Li and Chen Feng},
      year={2024},
      eprint={2406.17520},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.17520},
}