Large language models (LLMs) exhibit a variety of promising capabilities in robotics, including long-horizon planning and commonsense reasoning. However, their potential for place recognition remains underexplored. In this work, we introduce multimodal LLMs (MLLMs) to visual place recognition (VPR), where a robot must localize itself using visual observations. Our key idea is to use vision-based retrieval to propose several candidates and then apply language-based reasoning to carefully inspect each candidate for a final decision. Specifically, we leverage the robust visual features produced by off-the-shelf vision foundation models (VFMs) to obtain several candidate locations. We then prompt an MLLM to describe the differences between the current observation and each candidate in a pairwise manner, and reason about the best candidate based on these descriptions. Our method is termed LLM-VPR. Results on three datasets demonstrate that integrating the general-purpose visual features from VFMs with the reasoning capabilities of MLLMs already provides an effective place recognition solution, without any VPR-specific supervised training. We believe LLM-VPR can inspire new possibilities for applying and designing foundation models, i.e., VFMs, LLMs, and MLLMs, to enhance the localization and navigation of mobile robots.
We first use a vision foundation model (VFM) to extract visual features from the query image and the database images. We then compute the cosine similarity between the query feature and each database feature, and select the top-k database images with the highest similarity scores as the candidate locations.
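The sketch below illustrates this retrieval stage. It assumes DINOv2 as the VFM backbone and uses an illustrative image size and top-k value; these choices are for demonstration and may differ from the exact configuration in the paper and code repository.

```python
# Retrieval stage: VFM features + cosine-similarity top-k (illustrative sketch).
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

# Load an off-the-shelf vision foundation model; DINOv2 is one possible choice.
vfm = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
vfm.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_feature(image_path: str) -> torch.Tensor:
    """Return an L2-normalized global descriptor for one image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    feat = vfm(img)                       # [1, D] global (CLS) feature
    return F.normalize(feat, dim=-1)

def retrieve_candidates(query_path: str, db_paths: list[str], k: int = 3) -> list[str]:
    """Rank database images by cosine similarity to the query and return the top-k paths."""
    q = extract_feature(query_path)                             # [1, D]
    db = torch.cat([extract_feature(p) for p in db_paths])      # [N, D]
    sims = (db @ q.T).squeeze(-1)                               # cosine similarity (features are normalized)
    topk = torch.topk(sims, k=min(k, len(db_paths))).indices
    return [db_paths[i] for i in topk.tolist()]
```

In practice, database features would be extracted once and cached rather than recomputed per query.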
We then use a multimodal large language model (MLLM) to select the best candidate location. We prompt the MLLM to describe the differences between the query image and each candidate image in a pairwise manner, and then to reason over these descriptions to decide which candidate matches the query.
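A minimal sketch of this reasoning stage is shown below, assuming an OpenAI-hosted MLLM (here `gpt-4o`) accessed through the `openai` Python package. The prompts are abbreviated illustrations, not the exact ones used in the paper; see the paper and code repository for the full prompts.

```python
# Reasoning stage: pairwise difference descriptions, then a final decision (illustrative sketch).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def describe_difference(query_path: str, candidate_path: str) -> str:
    """Ask the MLLM to describe differences between the query and one candidate."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Describe the key visual differences between these two street-view images, "
                    "focusing on buildings, road layout, signs, and vegetation."
                )},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(query_path)}"}},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(candidate_path)}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def select_best_candidate(query_path: str, candidate_paths: list[str]) -> str:
    """Aggregate the pairwise descriptions and ask the MLLM for a final decision."""
    diffs = [describe_difference(query_path, p) for p in candidate_paths]
    summary = "\n\n".join(f"Candidate {i + 1}: {d}" for i, d in enumerate(diffs))
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "A query image was compared against several candidate images of places. "
                "Based on the pairwise difference descriptions below, decide which candidate most "
                "likely shows the same place as the query, and answer with the candidate number.\n\n"
                + summary
            ),
        }],
    )
    return resp.choices[0].message.content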
We evaluate LLM-VPR on three datasets. Quantitative and qualitative results indicate that our method outperforms vision-only solutions and performs comparably to supervised methods, without any training overhead. Evaluation results are listed in the table below. The best performances are in bold and the second best are underlined. Please refer to our paper for more detailed results and discussions.
Below are some qualitative results, including candidate images, abbreviated versions of the prompts, and the generated responses. For complete prompts and responses, please refer to the paper and the code repository.
@misc{lyu2024tellaremultimodalllms,
      title={Tell Me Where You Are: Multimodal LLMs Meet Place Recognition},
      author={Zonglin Lyu and Juexiao Zhang and Mingxuan Lu and Yiming Li and Chen Feng},
      year={2024},
      eprint={2406.17520},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.17520},
}