Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation?


Repository: text2nav

Accepted to Robotics: Science and Systems (RSS) 2025 Workshop on Robot Planning in the Era of Foundation Models (FM4RoboPlan)

📝 Overview

This repository contains the implementation for our research investigating whether frozen vision-language model (VLM) embeddings can guide robot navigation without fine-tuning or specialized architectures. We present a minimalist framework that achieves a 74% success rate in language-guided navigation using only pretrained SigLIP embeddings.

🎯 Key Findings

  • 🎯 74% success rate using frozen VLM embeddings alone (vs. 100% for the privileged expert)
  • 🔍 3.2× longer paths than the privileged expert, revealing efficiency limitations
  • 📊 SigLIP outperforms CLIP and ViLT for navigation (74% vs. 62% vs. 40% success)
  • ⚖️ Clear performance-complexity tradeoffs for resource-constrained applications
  • 🧠 Strong semantic grounding but limitations in spatial reasoning and planning

🚀 Method

Our approach consists of two phases:

  1. Expert Demonstration Phase: Train a privileged policy with full state access using PPO
  2. Behavioral Cloning Phase: Distill expert knowledge into a policy using only frozen VLM embeddings

The key insight is using frozen vision-language embeddings as drop-in representations without any fine-tuning, providing an empirical baseline for understanding foundation model capabilities in embodied tasks.
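
A rough sketch of what the learned policy looks like under this setup (the hidden layer sizes and the 2-D action space are illustrative assumptions, not the exact architecture from the paper): the behavioral-cloning policy is a small MLP head on top of a fixed-size frozen embedding, and no VLM weights are ever updated.

import torch
import torch.nn as nn

class EmbeddingPolicy(nn.Module):
    """Small MLP head mapping frozen VLM embeddings to robot actions.

    The VLM itself is not part of this module: its embeddings are precomputed
    and fed in as plain tensors, so no VLM weights are ever updated.
    """

    def __init__(self, embedding_dim: int = 1152, action_dim: int = 2, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),  # e.g. linear and angular velocity (assumed)
        )

    def forward(self, vlm_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(vlm_embedding)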

🛠️ Installation

Prerequisites

  • Python 3.8+
  • NVIDIA Isaac Sim/Isaac Lab
  • PyTorch
  • CUDA-compatible GPU

Setup

git clone https://github.com/oadamharoon/text2nav.git
cd text2nav

# Install dependencies
pip install torch torchvision
pip install transformers
pip install numpy matplotlib
pip install gymnasium

# For Isaac Lab simulation (follow official installation guide)
# https://isaac-sim.github.io/IsaacLab/
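
Optionally, a quick check (not part of the repository) to confirm that PyTorch and transformers import cleanly and that a CUDA device is visible:

python -c "import torch, transformers; print(torch.__version__, transformers.__version__, torch.cuda.is_available())"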

📁 Repository Structure

text2nav/
├── CITATION.cff           # Citation information
├── LICENSE                # MIT License
├── README.md              # This documentation
├── IsaacLab/              # Isaac Lab simulation environment setup
├── embeddings/            # Vision-language embedding generation
├── rl/                    # Reinforcement learning expert training
├── generate_embeddings.ipynb    # Generate VLM embeddings from demonstrations
├── revised_gen_embed.ipynb      # Revised embedding generation
├── train_offline.py             # Behavioral cloning training script
├── offlin_train.py              # Alternative offline training
├── bc_model.pt                  # Trained behavioral cloning model
├── td3_bc_model.pt            # TD3+BC baseline model
├── habitat_test.ipynb         # Testing and evaluation notebook
└── replay_buffer.py           # Data handling utilities

🎮 Usage

1. Expert Demonstration Collection

cd rl/
python train_expert.py --env isaac_sim --num_episodes 500

2. Generate VLM Embeddings

jupyter notebook generate_embeddings.ipynb
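
The notebook handles embedding generation end to end; the snippet below is only a rough sketch of what SigLIP feature extraction looks like with Hugging Face transformers. The checkpoint name, image file, and instruction text are assumptions, and how the image and text features are fused into the 1152-D policy input is defined in the notebook, not here.

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

model_name = "google/siglip-so400m-patch14-384"   # a SigLIP checkpoint with 1152-D features (assumed)
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

image = Image.open("frame_0000.png")              # hypothetical 256x256 RGB camera frame
instruction = "Navigate to the red sphere"        # example language goal

with torch.no_grad():
    img_inputs = processor(images=image, return_tensors="pt")
    txt_inputs = processor(text=[instruction], padding="max_length", return_tensors="pt")
    image_emb = model.get_image_features(**img_inputs)   # shape (1, 1152)
    text_emb = model.get_text_features(**txt_inputs)     # shape (1, 1152)

# image_emb and text_emb are the frozen features; their fusion into the
# policy input follows the procedure in generate_embeddings.ipynb.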

3. Train Navigation Policy

python train_offline.py --model siglip --embedding_dim 1152 --batch_size 32
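
For reference, the behavioral-cloning step reduces to a supervised regression loop like the sketch below. The dataset file, hyperparameters, and the inline MLP (mirroring the head sketched in the Method section) are illustrative assumptions, not the exact contents of train_offline.py.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dataset: frozen VLM embeddings paired with expert actions.
data = torch.load("embeddings.pt")
dataset = TensorDataset(data["embeddings"], data["actions"])
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Small MLP head on top of the frozen 1152-D embeddings (layer sizes assumed).
policy = nn.Sequential(
    nn.Linear(1152, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 2),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

for epoch in range(100):                      # epoch count and learning rate are assumptions
    for emb, expert_action in loader:
        loss = loss_fn(policy(emb), expert_action)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

torch.save(policy.state_dict(), "bc_model.pt")  # stand-in for the shipped checkpoint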

📊 Results

Model          Success Rate (%)   Avg. Steps   Path Length (× Expert)
Expert (πβ)    100.0              113.97       1.0×
SigLIP         74.0               369.4        3.2×
CLIP           62.0               417.6        3.7×
ViLT           40.0               472.0        4.1×

🔬 Experimental Setup

  • Environment: 3 m × 3 m arena in NVIDIA Isaac Sim
  • Robot: NVIDIA JetBot with RGB camera (256×256)
  • Task: Navigate to colored spheres based on language instructions
  • Targets: 5 colored spheres (red, green, blue, yellow, pink)
  • Success Criterion: Reach within 0.1 m of the correct target (written out as a quick check below)
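
For concreteness, the success criterion amounts to a simple distance check (the 2-D arena coordinates used here are hypothetical):

import numpy as np

def reached_target(robot_xy: np.ndarray, target_xy: np.ndarray, threshold: float = 0.1) -> bool:
    """Return True when the robot is within `threshold` meters of the target sphere."""
    return float(np.linalg.norm(robot_xy - target_xy)) < threshold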

💡 Key Insights

  1. Semantic Grounding: Pretrained VLMs excel at connecting language descriptions to visual observations
  2. Spatial Limitations: Frozen embeddings struggle with long-horizon planning and spatial reasoning
  3. Prompt Engineering: Including relative spatial cues significantly improves performance
  4. Embedding Dimensionality: Higher-dimensional embeddings (SigLIP: 1152D) outperform lower-dimensional ones

🔮 Future Work

  • Hybrid architectures combining frozen embeddings with lightweight spatial memory
  • Data-efficient adaptation techniques to bridge the efficiency gap
  • Testing in more complex environments with obstacles and natural language variation
  • Integration with world models for better spatial reasoning

📚 Citation

@misc{subedi2025pretrainedvisionlanguageembeddingsguide,
      title={Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation?}, 
      author={Nitesh Subedi and Adam Haroon and Shreyan Ganguly and Samuel T. K. Tetteh and Prajwal Koirala and Cody Fleming and Soumik Sarkar},
      year={2025},
      eprint={2506.14507},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2506.14507}, 
}

🙏 Acknowledgments

This work is funded by NSF-USDA COALESCE grant #2021-67021-34418. Special thanks to the Iowa State University Mechanical Engineering Department for their support.

👥 Contributors

*Equal contribution

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🔗 Links

  • Paper: https://arxiv.org/abs/2506.14507

For questions or issues, please open a GitHub issue or contact the authors.
