The exponential growth of visual data—ranging from images to PDFs to streaming videos—has made manual review and analysis virtually impossible. Organizations are struggling to transform this data into actionable insights at scale, leading to missed opportunities and increased risks.
To solve this challenge, vision-language models (VLMs) are emerging as powerful tools, combining visual perception of images and videos with text-based reasoning. Unlike traditional large language models (LLMs) that only process text, VLMs empower you to build visual AI agents that understand and act on complex multimodal data, enabling real-time decision-making and automation.
Imagine having an intelligent AI agent that can analyze remote camera footage to detect early signs of wildfires or scan business documents to extract critical information buried within charts, tables, and images—all autonomously.
With NVIDIA NIM microservices, building these advanced visual AI agents is easier and more efficient than ever. Offering flexible customization, streamlined API integration, and smooth deployment, NIM microservices enable you to create dynamic agents tailored to your unique business needs.
In this post, we guide you through the process of designing and building intelligent visual AI agents using NVIDIA NIM microservices. We introduce the different types of vision AI models available, share four sample applications—streaming video alerts, structured text extraction, multimodal search, and few-shot classification—and provide Jupyter notebooks to get you started. For more information about bringing these models to life, see the /NVIDIA/metropolis-nim-workflows GitHub repo.
Types of vision AI models
To build a robust visual AI agent, you have the following core types of vision models at your disposal:
- VLMs
- Embedding models
- Computer vision (CV) models
These models serve as essential building blocks for developing intelligent visual AI agents. While the VLM functions as the core engine of each agent, CV and embedding models can enhance its capabilities, whether by improving accuracy for tasks like object detection or parsing complex documents.
In this post, we use vision NIM microservices to access these models. Each vision NIM microservice can be easily integrated into your workflows through simple REST APIs, allowing for efficient model inference on text, images, and videos. To get started, you can experiment with hosted preview APIs on build.nvidia.com, without needing a local GPU.
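As a minimal sketch of that integration, the snippet below sends a single image and a prompt to a hosted VLM preview endpoint. It assumes an `NVIDIA_API_KEY` environment variable, and the endpoint and payload follow the NEVA-22B example shown on build.nvidia.com at the time of writing; check the model page for the exact request format.

```python
import base64
import os

import requests

# Assumptions: NVIDIA_API_KEY is set, and the endpoint/payload below match the
# NEVA-22B preview API example on build.nvidia.com; verify on the model page.
invoke_url = "https://ai.api.nvidia.com/v1/vlm/nvidia/neva-22b"
headers = {
    "Authorization": f"Bearer {os.environ['NVIDIA_API_KEY']}",
    "Accept": "application/json",
}

# Encode the input image as base64 so it can be embedded in the prompt.
with open("warehouse.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "messages": [{
        "role": "user",
        "content": f'Describe this scene. <img src="data:image/png;base64,{image_b64}" />',
    }],
    "max_tokens": 512,
    "temperature": 0.2,
}

response = requests.post(invoke_url, headers=headers, json=payload)
print(response.json()["choices"][0]["message"]["content"])
```

The other vision NIM microservices follow the same REST pattern; only the endpoint and payload fields differ.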
Vision language models
VLMs bring a new dimension to language models by adding vision capabilities, making them multimodal. These models can process images, videos, and text, enabling them to interpret visual data and generate text-based outputs. VLMs are versatile and can be fine-tuned for specific use cases or prompted for tasks such as Q&A based on visual inputs.
NVIDIA and its partners offer several VLMs as NIM microservices, each differing in size, latency, and capabilities (Table 1).
| Company | Model | Size | Description |
|---|---|---|---|
| NVIDIA | VILA | 40B | A powerful general-purpose model built on SigLIP and Yi that is suitable for nearly any use case. |
| NVIDIA | Neva | 22B | A medium-sized model combining NVGPT and CLIP and offering the functionality of much larger multimodal models. |
| Meta | Llama 3.2 | 90B/11B | The first vision-capable Llama model in two sizes, excelling in a range of vision-language tasks and supporting higher-resolution input. |
| Microsoft | phi-3.5-vision | 4.2B | A small, fast model that excels at OCR and is capable of processing multiple images. |
| Microsoft | Florence-2 | 0.7B | A multi-task model capable of captioning, object detection, and segmentation using simple text prompts. |
Embedding models
Embedding models convert input data (such as images or text) into dense feature-rich vectors known as embeddings. These embeddings encapsulate the essential properties and relationships within the data, enabling tasks like similarity search or classification. Embeddings are typically stored in vector databases where GPU-accelerated search can quickly retrieve relevant data.
Embedding models play a crucial role in creating intelligent agents. For example, they support retrieval-augmented generation (RAG) workflows, enabling agents to pull relevant information from diverse data sources and improve accuracy through in-context learning.
| Company | Model | Description | Use Cases |
|---|---|---|---|
| NVIDIA | NV-CLIP | Multimodal foundation model generating text and image embeddings | Multimodal search, zero-shot classification |
| NVIDIA | NV-DINOv2 | Vision foundation model generating high-resolution image embeddings | Similarity search, few-shot classification |
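To make the similarity-search idea concrete, here is a minimal, self-contained sketch that uses placeholder vectors standing in for real NV-CLIP or NV-DINOv2 embeddings. The workflows in the repo store embeddings in a Milvus vector database with accelerated search rather than an in-memory dictionary.

```python
import numpy as np

# Placeholder embeddings: in a real workflow these come from an embedding NIM
# microservice (for example, NV-CLIP or NV-DINOv2) and live in a vector database.
rng = np.random.default_rng(0)
index = {f"image_{i}.jpg": rng.normal(size=512) for i in range(100)}
query = rng.normal(size=512)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Brute-force nearest-neighbor search; a vector database replaces this loop
# with an accelerated index at scale.
best = max(index, key=lambda name: cosine_similarity(query, index[name]))
print("Most similar stored item:", best)
```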
Computer vision models
CV models focus on specialized tasks like image classification, object detection, and optical character recognition (OCR). These models can augment VLMs by adding detailed metadata, improving the overall intelligence of AI agents.
| Company | Model | Description | Use Cases |
|---|---|---|---|
| NVIDIA | Grounding Dino | Open-vocabulary object detection | Detect anything |
| NVIDIA | OCDRNet | Optical character detection and recognition | Document parsing |
| NVIDIA | ChangeNet | Detects pixel-level changes between two images | Defect detection, satellite imagery analysis |
| NVIDIA | Retail Object Detection | Pretrained to detect common retail items | Loss prevention |
Build visual AI agents with vision NIM microservices
Here are real-world examples of how the vision NIM microservices can be applied to create powerful visual AI agents.
To make application development with NVIDIA NIM microservices more accessible, we have published a collection of examples on GitHub. These examples demonstrate how to use the NIM APIs and integrate them into your applications. Each example includes a Jupyter notebook tutorial and a demo that can be easily launched, even without a GPU.
On the NVIDIA API Catalog, select a model page, such as Llama 3.1 405B. Choose Get API Key and enter your business email for a 90-day NVIDIA AI Enterprise license, or use your personal email to access NIM through the NVIDIA Developer Program.
On the /NVIDIA/metropolis-nim-workflows GitHub repo, explore the Jupyter notebook tutorials and demos. These workflows showcase how vision NIM microservices can be combined with other components, like vector databases and LLMs, to build powerful AI agents that solve real-world problems. With your API key, you can easily recreate the workflows showcased in this post, giving you hands-on experience with vision NIM microservices.
Here are a few example workflows:
- VLM streaming video alerts agent
- Structured text extraction agent
- Few-shot classification with NV-DINOv2 agent
- Multimodal search with NV-CLIP agent
VLM streaming video alerts agent
With vast amounts of video data generated every second, it’s impossible to manually review footage for key events like package deliveries, forest fires, or unauthorized access.
This workflow shows how to use VLMs, Python, and OpenCV to build an AI agent that autonomously monitors live streams for user-defined events. When an event is detected, an alert is generated, saving countless hours of manual video review. Thanks to the flexibility of VLMs, new events can be detected simply by changing the prompt, with no need to build and train a custom CV model for each new scenario.
In Figure 2, the VLM runs in the cloud while the video streaming pipeline operates locally. This setup enables the demo to run on almost any hardware, with the heavy computation offloaded to the cloud through NIM microservices.
Here are the steps for building this agent (a minimal code sketch follows the list):
- Load and process the video stream: Use OpenCV to load a video stream or file, decode it, and subsample frames.
- Create REST API endpoints: Use FastAPI to create control REST API endpoints where users can input custom prompts.
- Integrate with the VLM API: A wrapper class handles interactions with the VLM API by sending video frames and user prompts. It forms the NIM API requests and parses the response.
- Overlay responses on video: The VLM response is overlaid onto the input video and streamed out using OpenCV for real-time viewing.
- Trigger alerts: Send the parsed response over a WebSocket server to integrate with other services, triggering notifications based on detected events.
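The sketch below ties these steps together in a minimal form: OpenCV reads and subsamples frames, each sampled frame is sent to a hosted VLM with the user's prompt, and a simple check on the reply triggers an alert. The endpoint, payload, and response fields reuse the preview-API pattern shown earlier and are assumptions to verify against the model page; the video overlay, FastAPI control endpoints, and WebSocket alert delivery are omitted for brevity.

```python
import base64
import os

import cv2
import requests

# Assumptions: NVIDIA_API_KEY is set, and the VILA endpoint, payload, and
# response format below match the example on build.nvidia.com; verify there.
INVOKE_URL = "https://ai.api.nvidia.com/v1/vlm/nvidia/vila"
HEADERS = {
    "Authorization": f"Bearer {os.environ['NVIDIA_API_KEY']}",
    "Accept": "application/json",
}
PROMPT = "Is there fire or smoke in this scene? Answer yes or no, then explain."

cap = cv2.VideoCapture("wildfire_cam.mp4")  # or an RTSP stream URL
frame_id = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    frame_id += 1
    if frame_id % 30:  # subsample: evaluate roughly one frame per second
        continue
    _, jpg = cv2.imencode(".jpg", frame)
    image_b64 = base64.b64encode(jpg.tobytes()).decode()
    payload = {
        "messages": [{
            "role": "user",
            "content": f'{PROMPT} <img src="data:image/jpeg;base64,{image_b64}" />',
        }],
        "max_tokens": 256,
    }
    reply = requests.post(INVOKE_URL, headers=HEADERS, json=payload).json()
    answer = reply["choices"][0]["message"]["content"]
    if answer.strip().lower().startswith("yes"):
        print(f"ALERT at frame {frame_id}: {answer}")  # hook for WebSocket/notifications
cap.release()
```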
For more information about building a VLM-powered streaming video alert agent, see the /NVIDIA/metropolis-nim-workflows notebook tutorial and demo on GitHub. You can experiment with different VLM NIM microservices to find the best model for your use case.
For more information about how VLMs can transform edge applications with NVIDIA Jetson and Jetson Platform Services, see Develop Generative AI-Powered Visual AI Agents for the Edge and explore additional resources on the Jetson Platform Services page.
Structured text extraction agent
Many business documents are stored as images rather than searchable formats like PDFs. This presents a significant challenge when it comes to searching and processing these documents, often requiring manual review, tagging, and organizing.
While optical character detection and recognition (OCDR) models have been around for a while, they often return cluttered results that fail to retain the original formatting or to interpret the document's visual data. This becomes especially challenging when working with documents in irregular formats, such as photo IDs, which come in various shapes and sizes.
Traditional CV models make processing such documents time-consuming and costly. However, by combining the flexibility of VLMs and LLMs with the precision of OCDR models, you can build a powerful text-extraction pipeline to autonomously parse documents and store user-defined fields in a database.
Here are the steps for building the structured text-extraction pipeline (a condensed sketch follows the list):
- Document input: Provide an image of the document to an OCDR model, such as OCDRNet or Florence, which returns metadata for all the detected characters in the document.
- VLM integration: The VLM processes the user’s prompt specifying the desired fields and analyzes the document. It uses the detected characters from the OCDR model to generate a more accurate response.
- LLM formatting: The response of the VLM is passed to an LLM, which formats the data into JSON, presenting it as a table.
- Output and storage: The extracted fields are now in a structured format, ready to be inserted into a database or stored for future use.
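Here is a condensed sketch of that chain. The `run_ocdrnet` and `run_vlm` helpers are hypothetical stand-ins for the NIM calls described above, and the final step uses the OpenAI-compatible endpoint exposed by the hosted API catalog; the endpoint and model name are assumptions to verify on build.nvidia.com.

```python
import os

from openai import OpenAI

# Hypothetical wrappers (not implemented here): run_ocdrnet() would call the
# OCDRNet NIM microservice and return detected text with bounding boxes, and
# run_vlm() would send the document image plus that metadata to a VLM.
def run_ocdrnet(image_path: str) -> str: ...
def run_vlm(image_path: str, ocr_metadata: str, fields: list[str]) -> str: ...

FIELDS = ["full name", "date of birth", "document number"]
ocr_metadata = run_ocdrnet("photo_id.png")
vlm_answer = run_vlm("photo_id.png", ocr_metadata, FIELDS)

# Format the free-form VLM answer into strict JSON with an LLM. The base URL
# and model name follow the OpenAI-compatible pattern used by the API catalog.
client = OpenAI(base_url="https://integrate.api.nvidia.com/v1",
                api_key=os.environ["NVIDIA_API_KEY"])
completion = client.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",
    messages=[
        {"role": "system",
         "content": f"Return only a JSON object with the keys {FIELDS}."},
        {"role": "user", "content": vlm_answer},
    ],
    temperature=0.0,
)
print(completion.choices[0].message.content)  # ready to insert into a database
```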
The preview APIs make it easy to experiment by combining multiple models to build complex pipelines. From the demo UI, you can switch between the different VLM, OCDR, and LLM models available on build.nvidia.com for quick experimentation.
Few-shot classification with NV-DINOv2
NV-DINOv2 generates embeddings from high-resolution images, making it ideal for tasks requiring detailed analysis, such as defect detection with only a few sample images. This workflow demonstrates how to build a scalable few-shot classification pipeline using NV-DINOv2 and a Milvus vector database.
Here is how the few-shot classification pipeline works (a minimal sketch follows the list):
- Define classes and upload samples: Users define classes and upload a few sample images for each. NV-DINOv2 generates embeddings from these images, which are then stored in a Milvus vector database along with the class labels.
- Predict new classes: When a new image is uploaded, NV-DINOv2 generates its embedding, which is compared with the stored embeddings in the vector database. The closest neighbors are identified using the k-nearest neighbors (k-NN) algorithm, and the majority class among them is predicted.
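The prediction step boils down to a k-NN majority vote over stored embeddings. The sketch below shows that logic with toy vectors standing in for NV-DINOv2 embeddings; in the actual workflow, the vectors live in Milvus and the nearest-neighbor search runs on its index.

```python
from collections import Counter

import numpy as np

def knn_predict(query_emb, support_embs, support_labels, k=5):
    """Majority vote among the k most similar stored embeddings (cosine similarity)."""
    sims = support_embs @ query_emb / (
        np.linalg.norm(support_embs, axis=1) * np.linalg.norm(query_emb))
    top_k = np.argsort(-sims)[:k]
    return Counter(support_labels[i] for i in top_k).most_common(1)[0][0]

# Toy vectors standing in for NV-DINOv2 embeddings of the labeled sample images.
rng = np.random.default_rng(0)
prototypes = {"scratch": rng.normal(size=1024), "no_defect": rng.normal(size=1024)}
support_labels = ["scratch"] * 6 + ["no_defect"] * 6
support_embs = np.stack(
    [prototypes[label] + 0.1 * rng.normal(size=1024) for label in support_labels])

# A query image whose embedding lands near the "scratch" samples.
query_emb = prototypes["scratch"] + 0.1 * rng.normal(size=1024)
print(knn_predict(query_emb, support_embs, support_labels, k=3))  # prints "scratch"
```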
Multimodal search with NV-CLIP
NV-CLIP offers a unique advantage: the ability to embed both text and images, enabling multimodal search. By converting text and image inputs into embeddings within the same vector space, NV-CLIP facilitates the retrieval of images that match a given text query. This enables highly flexible and accurate search results.
In this workflow, users upload a folder of images, which are embedded and stored in a vector database. Using the UI, they can type a query, and NV-CLIP retrieves the most similar images based on the input text.
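A compact sketch of that flow is shown below. It assumes the NV-CLIP preview API accepts both text strings and base64 image data URIs through an OpenAI-style embeddings endpoint, as its model page shows at the time of writing; verify the exact endpoint, payload, and model name on build.nvidia.com. The repo's workflow stores the image embeddings in a Milvus vector database rather than ranking them in memory.

```python
import base64
import os

import numpy as np
import requests

# Assumptions: NVIDIA_API_KEY is set, and NV-CLIP is reachable through this
# OpenAI-style embeddings endpoint with this model name; check the model page.
URL = "https://integrate.api.nvidia.com/v1/embeddings"
HEADERS = {"Authorization": f"Bearer {os.environ['NVIDIA_API_KEY']}"}

def nvclip_embed(inputs):
    """Embed a list of text strings and/or image data URIs in one shared space."""
    payload = {"model": "nvidia/nvclip", "input": inputs, "encoding_format": "float"}
    data = requests.post(URL, headers=HEADERS, json=payload).json()["data"]
    return np.array([item["embedding"] for item in data])

def to_data_uri(path):
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

# Embed the image gallery once (stored in a vector database in the real workflow).
image_paths = ["dock.jpg", "warehouse.jpg", "forklift.jpg"]
image_embs = nvclip_embed([to_data_uri(p) for p in image_paths])

# Embed the text query in the same space and rank images by cosine similarity.
query_emb = nvclip_embed(["a worker not wearing a safety vest"])[0]
sims = image_embs @ query_emb / (
    np.linalg.norm(image_embs, axis=1) * np.linalg.norm(query_emb))
for path, sim in sorted(zip(image_paths, sims), key=lambda x: -x[1]):
    print(f"{sim:.3f}  {path}")
```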
More advanced agents can be built using this approach with VLMs to create multimodal RAG workflows, enabling visual AI agents to build on past experiences and improve responses.
Get started with visual AI agents today
Ready to dive in and start building your own visual AI agents? Use the code provided in the /NVIDIA/metropolis-nim-workflows GitHub repo as a foundation to develop your own custom workflows and AI solutions powered by NIM microservices. Let the examples inspire new applications that solve your specific challenges.
For any technical questions or support, join our community and engage with experts in the NVIDIA Visual AI Agent forum.