Build Multimodal Visual AI Agents Powered by NVIDIA NIM

The exponential growth of visual data—ranging from images to PDFs to streaming videos—has made manual review and analysis virtually impossible. Organizations are struggling to transform this data into actionable insights at scale, leading to missed opportunities and increased risks.

To solve this challenge, vision-language models (VLMs) are emerging as powerful tools, combining visual perception of images and videos with text-based reasoning. Unlike traditional large language models (LLMs) that only process text, VLMs empower you to build visual AI agents that understand and act on complex multimodal data, enabling real-time decision-making and automation.

Imagine having an intelligent AI agent that can analyze remote camera footage to detect early signs of wildfires or scan business documents to extract critical information buried within charts, tables, and images—all autonomously.

With NVIDIA NIM microservices, building these advanced visual AI agents is easier and more efficient than ever. Offering flexible customization, streamlined API integration, and smooth deployment, NIM microservices enable you to create dynamic agents tailored to your unique business needs.

In this post, we guide you through the process of designing and building intelligent visual AI agents using NVIDIA NIM microservices. We introduce the different types of vision AI models available, share four sample applications—streaming video alerts, structured text extraction, multimodal search, and few-shot classification—and provide Jupyter notebooks to get you started. For more information about bringing these models to life, see the /NVIDIA/metropolis-nim-workflows GitHub repo. 

Types of vision AI models

To build a robust visual AI agent, you have the following core types of vision models at your disposal:

  • VLMs
  • Embedding models
  • Computer vision (CV) models

These models serve as essential building blocks for developing intelligent visual AI agents. While the VLM functions as the core engine of each agent, CV and embedding models can enhance its capabilities, whether by improving accuracy for tasks like object detection or parsing complex documents.

In this post, we use vision NIM microservices to access these models. Each vision NIM microservice can be easily integrated into your workflows through simple REST APIs, allowing for efficient model inference on text, images, and videos. To get started, you can experiment with hosted preview APIs on build.nvidia.com, without needing a local GPU. 
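
For example, a hosted VLM NIM microservice can be called with a few lines of Python. The sketch below is a minimal illustration using the requests library; the endpoint URL, model choice, and the inline base64 image convention follow the general pattern of the preview APIs and vary by model, so treat them as placeholders and copy the exact request sample from the model page on build.nvidia.com.

```python
# Minimal sketch: query a hosted VLM NIM preview API with an image and a prompt.
# The endpoint and payload fields are examples; check the model page for the
# exact request format for the model you choose.
import base64
import os

import requests

api_key = os.environ["NVIDIA_API_KEY"]  # generated on build.nvidia.com
invoke_url = "https://ai.api.nvidia.com/v1/vlm/nvidia/neva-22b"  # example VLM endpoint

with open("sample.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "messages": [{
        "role": "user",
        # The image is passed inline as a base64 data URL alongside the prompt.
        "content": f'Describe this image. <img src="data:image/jpeg;base64,{image_b64}" />',
    }],
    "max_tokens": 512,
    "temperature": 0.2,
}
headers = {"Authorization": f"Bearer {api_key}", "Accept": "application/json"}

response = requests.post(invoke_url, headers=headers, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```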

GIF shows the llama-3.2-vision-90b model summarizing an image. 
Figure 1. The llama-3.2-vision-90b model on build.nvidia.com

Vision language models

VLMs bring a new dimension to language models by adding vision capabilities, making them multimodal. These models can process images, videos, and text, enabling them to interpret visual data and generate text-based outputs. VLMs are versatile and can be fine-tuned for specific use cases or prompted for tasks such as Q&A based on visual inputs. 

NVIDIA and its partners offer several VLMs as NIM microservices, each differing in size, latency, and capabilities (Table 1).

| Company | Model | Size | Description |
| --- | --- | --- | --- |
| NVIDIA | VILA | 40B | A powerful general-purpose model built on SigLIP and Yi that is suitable for nearly any use case. |
| NVIDIA | Neva | 22B | A medium-sized model combining NVGPT and CLIP and offering the functionality of much larger multimodal models. |
| Meta | Llama 3.2 | 90B/11B | The first vision-capable Llama model, available in two sizes, excelling in a range of vision-language tasks and supporting higher-resolution input. |
| Microsoft | phi-3.5-vision | 4.2B | A small, fast model that excels at OCR and is capable of processing multiple images. |
| Microsoft | Florence-2 | 0.7B | A multi-task model capable of captioning, object detection, and segmentation using simple text prompts. |

Table 1. VLM NIM microservices

Embedding models

Embedding models convert input data (such as images or text) into dense feature-rich vectors known as embeddings. These embeddings encapsulate the essential properties and relationships within the data, enabling tasks like similarity search or classification. Embeddings are typically stored in vector databases where GPU-accelerated search can quickly retrieve relevant data. 

Embedding models play a crucial role in creating intelligent agents. For example, they support retrieval-augmented generation (RAG) workflows, enabling agents to pull relevant information from diverse data sources and improve accuracy through in-context learning. 

| Company | Model | Description | Use Cases |
| --- | --- | --- | --- |
| NVIDIA | NV-CLIP | Multimodal foundation model generating text and image embeddings | Multimodal search, zero-shot classification |
| NVIDIA | NV-DINOv2 | Vision foundation model generating high-resolution image embeddings | Similarity search, few-shot classification |

Table 2. Embedding NIM microservices
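
As a small illustration of how an embedding model enables zero-shot classification (one of the NV-CLIP use cases in Table 2), the sketch below embeds candidate class names as text, embeds an image, and picks the most similar class. The embed function is a hypothetical stand-in for the NIM call, stubbed with random unit vectors so the flow runs as written.

```python
# Zero-shot classification sketch with a shared text/image embedding space.
import numpy as np


def embed(item: str) -> np.ndarray:
    """Hypothetical stand-in for an NV-CLIP NIM call (accepts text or an image path)."""
    rng = np.random.default_rng(abs(hash(item)) % (2**32))
    v = rng.normal(size=1024)
    return v / np.linalg.norm(v)  # unit-normalize so dot product equals cosine similarity


# Candidate classes are expressed as short text prompts and embedded once.
classes = ["a photo of a forklift", "a photo of a pallet", "a photo of a person"]
class_vecs = {c: embed(c) for c in classes}

# Embed the image and pick the class whose text embedding is most similar.
image_vec = embed("warehouse_frame.jpg")
prediction = max(class_vecs, key=lambda c: float(np.dot(image_vec, class_vecs[c])))
print(prediction)
```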

Computer vision models

CV models focus on specialized tasks like image classification, object detection, and optical character recognition (OCR). These models can augment VLMs by adding detailed metadata, improving the overall intelligence of AI agents. 

| Company | Model | Description | Use Cases |
| --- | --- | --- | --- |
| NVIDIA | Grounding DINO | Open-vocabulary object detection | Detect anything |
| NVIDIA | OCDRNet | Optical character detection and recognition | Document parsing |
| NVIDIA | ChangeNet | Detects pixel-level changes between two images | Defect detection, satellite imagery analysis |
| NVIDIA | Retail Object Detection | Pretrained to detect common retail items | Loss prevention |

Table 3. Computer vision NIM microservices

Build visual AI agents with vision NIM microservices

Here are real-world examples of how the vision NIM microservices can be applied to create powerful visual AI agents. 

To make application development with NVIDIA NIM microservices more accessible, we have published a collection of examples on GitHub. These examples demonstrate how to use the NIM APIs and integrate them into your own applications. Each example includes a Jupyter notebook tutorial and demo that can be easily launched, even without GPUs.

On the NVIDIA API Catalog, select a model page, such as Llama 3.1 405B. Choose Get API Key and enter your business email for a 90-day NVIDIA AI Enterprise license, or use your personal email to access NIM through the NVIDIA Developer Program.

On the /NVIDIA/metropolis-nim-workflows GitHub repo, explore the Jupyter notebook tutorials and demos. These workflows showcase how vision NIM microservices can be combined with other components, like vector databases and LLMs, to build powerful AI agents that solve real-world problems. With your API key, you can easily recreate the workflows presented in this post, giving you hands-on experience with vision NIM microservices.

Here are a few example workflows:

VLM streaming video alerts agent

With vast amounts of video data generated every second, it’s impossible to manually review footage for key events like package deliveries, forest fires, or unauthorized access. 

This workflow shows how to use VLMs, Python, and OpenCV to build an AI agent that autonomously monitors live streams for user-defined events. When an event is detected, an alert is generated, saving countless hours of manual video review. Thanks to the flexibility of VLMs, new events can be detected simply by changing the prompt, with no need to build and train a custom CV model for each new scenario.

Video 1. Visual AI Agent Powered by NVIDIA NIM

In Figure 2, the VLM runs in the cloud while the video streaming pipeline operates locally. This setup enables the demo to run on almost any hardware, with the heavy computation offloaded to the cloud through NIM microservices. 

An architecture diagram shows a video stream feeding a frame decode and subsampling step, while a user-defined alert prompt creates a request to the VLM NIM microservice. The response is parsed and passed to overlay generation and to a WebSocket server for the alert notification.
Figure 2. Streaming video alert agent architecture

Here are the steps for building this agent; a minimal sketch of the core loop follows the list:

  1. Load and process the video stream: Use OpenCV to load a video stream or file, decode it, and subsample frames.
  2. Create REST API endpoints: Use FastAPI to create control REST API endpoints where users can input custom prompts.
  3. Integrate with the VLM API: A wrapper class handles interactions with the VLM API by sending video frames and user prompts. It forms the NIM API requests and parses the response. 
  4. Overlay responses on video: The VLM response is overlaid onto the input video, streamed out using OpenCV for real-time viewing. 
  5. Trigger alerts: Send the parsed response over a WebSocket server to integrate with other services, triggering notifications based on detected events. 
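
The sketch below covers steps 1 and 3, assuming the same hosted VLM preview endpoint pattern shown earlier; the endpoint, prompt, and alert check are examples, and the FastAPI control endpoints, overlay generation, and WebSocket alerting are omitted for brevity.

```python
# Frame-sampling loop sketch: subsample a video stream and query a VLM NIM.
import base64
import os

import cv2
import requests

INVOKE_URL = "https://ai.api.nvidia.com/v1/vlm/nvidia/neva-22b"  # example VLM endpoint
HEADERS = {"Authorization": f"Bearer {os.environ['NVIDIA_API_KEY']}",
           "Accept": "application/json"}
PROMPT = "Is there fire or smoke in this scene? Answer yes or no, then explain briefly."
SAMPLE_EVERY_N_FRAMES = 30  # roughly one query per second on a 30 FPS stream


def query_vlm(frame, prompt: str) -> str:
    """JPEG-encode a frame and send it with the prompt to the VLM NIM API."""
    _, jpeg = cv2.imencode(".jpg", frame)
    image_b64 = base64.b64encode(jpeg.tobytes()).decode()
    payload = {
        "messages": [{
            "role": "user",
            "content": f'{prompt} <img src="data:image/jpeg;base64,{image_b64}" />',
        }],
        "max_tokens": 128,
    }
    r = requests.post(INVOKE_URL, headers=HEADERS, json=payload, timeout=60)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]


cap = cv2.VideoCapture("rtsp://camera/stream")  # or a local video file path
frame_id = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    frame_id += 1
    if frame_id % SAMPLE_EVERY_N_FRAMES != 0:
        continue  # subsample frames to keep API traffic and latency manageable
    answer = query_vlm(frame, PROMPT)
    if answer.strip().lower().startswith("yes"):
        print(f"ALERT at frame {frame_id}: {answer}")  # replace with a WebSocket notification
cap.release()
```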

For more information about building a VLM-powered streaming video alert agent, see the /NVIDIA/metropolis-nim-workflows notebook tutorial and demo on GitHub. You can experiment with different VLM NIM microservices to find the best model for your use case. 

For more information about how VLMs can transform edge applications with NVIDIA Jetson and Jetson Platform Services, see Develop Generative AI-Powered Visual AI Agents for the Edge and explore additional resources on the Jetson Platform Services page.

Structured text extraction agent

Many business documents are stored as images rather than searchable formats like PDFs. This presents a significant challenge when it comes to searching and processing these documents, often requiring manual review, tagging, and organizing. 

While optical character detection and recognition (OCDR) models have been around for a while, they often return cluttered results that fail to retain the original formatting or to interpret the document's visual data. This becomes especially challenging when working with documents in irregular formats, such as photo IDs, which come in various shapes and sizes.

Traditional CV models make processing such documents time-consuming and costly. However, by combining the flexibility of VLMs and LLMs with the precision of OCDR models, you can build a powerful text-extraction pipeline to autonomously parse documents and store user-defined fields in a database. 

An architecture diagram shows requests and responses flowing to and from the OCDR, VLM, and LLM NIM microservices, plus the steps of combining the OCDR metadata with the prompt, the LLM formatting prompt, and the parsing of the formatted response.
Figure 3. Structured text extraction agent architecture

Here are the structured text-extraction pipeline building steps; a sketch of the LLM formatting step follows the list:

  1. Document input: Provide an image of the document to an OCDR model, such as OCDRNet or Florence, which returns metadata for all the detected characters in the document. 
  2. VLM integration: The VLM processes the user’s prompt specifying the desired fields and analyzes the document. It uses the detected characters from the OCDR model to generate a more accurate response. 
  3. LLM formatting: The response of the VLM is passed to an LLM, which formats the data into JSON, presenting it as a table. 
  4. Output and storage: The extracted fields are now in a structured format, ready to be inserted into a database or stored for future use. 
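
The sketch below illustrates step 3 only: handing the VLM's answer plus the user-defined fields to an LLM and asking for JSON. It assumes the OpenAI-compatible endpoint exposed by the NVIDIA API catalog; the model name is an example, and the vlm_response string stands in for the output of the OCDR and VLM steps.

```python
# LLM formatting sketch: convert free-text extraction results into JSON
# limited to user-defined fields.
import json
import os

from openai import OpenAI

client = OpenAI(base_url="https://integrate.api.nvidia.com/v1",
                api_key=os.environ["NVIDIA_API_KEY"])

fields = ["name", "date_of_birth", "id_number", "expiration_date"]
vlm_response = "The ID belongs to Jane Doe, born 1990-04-12, ID A1234567, expires 2030-01-01."

completion = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # example LLM NIM; any instruct model works
    messages=[
        {"role": "system",
         "content": "Return only a JSON object with exactly these keys: " + ", ".join(fields)},
        {"role": "user", "content": vlm_response},
    ],
    temperature=0.0,
)

# In practice, validate the JSON and retry on parse errors before inserting
# the record into a database.
record = json.loads(completion.choices[0].message.content)
print(record)
```
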
Screenshots show the pipeline extracting text from a photo ID, including the model selections for the VLM, OCDR, and LLM microservices, the user-defined fields, and the structured output filled in with the information from the ID.
Figure 4. Structured text extraction example with vision NIM microservices

The preview APIs make it easy to experiment by combining multiple models to build complex pipelines. From the demo UI, you can switch between the different VLM, OCDR, and LLM models available on build.nvidia.com for quick experimentation.

Few-shot classification with NV-DINOv2 

NV-DINOv2 generates embeddings from high-resolution images, making it ideal for tasks requiring detailed analysis, such as defect detection with only a few sample images. This workflow demonstrates how to build a scalable few-shot classification pipeline using NV-DINOv2 and a Milvus vector database. 

An architecture diagram shows how few-shot examples are embedded and stored, and how new images are classified at inference time.
Figure 5. Few-shot classification with NV-DINOv2

Here is how the few-shot classification pipeline works; a short sketch follows the list:

  1. Define classes and upload samples: Users define classes and upload a few sample images for each. NV-DINOv2 generates embeddings from these images, which are then stored in a Milvus vector database along with the class labels. 
  2. Predict new classes: When a new image is uploaded, NV-DINOv2 generates its embedding, which is compared with the stored embeddings in the vector database. The closest neighbors are identified using the k-nearest neighbors (k-NN) algorithm, and the majority class among them is predicted. 
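
The sketch below walks through both steps with Milvus Lite (pymilvus) and a k-nearest-neighbor majority vote. The embed_image function is a hypothetical stand-in for the NV-DINOv2 NIM call, stubbed so the flow runs end to end, and the embedding dimension is a placeholder that depends on the model you use.

```python
# Few-shot classification sketch: store labeled sample embeddings in Milvus,
# then classify a new image by majority vote over its k nearest neighbors.
from collections import Counter

import numpy as np
from pymilvus import MilvusClient

DIM = 1024  # placeholder; use the embedding dimension of your model


def embed_image(path: str) -> list[float]:
    """Hypothetical stand-in for an NV-DINOv2 NIM call."""
    rng = np.random.default_rng(abs(hash(path)) % (2**32))
    return rng.normal(size=DIM).tolist()


client = MilvusClient("few_shot_demo.db")  # Milvus Lite: local, file-backed instance
if client.has_collection("samples"):
    client.drop_collection("samples")
client.create_collection(collection_name="samples", dimension=DIM)

# 1. Define classes and insert a few embedded sample images per class.
samples = [("good_1.jpg", "good"), ("good_2.jpg", "good"),
           ("defect_1.jpg", "defect"), ("defect_2.jpg", "defect")]
client.insert(collection_name="samples",
              data=[{"id": i, "vector": embed_image(p), "label": lbl}
                    for i, (p, lbl) in enumerate(samples)])

# 2. Predict: embed the new image, retrieve the k nearest samples, majority vote.
k = 3
hits = client.search(collection_name="samples",
                     data=[embed_image("new_part.jpg")],
                     limit=k,
                     output_fields=["label"])[0]
labels = [hit["entity"]["label"] for hit in hits]
print(Counter(labels).most_common(1)[0][0])  # predicted class
```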

Multimodal search with NV-CLIP

NV-CLIP offers a unique advantage: the ability to embed both text and images, enabling multimodal search. By converting text and image inputs into embeddings within the same vector space, NV-CLIP facilitates the retrieval of images that match a given text query. This enables highly flexible and accurate search results. 

GIF shows a search for school bus images using natural language prompts with NV-CLIP. 
Figure 6. Multimodal search (image and text) with NV-CLIP

In this workflow, users upload a folder of images, which are embedded and stored in a vector database. Using the UI, they can type a query, and NV-CLIP retrieves the most similar images based on the input text. 
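
A compact sketch of that text-to-image retrieval loop is shown below. The nvclip_embed function is a hypothetical stand-in for the NV-CLIP NIM call that accepts either text or an image path, stubbed with random unit vectors so the example runs as written; in practice, the image embeddings would live in a vector database rather than a Python dictionary.

```python
# Multimodal search sketch: rank images against a text query in a shared
# embedding space.
import numpy as np


def nvclip_embed(item: str) -> np.ndarray:
    """Hypothetical stand-in for an NV-CLIP NIM call (text or image path)."""
    rng = np.random.default_rng(abs(hash(item)) % (2**32))
    v = rng.normal(size=1024)
    return v / np.linalg.norm(v)


# Index a folder of images once (store these in a vector database in practice).
image_paths = ["bus_01.jpg", "truck_02.jpg", "bike_03.jpg"]
image_index = {p: nvclip_embed(p) for p in image_paths}

# Embed the text query into the same space and rank images by cosine similarity.
query = nvclip_embed("a yellow school bus")
best = max(image_index, key=lambda p: float(np.dot(query, image_index[p])))
print(best)
```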

More advanced agents can be built using this approach with VLMs to create multimodal RAG workflows, enabling visual AI agents to build on past experiences and improve responses. 

Get started with visual AI agents today 

Ready to dive in and start building your own visual AI agents? Use the code provided in the /NVIDIA/metropolis-nim-workflows GitHub repo as a foundation to develop your own custom workflows and AI solutions powered by NIM microservices. Let the examples inspire new applications that solve your specific challenges.

For any technical questions or support, join our community and engage with experts in the NVIDIA Visual AI Agent forum.
