otel-gpu-collector

OpenTelemetry GPU Collector is a lightweight, efficient collector that gathers GPU performance metrics and sends them to an OpenTelemetry-compatible endpoint for monitoring and observability. It is particularly useful for monitoring GPUs in high-performance computing environments and in AI/ML and LLM workloads.

⚡ Features

  • Collects detailed GPU performance metrics
  • OpenTelemetry-native
  • Lightweight and efficient
  • Supports NVIDIA and AMD GPUs

🚀 Getting Started with GPU Monitoring

Prerequisites

  • Docker installed on your system

Step 1: Pull the Docker Image

You can quickly start using the OTel GPU Collector by pulling the Docker image:

docker pull ghcr.io/openlit/otel-gpu-collector:latest

Step 2: Run the Container

Here's a quick example showing how to run the container with the required environment variables:

docker run --gpus all \
    -e GPU_APPLICATION_NAME='chatbot' \
    -e GPU_ENVIRONMENT='staging' \
    -e OTEL_EXPORTER_OTLP_ENDPOINT="YOUR_OTEL_ENDPOINT" \
    -e OTEL_EXPORTER_OTLP_HEADERS="YOUR_OTEL_HEADERS" \
    ghcr.io/openlit/otel-gpu-collector:latest

Note: If you've deployed OpenLIT using Docker Compose, make sure to use the host's IP address or add OTel GPU Collector to the Docker Compose file:

Docker Compose: add the following config under `services`:

otel-gpu-collector:
  image: ghcr.io/openlit/otel-gpu-collector:latest
  environment:
    GPU_APPLICATION_NAME: 'chatbot'
    GPU_ENVIRONMENT: 'staging'
    OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector:4318"
  device_requests:
    - driver: nvidia
      count: all
      capabilities: [gpu]
  depends_on:
    - otel-collector
  restart: always

Host IP: use the host's IP address to reach the OTel Collector, for example:

OTEL_EXPORTER_OTLP_ENDPOINT="http://192.168.10.15:4318"
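Whichever form you use, the endpoint must be a valid OTLP/HTTP URL. As a quick sanity check before launching the container, the value can be parsed with Python's standard library; this is an illustrative sketch, not part of the collector, and the function name is made up for the example:

```python
from urllib.parse import urlparse

def check_otlp_endpoint(endpoint: str) -> str:
    """Validate an OTLP/HTTP endpoint URL and return it unchanged if it looks usable."""
    parsed = urlparse(endpoint)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unexpected scheme: {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("endpoint is missing a hostname")
    # OTLP over HTTP conventionally listens on port 4318.
    if parsed.port != 4318:
        print(f"warning: port {parsed.port} differs from the default OTLP/HTTP port 4318")
    return endpoint

# Works for both the Docker Compose service name and a host IP.
check_otlp_endpoint("http://otel-collector:4318")
check_otlp_endpoint("http://192.168.10.15:4318")
```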

Environment Variables

OTel GPU Collector supports several environment variables for configuration. Below is a table that describes each variable:

Environment Variable          Description                                          Default Value
GPU_APPLICATION_NAME          Name of the application running on the GPU           default_app
GPU_ENVIRONMENT               Environment name (e.g., staging, production)         production
OTEL_EXPORTER_OTLP_ENDPOINT   OpenTelemetry OTLP endpoint URL                      (required)
OTEL_EXPORTER_OTLP_HEADERS    Headers for authenticating with the OTLP endpoint    (ignore if using OpenLIT)
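To make the defaults and the required variable concrete, here is an illustrative Python sketch of how the documented variables could be read; this mirrors the table above but is not the collector's actual implementation:

```python
import os

def load_config(env=os.environ):
    """Read the documented environment variables, applying the defaults
    from the table above (illustrative sketch, not the collector's code)."""
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if not endpoint:
        # The endpoint is the only variable without a default.
        raise RuntimeError("OTEL_EXPORTER_OTLP_ENDPOINT is required")
    return {
        "application_name": env.get("GPU_APPLICATION_NAME", "default_app"),
        "environment": env.get("GPU_ENVIRONMENT", "production"),
        "endpoint": endpoint,
        # Optional; can be left unset when exporting to OpenLIT.
        "headers": env.get("OTEL_EXPORTER_OTLP_HEADERS", ""),
    }
```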

Alternative: Using OpenLIT SDK

You can also collect GPU metrics directly using the OpenLIT SDK in your Python application. Here’s an example:

import openlit

openlit.init(collect_gpu_stats=True)

For more details, check out the OpenLIT documentation or the SDK source code.

Metrics

Metric Name           Description                           Unit      Type
gpu.utilization       GPU utilization percentage            percent   Gauge
gpu.enc.utilization   GPU encoder utilization percentage    percent   Gauge
gpu.dec.utilization   GPU decoder utilization percentage    percent   Gauge
gpu.temperature       GPU temperature in Celsius            Celsius   Gauge
gpu.fan_speed         GPU fan speed (0-100) as an integer   integer   Gauge
gpu.memory.available  Available GPU memory in MB            MB        Gauge
gpu.memory.total      Total GPU memory in MB                MB        Gauge
gpu.memory.used       Used GPU memory in MB                 MB        Gauge
gpu.memory.free       Free GPU memory in MB                 MB        Gauge
gpu.power.draw        GPU power draw in Watts               Watt      Gauge
gpu.power.limit       GPU power limit in Watts              Watt      Gauge

All metrics carry the attributes: telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid.
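Several of these gauges combine naturally into derived values on the query side, e.g. memory utilization from gpu.memory.used and gpu.memory.total, or power headroom from gpu.power.draw and gpu.power.limit. A hedged sketch of that arithmetic (helper names are illustrative, not part of the collector):

```python
def memory_utilization_percent(used_mb: float, total_mb: float) -> float:
    """Derive a utilization percentage from the gpu.memory.used and
    gpu.memory.total gauges (both reported in MB)."""
    if total_mb <= 0:
        raise ValueError("total memory must be positive")
    return 100.0 * used_mb / total_mb

def headroom_watts(draw_w: float, limit_w: float) -> float:
    """Remaining power headroom from gpu.power.draw and gpu.power.limit."""
    return limit_w - draw_w

print(memory_utilization_percent(6144, 24576))  # → 25.0
print(headroom_watts(250.0, 300.0))             # → 50.0
```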

Building the Docker Image

To build the Docker image yourself, you can clone the repository and execute the following commands:

# Clone the OpenLIT repository and change into the collector directory
git clone https://github.com/openlit/openlit.git
cd openlit/otel-gpu-collector

# Build the Docker image
docker build -t otel-gpu-collector .

🛣️ Roadmap

We are dedicated to continuously improving OpenTelemetry GPU Collector. Here's a look at what's been accomplished and what's on the horizon:

Feature                                      Status
OpenTelemetry-native AMD GPU Monitoring      ✅ Completed
OpenTelemetry-native NVIDIA GPU Monitoring   ✅ Completed

🌱 Contributing

Whether it's big or small, we love contributions 💚. Check out our Contribution guide to get started.

Unsure where to start? Here are a few ways to get involved:

  • Join our Slack or Discord community to discuss ideas, share feedback, and connect with both our team and the wider OpenLIT community.

Your input helps us grow and improve, and we're here to support you every step of the way.

💚 Community & Support

Connect with the OpenLIT community and maintainers for support, discussions, and updates:

  • 🌟 If you like the project, leave a star on our GitHub.
  • 🌍 Join our Slack or Discord community for live interactions and questions.
  • 🐞 Report bugs on our GitHub Issues to help us improve OpenLIT.
  • 𝕏 Follow us on X for the latest updates and news.

License

OpenTelemetry GPU Collector is built and maintained by OpenLIT under the Apache-2.0 license.