OpenTelemetry GPU Collector is a lightweight, efficient collector that gathers GPU performance metrics and sends them to an OpenTelemetry-compatible endpoint for monitoring and observability. It is particularly useful for monitoring GPUs in high-performance computing environments and AI/ML workloads, including LLM training and inference.
- Collects detailed GPU performance metrics
- OpenTelemetry-native
- Lightweight and efficient
- Supports NVIDIA and AMD GPUs
- Docker installed on your system
You can quickly start using the OTel GPU Collector by pulling the Docker image:

```sh
docker pull ghcr.io/openlit/otel-gpu-collector:latest
```
Here's a quick example showing how to run the container with the required environment variables:

```sh
docker run --gpus all \
  -e GPU_APPLICATION_NAME='chatbot' \
  -e GPU_ENVIRONMENT='staging' \
  -e OTEL_EXPORTER_OTLP_ENDPOINT="YOUR_OTEL_ENDPOINT" \
  -e OTEL_EXPORTER_OTLP_HEADERS="YOUR_OTEL_HEADERS" \
  ghcr.io/openlit/otel-gpu-collector:latest
```
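The `OTEL_EXPORTER_OTLP_HEADERS` value follows the standard OTLP exporter convention: comma-separated `key=value` pairs, with values optionally percent-encoded. As a minimal sketch (not the collector's actual code, and the header names below are purely illustrative), such a value decomposes like this:

```python
from urllib.parse import unquote

def parse_otlp_headers(raw: str) -> dict:
    """Split an OTLP headers string ("k1=v1,k2=v2") into a dict.

    Values may be percent-encoded per the OTLP exporter spec.
    """
    headers = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue
        key, _, value = pair.partition("=")
        headers[key.strip()] = unquote(value.strip())
    return headers

# Illustrative value; real deployments use whatever auth header their backend expects
print(parse_otlp_headers("Authorization=Bearer%20abc123,x-tenant=team-a"))
```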
Note: If you've deployed OpenLIT using Docker Compose, make sure to use the host's IP address, or add the OTel GPU Collector service to your Docker Compose file:
Docker Compose: Add the following config under `services`:

```yaml
otel-gpu-collector:
  image: ghcr.io/openlit/otel-gpu-collector:latest
  environment:
    GPU_APPLICATION_NAME: 'chatbot'
    GPU_ENVIRONMENT: 'staging'
    OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector:4318"
  device_requests:
    - driver: nvidia
      count: all
      capabilities: [gpu]
  depends_on:
    - otel-collector
  restart: always
```
Host IP: Use the host's IP address to connect to the OTel Collector:

```sh
OTEL_EXPORTER_OTLP_ENDPOINT="http://192.168.10.15:4318"
```
OTel GPU Collector supports several environment variables for configuration. Below is a table that describes each variable:

| Environment Variable | Description | Default Value |
|---|---|---|
| `GPU_APPLICATION_NAME` | Name of the application running on the GPU | `default_app` |
| `GPU_ENVIRONMENT` | Environment name (e.g., `staging`, `production`) | `production` |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OpenTelemetry OTLP endpoint URL | (required) |
| `OTEL_EXPORTER_OTLP_HEADERS` | Headers for authenticating with the OTLP endpoint | (optional; ignore if using OpenLIT) |
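To make the precedence concrete, here is a small sketch of how these variables and their defaults combine. This is not the collector's actual implementation; the `collector_config` helper is hypothetical and only mirrors the defaults listed in the table above:

```python
import os

def collector_config() -> dict:
    """Hypothetical helper: resolve collector settings from the environment."""
    endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if not endpoint:
        # The endpoint has no default and must always be provided
        raise RuntimeError("OTEL_EXPORTER_OTLP_ENDPOINT is required")
    return {
        "application_name": os.environ.get("GPU_APPLICATION_NAME", "default_app"),
        "environment": os.environ.get("GPU_ENVIRONMENT", "production"),
        "endpoint": endpoint,
        "headers": os.environ.get("OTEL_EXPORTER_OTLP_HEADERS", ""),
    }

# With only the required variable set, the other values fall back to their defaults
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "http://otel-collector:4318"
print(collector_config()["application_name"])  # → default_app
```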
You can also collect GPU metrics directly using the OpenLIT SDK in your Python application. Here's an example:

```python
import openlit

openlit.init(collect_gpu_stats=True)
```
For more details, check out the OpenLIT documentation or the SDK source code.
| Metric Name | Description | Unit | Type |
|---|---|---|---|
| `gpu.utilization` | GPU utilization | percent | Gauge |
| `gpu.enc.utilization` | GPU encoder utilization | percent | Gauge |
| `gpu.dec.utilization` | GPU decoder utilization | percent | Gauge |
| `gpu.temperature` | GPU temperature | Celsius | Gauge |
| `gpu.fan_speed` | GPU fan speed (0–100) | integer | Gauge |
| `gpu.memory.available` | Available GPU memory | MB | Gauge |
| `gpu.memory.total` | Total GPU memory | MB | Gauge |
| `gpu.memory.used` | Used GPU memory | MB | Gauge |
| `gpu.memory.free` | Free GPU memory | MB | Gauge |
| `gpu.power.draw` | GPU power draw | Watt | Gauge |
| `gpu.power.limit` | GPU power limit | Watt | Gauge |

Every metric carries the attributes `telemetry.sdk.name`, `gen_ai.application_name`, `gen_ai.environment`, `gpu_index`, `gpu_name`, and `gpu_uuid`.
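Some useful signals can be derived from these gauges rather than read directly; for example, memory utilization as a percentage follows from `gpu.memory.used` and `gpu.memory.total`. A small sketch, with purely illustrative sample values:

```python
def memory_utilization_percent(used_mb: float, total_mb: float) -> float:
    """Percentage of GPU memory in use, derived from the memory gauges above."""
    if total_mb <= 0:
        raise ValueError("total_mb must be positive")
    return 100.0 * used_mb / total_mb

# Illustrative data points for a 24 GB card: 18432 MB used of 24576 MB total
print(memory_utilization_percent(used_mb=18432, total_mb=24576))  # → 75.0
```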
To build the Docker image yourself, clone the repository and run the following commands:

```sh
# Clone the OpenLIT repository and change into the collector directory
git clone https://github.com/openlit/openlit.git
cd openlit/otel-gpu-collector

# Build the Docker image
docker build -t otel-gpu-collector .
```
We are dedicated to continuously improving the OpenTelemetry GPU Collector. Here's a look at what's been accomplished and what's on the horizon:

| Feature | Status |
|---|---|
| OpenTelemetry-native AMD GPU Monitoring | ✅ Completed |
| OpenTelemetry-native NVIDIA GPU Monitoring | ✅ Completed |
Whether it's big or small, we love contributions 💚. Check out our Contribution guide to get started.
Unsure where to start? Here are a few ways to get involved:
- Join our Slack or Discord community to discuss ideas, share feedback, and connect with both our team and the wider OpenLIT community.
Your input helps us grow and improve, and we're here to support you every step of the way.
Connect with OpenLIT community and maintainers for support, discussions, and updates:
- 🌟 If you like it, leave a star on our GitHub
- 🌍 Join our Slack or Discord community for live interactions and questions.
- 🐞 Report bugs on our GitHub Issues to help us improve OpenLIT.
- 𝕏 Follow us on X for the latest updates and news.
OpenTelemetry GPU Collector is built and maintained by OpenLIT under the Apache-2.0 license.