GPU Deployment and Management Documentation

The CUDA Compatibility document describes the use of new CUDA toolkit components on systems with older base installations.

The NVIDIA Management Library reference.

The nvidia-smi man page.

The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA Application Programming Interface (API). The MPS runtime architecture is designed to transparently enable co-operative multi-process CUDA applications, typically MPI jobs, to utilize Hyper-Q capabilities on the latest NVIDIA (Kepler-based) Tesla and Quadro GPUs.

Driver Persistence

Any interactions with NVIDIA GPUs require that an instance of the kernel mode driver be running. This driver may be persistent in some environments and transient in others. This document describes the default driver behavior and options for modifying that behavior.
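
As a rough illustration of checking whether the driver is currently persistent, the sketch below parses the CSV output of `nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader`. The helper names `parse_persistence` and `query_persistence` are hypothetical, and the sketch assumes the tool reports the mode as "Enabled" or "Disabled"; verify both against the driver version in use.

```python
import csv
import io
import subprocess

# Command assumed to be available on the PATH of a machine with the NVIDIA driver installed.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,persistence_mode",
    "--format=csv,noheader",
]


def parse_persistence(csv_text):
    """Map GPU index -> True when persistence mode is reported as Enabled.

    Hypothetical helper: expects one "index, mode" row per GPU.
    """
    modes = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        index, mode = row
        modes[int(index)] = mode.strip() == "Enabled"
    return modes


def query_persistence():
    """Run nvidia-smi and parse its output (requires a GPU system)."""
    result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_persistence(result.stdout)
```

On a GPU system, `query_persistence()` would return a dictionary such as one entry per installed GPU; the parsing step is kept separate so it can be exercised on captured output without hardware.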

NVIDIA Validation Suite User Guide

NVVS is the system administrator and cluster manager's tool for detecting and troubleshooting common problems affecting NVIDIA Tesla GPUs in high performance computing environments. NVVS focuses on software and system configuration issues, diagnostics, topological concerns, and relative performance.

HW Field Diag

The HW field diag is the comprehensive tool for verifying GPU HW integrity in the field, and a required piece of the RMA process.

RMA Process

A standardized process must be followed to identify products that qualify for RMA. This document provides an overview of that process.

Dynamic Page Retirement

The NVIDIA driver supports "retiring" framebuffer pages that contain bad memory cells. This is called "dynamic page retirement" and is done automatically for cells that are degrading in quality. This feature can improve the longevity of an otherwise good board and is thus an important resiliency feature on supported products, especially in HPC and enterprise environments.
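
To make the idea concrete, the following sketch tallies retired pages per GPU and cause from text in the shape produced by `nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv,noheader`. The helper name `count_retired_pages` and the sample UUIDs, addresses, and cause strings used with it are hypothetical; check the exact field values against the installed driver.

```python
import csv
import io
from collections import Counter


def count_retired_pages(csv_text):
    """Count retired pages per (gpu_uuid, cause).

    Hypothetical helper: expects rows of "gpu_uuid, address, cause".
    The address column is read but not used in the tally.
    """
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        gpu_uuid, _address, cause = (field.strip() for field in row)
        counts[(gpu_uuid, cause)] += 1
    return counts
```

A tally like this could feed a monitoring check that flags boards whose retirement count is growing, although any threshold for action should come from the Dynamic Page Retirement and RMA documents rather than from this sketch.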

NVIDIA GPU Memory Error Management

This document describes the new memory error recovery features introduced in the NVIDIA® 100 GPU and NVIDIA 800 GPU.

XID Errors

This document explains what Xid messages are, and is intended to assist system administrators, developers, and FAEs in understanding the meaning behind these messages as an aid in analyzing and resolving GPU-related problems.
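
As an aid to log analysis, the sketch below extracts the device and Xid number from a kernel log line. It assumes lines of the general form `NVRM: Xid (PCI:0000:03:00): 79, ...`; the exact layout varies between driver versions, and the names `XID_RE` and `parse_xid` are hypothetical.

```python
import re

# Assumed line layout: "NVRM: Xid (<device>): <number>, <free-form detail>"
XID_RE = re.compile(
    r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+),\s*(?P<detail>.*)"
)


def parse_xid(line):
    """Return device, numeric Xid code, and detail text, or None if no match."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return {
        "device": match.group("device"),
        "code": int(match.group("code")),
        "detail": match.group("detail").strip(),
    }
```

Applied to each line of the kernel log, this yields the Xid numbers that can then be looked up in the XID Errors document; lines without an Xid report return `None` and can be skipped.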

NVIDIA GPU Debug Guidelines

This document provides GPU error debug and diagnosis guidelines, and is intended to help system administrators, developers, and FAEs get servers back up and running as quickly as possible.
