feat(jetson): add JetPack 7.2.0 inference image for Thor #2531

Draft
AadhavSivakumar wants to merge 1 commit into main from add-jetson-jetpack-7.2.0

@AadhavSivakumar AadhavSivakumar commented Jun 29, 2026

What & why

Adds a JetPack 7.2.0 inference-server image for NVIDIA Thor (sm_110), derived from the current 7.1.0 image. JetPack 7.2 ships a newer BSP than 7.1.0 targets, so the image needs the updated component stack.

The branch was rebased onto main (inference 1.3.3) and the 7.2.0 Dockerfile re-derived from the current 7.1.0 — which now builds Triton from source. This matters: without Triton the image cannot run the merged RF-DETR instance-segmentation fast path (#2464), and would ship already behind main. See Tier 3 below.

Component stack (confirmed by reading a Thor device flashed with JetPack 7.2):

| Component | JetPack 7.1.0 | JetPack 7.2.0 (this PR) |
|---|---|---|
| NVIDIA Jetson apt repo | r38.4 | r39.2 |
| L4T | 38.4.0 | 39.2.0 |
| CUDA | 13.0 (cuda-toolkit-13-0) | 13.2 (cuda-toolkit-13-2) |
| cuDNN | 9.x | 9.20.0.46 |
| TensorRT | 10.13 | 10.16.2.10 |
| PyTorch / torchvision | 2.10.0 / 0.25.0 (source) | 2.10.0 / 0.25.0 (source) |
| ONNX Runtime | 1.24.2 (source) | 1.24.2 (source) |
| Triton | 3.6.0 (source) | 3.6.0 (source) |
| flash-attn | 2.8.3 | main @ d16e381 (sm_110) |

Approach: compile from source (not prebuilt wheels)

7.1.0 compiles PyTorch / torchvision / ONNX Runtime / Triton / flash-attn from source because no prebuilt CUDA-13 aarch64 wheels existed. Prebuilt wheels exist now, but I evaluated both on the Thor device and kept the source build, with evidence:

  • jetson-ai-lab sbsa/cu130 torch wheels need undeclared system libs (NVPL libnvpl_lapack…, cuDSS libcudss.so.0) absent from JetPack's apt repo. Satisfying them via pip drags in cuda-toolkit 13.3 + nvidia-cublas 13.5, which skew against the system CUDA 13.2 that TensorRT and our source-built ONNX Runtime link → fragile.
  • download.pytorch.org/whl/cu132 torch 2.12 bundles pip nvidia-cudnn-cu13 9.20.0.48 (skew vs system 9.20.0.46), Thor-broken Triton, and ~3 GB of bloat.
  • The source build links apt OpenBLAS (USE_MKLDNN=0 USE_OPENMP=0) + system CUDA → exactly one clean CUDA stack in the image.
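The "exactly one CUDA stack" property can be spot-checked from inside the image. The following is a minimal sketch, not part of this PR: it lists pip distributions whose names suggest bundled CUDA libraries. The prefix list and helper names are assumptions chosen for illustration; only `importlib.metadata` is a real API here.

```python
from importlib import metadata

# Assumption: pip wheels that bundle CUDA libraries use these name prefixes
# (e.g. nvidia-cublas, nvidia-cudnn-cu13, cuda-toolkit). Illustrative only.
CUDA_WHEEL_PREFIXES = ("nvidia-", "cuda-")


def bundled_cuda_wheels(names_and_versions):
    """Return sorted (name, version) pairs that look like pip-bundled CUDA libs."""
    return sorted(
        (name, version)
        for name, version in names_and_versions
        if name.lower().startswith(CUDA_WHEEL_PREFIXES)
    )


def installed_distributions():
    """Yield (name, version) for every distribution visible to this interpreter."""
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries with broken metadata
            yield name, dist.version


def report():
    """Print any pip-bundled CUDA wheels; an empty result means no pip-side skew."""
    offenders = bundled_cuda_wheels(installed_distributions())
    for name, version in offenders:
        print(f"pip-bundled CUDA library: {name}=={version}")
    return offenders
```

In an image built the source-build way, `report()` would be expected to print nothing; any hit is a candidate for the version skew described above.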

Two version decisions that needed care

ONNX Runtime stays at 1.24.2 (not bumped to 1.26.0). 1.26.0 (and any ≥1.25.0) fails to compile against JetPack 7.2's CUDA 13.2: ORT 1.25.0 added a <cub/cub.cuh> umbrella include (cu_inc/cub.cuhcommon.cuh) that pulls cub/device/device_transform.cuh, where CUDA 13.2's bundled CCCL emits ill-formed C++ (struct ::cuda::proclaims_copyable_arguments<…> : ::cuda::std::true_type → gcc 13/14: "global qualification of class name is invalid before ':' token"). Tracked upstream: NVIDIA/cccl#8833, onnxruntime#28023; fixed by cccl#8771 but not yet in a shipped CUDA toolkit. 1.24.2 predates the include and builds clean (it's also what 7.1.0 ships; its TensorRT-EP still auto-detects TRT 10.16).

flash-attn built from a pinned main commit (d16e381) with FLASH_ATTN_CUDA_ARCHS=110. Tagged releases up to v2.8.3.post1 only emit sm_80/90/100/120 (their setup.py has no sm_110 branch), so on Thor they fail at runtime with no kernel image is available for execution on the device. main adds the CUDA-13 "Thor rename" gencode (compute_110f/sm_110). Restricting to arch 110 matches the rest of the image (torch + ORT are sm_110-only) and speeds the build.
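The failure mode for tagged releases can be illustrated with a small sketch. It deliberately simplifies: it checks only for an exact sm match and ignores PTX forward compatibility and family-specific targets. The helper names are hypothetical; the capability tuple is what `torch.cuda.get_device_capability()` returns on a CUDA device.

```python
# Architectures that tagged flash-attn releases emit, per the text above.
RELEASE_ARCHS = ("80", "90", "100", "120")

# Thor reports compute capability (11, 0), i.e. sm_110.
THOR_CAPABILITY = (11, 0)


def arch_tag(capability):
    """Turn a (major, minor) compute capability into an arch string like '110'."""
    major, minor = capability
    return f"{major}{minor}"


def has_kernel_for(capability, built_archs):
    """True if a binary built for `built_archs` contains an exact sm match."""
    return arch_tag(capability) in {str(arch) for arch in built_archs}


# Tagged releases: no sm_110 kernel, so the launch fails at runtime on Thor.
print(has_kernel_for(THOR_CAPABILITY, RELEASE_ARCHS))
# Build restricted to FLASH_ATTN_CUDA_ARCHS=110: the kernel is present.
print(has_kernel_for(THOR_CAPABILITY, ["110"]))
```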

Changes

  • docker/dockerfiles/Dockerfile.onnx.jetson.7.2.0 — new; re-derived from the current 7.1.0. r39.2 repo, CUDA 13.2 (builder + runtime), L4T_VERSION=39.2.0, the ORT/flash-attn pins above, Triton 3.6.0 built from source (+ cuda-nvcc-13-2 and the runtime ptxas/build-essential deps it needs for sm_110 JIT). Build parallelism exposed as args (PYTORCH_MAX_JOBS/ORT_BUILD_PARALLEL/FLASH_ATTN_MAX_JOBS, Depot-safe defaults); BuildKit cache ids version-scoped (jp72/cu132).
  • .github/workflows/docker.jetson.7.2.0.yml — new; mirrors the 7.1.0 Depot linux/arm64 build.
  • inference_cli/lib/container_adapter.py — auto-detect L4T 39 / JetPack 7.2 → the 7.2.0 image (sorted-descending _JetsonImage entry; "7.2" matched before the "7" catch-all).
  • tests/inference_cli/unit_tests/lib/test_container_adapter.py — cases for L4T 39 → 7.2.0 and jetpack 7.2 / 7.2.0 / 7.2-b187 → 7.2.0 (61 tests pass).
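The ordering requirement in the container_adapter change ("7.2" matched before the "7" catch-all) can be sketched as below. This is an illustration of the idea only, not the code in `container_adapter.py`: the class, the image tags, and the prefix-matching rule are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JetsonImage:
    jetpack_prefix: str  # e.g. "7.2" or the catch-all "7"
    image: str           # hypothetical image tag


def version_key(prefix):
    """Sort key: '7.2' -> (7, 2), '7' -> (7,). Longer tuples sort higher."""
    return tuple(int(part) for part in prefix.split("."))


# Sorted descending so the most specific prefix is tried first.
IMAGES = sorted(
    [
        JetsonImage("7", "image-7.1.0"),    # catch-all for other 7.x
        JetsonImage("7.2", "image-7.2.0"),
        JetsonImage("6", "image-6.x"),
    ],
    key=lambda entry: version_key(entry.jetpack_prefix),
    reverse=True,
)


def select_image(jetpack_version):
    """Return the first image whose prefix matches, or None if nothing matches."""
    for entry in IMAGES:
        if jetpack_version.startswith(entry.jetpack_prefix):
            return entry.image
    return None
```

If the list were sorted ascending, "7" would match "7.2-b187" first and the device would get the older image, which is the case the new unit tests guard against.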

Validation & benchmarks

Built natively on a Thor device (aarch64, sm_110, JetPack 7.2) and validated in-container, key-free (public COCO weights). All FPS are batch-1, single-stream.

⚠️ Measurement prerequisite — MAXN + jetson_clocks. The Thor ships in the 120 W power mode with the GPU on the dynamic governor, which idles at its 315 MHz minimum. Inference is bursty (a short GPU forward between ~15 ms of CPU preprocessing), so the governor never sees sustained GPU load and never ramps — the whole pipeline runs at ~20 % of GPU clock, making a Thor look slower than an Orin Nano. Every number below is measured after:

sudo nvpmodel -m 0     # MAXN
sudo jetson_clocks     # lock GPU 1575MHz / CPU 2601MHz / EMC 4266MHz

This alone is worth ~1.6–1.7× end-to-end (e.g. seg-nano native-res fast path 62 → 99 FPS). nvpmodel persists across reboot; jetson_clocks does not — deployments should re-apply it at boot or the numbers silently regress.
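One way to re-apply `jetson_clocks` at boot is a oneshot systemd unit. The sketch below only renders the unit text. The unit name, the `/usr/bin/jetson_clocks` path, and the `After=nvpmodel.service` ordering are assumptions to verify on the target device; none of this is part of the PR.

```python
from pathlib import Path

UNIT_NAME = "jetson-clocks-lock.service"  # hypothetical unit name


def render_unit(jetson_clocks_path="/usr/bin/jetson_clocks"):
    """Render a oneshot unit that runs jetson_clocks once per boot."""
    lines = [
        "[Unit]",
        "Description=Lock Jetson clocks at boot (jetson_clocks)",
        # Assumption: the power mode is restored by nvpmodel.service first.
        "After=nvpmodel.service",
        "",
        "[Service]",
        "Type=oneshot",
        f"ExecStart={jetson_clocks_path}",
        "RemainAfterExit=yes",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
    ]
    return "\n".join(lines) + "\n"


def write_unit(directory):
    """Write the unit into `directory` (e.g. /etc/systemd/system, as root)."""
    path = Path(directory) / UNIT_NAME
    path.write_text(render_unit())
    return path
```

After writing the file, the unit would still need `systemctl daemon-reload` and `systemctl enable` to take effect on the next boot.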

  • ✅ PyTorch 2.10.0 CUDA on NVIDIA Thor cc (11,0); torchvision 0.25.0 NMS; flash-attn (sm_110) kernel.
  • ✅ ONNX Runtime 1.24.2 CUDA and TensorRT EPs — whole RF-DETR graph runs on TRT (single subgraph, no CPU fallback).
  • ✅ Triton 3.6.0 (built from source in this image) imports and JIT-compiles kernels on Thor sm_110, which enables the RF-DETR seg fast path below.
  • ✅ YOLOv8n (640²) runs via ORT TensorRT EP (fp16).
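A check along the lines of the execution-provider validation above might look like the following sketch. The provider names and the `onnxruntime` calls (`get_available_providers`, `InferenceSession(..., providers=...)`, `get_providers`) are real; the helper names and the preference order are assumptions. Note that a provider appearing in `get_providers()` does not by itself prove the whole graph ran on it.

```python
# Preference order for this sketch: TensorRT first, then CUDA, then CPU.
PREFERRED = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]


def usable_providers(preferred, available):
    """Keep the preferred order, dropping providers this ORT build lacks."""
    available_set = set(available)
    return [provider for provider in preferred if provider in available_set]


def make_session(model_path):
    """Create a session and return it with the providers ORT actually enabled."""
    import onnxruntime as ort  # lazy import keeps the helper above dependency-free

    providers = usable_providers(PREFERRED, ort.get_available_providers())
    session = ort.InferenceSession(model_path, providers=providers)
    return session, session.get_providers()
```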

RF-DETR is reported in tiers, because the model forward is a small slice of wall-clock and CPU preprocessing dominates.

Tier 1 — model-forward throughput (what the Thor GPU delivers)

ORT TensorRT EP (fp16), synthetic fixed input, forward only:

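| OD model | input | TRT-EP (FPS) | CUDA-EP (FPS) | IS model | input | TRT-EP (FPS) | CUDA-EP (FPS) |
|---|---|---|---|---|---|---|---|
| nano | 384² | 700 | 189 | seg-nano | 312² | 513 | 106 |
| small | 512² | 490 | 107 | seg-small | 384² | 403 | 83 |
| medium | 576² | 376 | 79 | seg-medium | 432² | 305 | 58 |
| base | 560² | 410 | 78 | seg-large | 504² | 244 | 42 |
| large | 704² | 283 | 50 | seg-xlarge | 624² | 150 | 20 |
| xlarge | 700² | 205 | 28 | seg-2xlarge | 768² | 83 | 12 |
| 2xlarge | 880² | 129 | 23 | | | | |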

Tier 2 — deployed single-model end-to-end (inference_models, real 1080p input)

Full pre_process → forward → post_process on a 1920×1080 frame, deployed ONNX backend. TensorRT EP is the faster execution provider end-to-end for every model; CUDA EP remains the shipped default (matches 7.1.0), and TRT EP is opt-in via ONNXRUNTIME_EXECUTION_PROVIDERS:

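| OD model | e2e TRT-EP (FPS) | e2e CUDA-EP (FPS) | IS model | e2e TRT-EP (FPS) | e2e CUDA-EP (FPS) |
|---|---|---|---|---|---|
| nano | 59 | 48 | seg-nano | 52 | 38 |
| small | 52 | 38 | seg-small | 51 | 34 |
| medium | 50 | 33 | seg-medium | 48 | 29 |
| base | 50 | 33 | seg-large | 46 | 24 |
| large | 45 | 25 | seg-xlarge | 38 | 14 |
| xlarge | 43 | 19 | seg-2xlarge | 30 | 9 |
| 2xlarge | 36 | 16 | | | |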

Even so, e2e (nano 59) is far below the forward (700) — the ~15 ms CPU resize/normalize of a full 1080p frame dominates. That preprocessing wall is what the fast path and native-resolution input remove (below).
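The ~15 ms figure can be recovered from the Tier 1 and Tier 2 numbers for the nano model. The sketch below assumes the stages run serially with no overlap, so per-frame time is simply 1000 / FPS and the non-forward cost is the difference; the function names are illustrative.

```python
def frame_ms(fps):
    """Per-frame wall-clock time in milliseconds at a given throughput."""
    return 1000.0 / fps


def non_forward_ms(e2e_fps, forward_fps):
    """Time per frame spent outside the model forward (serial-stage assumption)."""
    return frame_ms(e2e_fps) - frame_ms(forward_fps)


# (end-to-end FPS at 1080p, forward-only FPS) for RF-DETR nano, from the tables.
NANO = {"TRT-EP": (59, 700), "CUDA-EP": (48, 189)}

for provider, (e2e_fps, forward_fps) in NANO.items():
    overhead = non_forward_ms(e2e_fps, forward_fps)
    print(f"{provider}: {overhead:.1f} ms per frame outside the forward pass")
```

Both providers give roughly the same non-forward cost of about 15.5 ms, which is consistent with a fixed CPU preprocessing cost that does not depend on the execution provider.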

Tier 3 — RF-DETR instance-segmentation fast path (Triton + CUDA graphs + pipelining)

The merged CodeFlash optimization (#2464, inference ≥1.3.2): fused Triton GPU pre/post-processing + TensorRT CUDA graphs + depth-2 CPU/GPU pipeline scheduling, for the trt backend via InferencePipeline. This is why the rebase + Triton addition are in this PR. Measured on Thor across all six instance-segmentation sizes (TRT backend, @v3 block, enforce_dense_masks_in_inference_models: False, through InferencePipeline; 875-frame clips), baseline vs fast path on a 1080p stream and on each model's native-resolution stream:

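| seg model | native input | baseline 1080p (FPS) | fast path 1080p (FPS) | fast path native-res (FPS) |
|---|---|---|---|---|
| seg-nano | 312² | 12.8 | 31.1 | 99.9 |
| seg-small | 384² | 12.3 | 28.5 | 83.1 |
| seg-medium | 432² | 11.5 | 27.8 | 77.1 |
| seg-large | 504² | 10.9 | 25.2 | 68.9 |
| seg-xlarge | 624² | 9.0 | 21.2 | 47.9 |
| seg-2xlarge | 768² | 8.5 | 19.4 | 35.9 |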
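As a cross-check of the table above, the 1080p speedup of the fast path over baseline can be recomputed per model. The numbers are copied from the table; the variable names are illustrative.

```python
# model -> (baseline 1080p FPS, fast path 1080p FPS), from the Tier 3 table.
TIER3_1080P = {
    "seg-nano": (12.8, 31.1),
    "seg-small": (12.3, 28.5),
    "seg-medium": (11.5, 27.8),
    "seg-large": (10.9, 25.2),
    "seg-xlarge": (9.0, 21.2),
    "seg-2xlarge": (8.5, 19.4),
}

# Speedup is simply fast-path FPS divided by baseline FPS.
speedups = {
    model: fast / baseline for model, (baseline, fast) in TIER3_1080P.items()
}

for model, ratio in speedups.items():
    print(f"{model}: {ratio:.2f}x")

print(f"range: {min(speedups.values()):.2f}x to {max(speedups.values()):.2f}x")
```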

The fast path is a consistent ~2.3–2.4× over the 1080p baseline across the whole range. seg-nano's 99.9 FPS at native resolution is within about 5 % of the CodeFlash team's ~105 FPS on an Orin Nano, i.e. the Thor now performs as it should, and even seg-2xlarge stays real-time-capable. Enabled with:

INFERENCE_MODELS_RFDETR_TRITON_PREPROC_ENABLED=true \
INFERENCE_MODELS_RFDETR_TRITON_POSTPROC_ENABLED=true \
RFDETR_PIPELINE_DEPTH=2 \
ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=true   # see note

(Constraints, per the optimization's authors: trt backend, @v3 seg block with non-dense masks, static batch 1, STRETCH_TO resize, via InferencePipeline.)
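When the pipeline is launched from Python rather than a shell, the same switches can be set programmatically. This is a minimal sketch with hypothetical helper names. It assumes the variables are read when `inference` is imported, so they should be set before that import; that ordering is an assumption, not something this PR states.

```python
import os

# The three switches used for the numbers in the Tier 3 table.
FAST_PATH_ENV = {
    "INFERENCE_MODELS_RFDETR_TRITON_PREPROC_ENABLED": "true",
    "INFERENCE_MODELS_RFDETR_TRITON_POSTPROC_ENABLED": "true",
    "RFDETR_PIPELINE_DEPTH": "2",
}

CUDA_GRAPHS_VAR = "ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND"


def apply_fast_path_env(environ=None, cuda_graphs=False):
    """Set the fast-path variables on `environ` (defaults to os.environ).

    `cuda_graphs` is off by default because only graph-capturable TRT
    engines support it.
    """
    target = os.environ if environ is None else environ
    target.update(FAST_PATH_ENV)
    if cuda_graphs:
        target[CUDA_GRAPHS_VAR] = "true"
    return target


# Usage: call apply_fast_path_env() first, then import inference and start
# InferencePipeline as usual.
```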

Note on CUDA graphs: the numbers above are without ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND. With it, the TRT engine I built locally for this validation (a plain OnnxParser FP16 build) fails CUDA-graph capture (cudaErrorStreamCaptureInvalidated) — it isn't graph-capturable. Platform-served TRT packages (built by inference_models' own compile path, which the authors validated with CUDA graphs) support it for additional gain on top.

Where the rest of the time goes — input resolution & stream decode

Feeding each model an input already at its exact size (no 1080p→model resize) collapses preprocessing, and the TRT-EP forward advantage carries straight through:

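| OD model | 1080p (FPS) | exact-size (FPS) | IS model | 1080p (FPS) | exact-size (FPS) |
|---|---|---|---|---|---|
| nano | 59 | 243 | seg-nano | 52 | 230 |
| small | 52 | 184 | seg-small | 51 | 190 |
| medium | 50 | 145 | seg-medium | 48 | 164 |
| base | 50 | 153 | seg-large | 46 | 132 |
| large | 45 | 109 | seg-xlarge | 38 | 85 |
| xlarge | 43 | 95 | seg-2xlarge | 30 | 52 |
| 2xlarge | 36 | 63 | | | |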

Full stream end-to-end through InferencePipeline over real RTSP — the live geary SF-cam (30 fps) and a local 120 fps source (so configs above 30 fps aren't capped), seg-nano:

| source | resolution | fast path |
|---|---|---|
| geary live SF cam (30 fps) | 1080p | ~28 FPS |
| local 120 fps RTSP | 1080p | 31 FPS |
| local 120 fps RTSP | 312² (native) | 99 FPS |

Takeaway: the 7.2 image's compute is not the limiter. TRT-EP forward is 83–700 FPS and exact-size (native-resolution) e2e is 52–243 FPS. Deployed FPS at 1080p is spent on whole-frame resize (removed by the Triton fast path this image enables) and on stream decode plus the synchronous execution engine (a separate platform-wide pipelining effort). The one thing that must not be forgotten: set MAXN + jetson_clocks, without which the GPU runs at ~20 % of its clock.

🤖 Generated with Claude Code

CLAassistant commented Jun 29, 2026

CLA assistant check
All committers have signed the CLA.

@AadhavSivakumar AadhavSivakumar self-assigned this Jun 29, 2026
Add a Jetson 7.2.0 image targeting JetPack 7.2 (L4T r39.2), derived from the
current 7.1.0 image.

docker/dockerfiles/Dockerfile.onnx.jetson.7.2.0: NVIDIA Jetson repo r39.2,
CUDA 13.2 (cuda-toolkit-13-2), cuDNN 9.20, TensorRT 10.16, L4T_VERSION=39.2.0.
PyTorch 2.10.0 + torchvision 0.25.0 compiled from source for Thor (sm_110).
- ONNX Runtime kept at 1.24.2: 1.25.0+ (incl. 1.26.0) fail to compile against
  CUDA 13.2's bundled CCCL (cub/device/device_transform.cuh -> gcc 'global
  qualification of class name is invalid'; NVIDIA/cccl#8833, onnxruntime#28023).
- flash-attn built from a pinned main commit with FLASH_ATTN_CUDA_ARCHS=110:
  tagged releases (<=2.8.3.post1) emit no sm_110 kernels, so they fail on Thor
  at runtime ('no kernel image is available for execution on the device').
- Triton 3.6.0 built from source (mirrors current 7.1.0): required by the
  RF-DETR instance-segmentation Triton fast path (GPU pre/post-processing).
- Build parallelism exposed as ARGs (Depot-safe defaults); version-scoped
  BuildKit cache ids (jp72/cu132).

Also: CI workflow docker.jetson.7.2.0.yml (mirrors 7.1.0); container_adapter
auto-detect L4T 39 / JetPack 7.2 -> 7.2.0 image; unit tests for the mapping.

Built natively on a Thor device and validated key-free (public COCO weights):
torch CUDA, torchvision, flash-attn sm_110, ONNX Runtime CUDA + TensorRT EPs,
Triton JIT on sm_110. YOLOv8n 256 FPS (ORT TensorRT EP). RF-DETR forward
throughput up to 563 FPS (nano, TRT EP). RF-DETR-seg-nano end-to-end through
InferencePipeline: 7.6 -> 19.0 FPS (2.5x) with the Triton fast path enabled.

Co-Authored-By: Claude Opus 4.8 <[email protected]>