feat(jetson): add JetPack 7.2.0 inference image for Thor by AadhavSivakumar · Pull Request #2531 · roboflow/inference

AadhavSivakumar · 2026-06-29T18:27:15Z

What & why

Adds a JetPack 7.2.0 inference-server image for NVIDIA Thor (sm_110), derived from the current 7.1.0 image. JetPack 7.2 ships a newer BSP than 7.1.0 targets, so the image needs the updated component stack.

The branch was rebased onto main (inference 1.3.3) and the 7.2.0 Dockerfile was re-derived from the current 7.1.0 Dockerfile, which now builds Triton from source. This matters: without Triton the image cannot run the merged RF-DETR instance-segmentation fast path (#2464), so it would ship already behind main. See Tier 3 below.

Component stack (confirmed by reading a Thor device flashed with JetPack 7.2):

| Component | JetPack 7.1.0 | JetPack 7.2.0 (this PR) |
|---|---|---|
| NVIDIA Jetson apt repo | `r38.4` | `r39.2` |
| L4T | 38.4.0 | 39.2.0 |
| CUDA | 13.0 (`cuda-toolkit-13-0`) | 13.2 (`cuda-toolkit-13-2`) |
| cuDNN | 9.x | 9.20.0.46 |
| TensorRT | 10.13 | 10.16.2.10 |
| PyTorch / torchvision | 2.10.0 / 0.25.0 (source) | 2.10.0 / 0.25.0 (source) |
| ONNX Runtime | 1.24.2 (source) | 1.24.2 (source) |
| Triton | 3.6.0 (source) | 3.6.0 (source) |
| flash-attn | 2.8.3 | main @ `d16e381` (sm_110) |

Approach: compile from source (not prebuilt wheels)

7.1.0 compiles PyTorch / torchvision / ONNX Runtime / Triton / flash-attn from source because no prebuilt CUDA-13 aarch64 wheels existed. Prebuilt wheels exist now, but I evaluated both on the Thor device and kept the source build, with evidence:

- The jetson-ai-lab sbsa/cu130 torch wheels need undeclared system libraries (NVPL libnvpl_lapack…, cuDSS libcudss.so.0) that are absent from JetPack's apt repo. Satisfying them via pip drags in cuda-toolkit 13.3 + nvidia-cublas 13.5, which skew against the system CUDA 13.2 that TensorRT and our source-built ONNX Runtime link against, so the resulting stack is fragile.
- The download.pytorch.org/whl/cu132 torch 2.12 wheel bundles pip nvidia-cudnn-cu13 9.20.0.48 (skew vs system 9.20.0.46), a Triton that is broken on Thor, and ~3 GB of bloat.
- The source build links apt OpenBLAS (USE_MKLDNN=0 USE_OPENMP=0) + system CUDA, leaving exactly one clean CUDA stack in the image.
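The skew argument above can be made mechanical. The following is a minimal sketch, not part of the PR: `SYSTEM`, `CANDIDATES`, and `skew` are hypothetical names, and the version strings are only the ones quoted in this description.

```python
# Versions installed by JetPack 7.2's apt repo, as listed in the component table.
SYSTEM = {"cuda": "13.2", "cudnn": "9.20.0.46"}

# What each install route would bring into the image (only the versions
# named in this PR description; anything else is left out on purpose).
CANDIDATES = {
    "jetson-ai-lab sbsa/cu130 + pip deps": {"cuda": "13.3"},
    "download.pytorch.org/whl/cu132": {"cudnn": "9.20.0.48"},
    "source build": {"cuda": "13.2"},
}


def skew(system: dict, bundled: dict) -> dict:
    """Return {component: (system_version, bundled_version)} for mismatches."""
    return {
        name: (system[name], version)
        for name, version in bundled.items()
        if name in system and system[name] != version
    }


for route, bundled in CANDIDATES.items():
    print(route, "->", skew(SYSTEM, bundled) or "no skew")
```

Only the source build comes out with no skew, which is the reason it was kept.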

Two version decisions that needed care

ONNX Runtime stays at 1.24.2 (not bumped to 1.26.0). 1.26.0 (and any ≥1.25.0) fails to compile against JetPack 7.2's CUDA 13.2: ORT 1.25.0 added a <cub/cub.cuh> umbrella include (cu_inc/cub.cuh → common.cuh) that pulls cub/device/device_transform.cuh, where CUDA 13.2's bundled CCCL emits ill-formed C++ (struct ::cuda::proclaims_copyable_arguments<…> : ::cuda::std::true_type → gcc 13/14: "global qualification of class name is invalid before ':' token"). Tracked upstream: NVIDIA/cccl#8833, onnxruntime#28023; fixed by cccl#8771 but not yet in a shipped CUDA toolkit. 1.24.2 predates the include and builds clean (it's also what 7.1.0 ships; its TensorRT-EP still auto-detects TRT 10.16).

flash-attn is built from a pinned main commit (d16e381) with FLASH_ATTN_CUDA_ARCHS=110. Tagged releases up to v2.8.3.post1 only emit sm_80/90/100/120 (their setup.py has no sm_110 branch), so on Thor they fail at runtime with "no kernel image is available for execution on the device". main adds the CUDA-13 "Thor rename" gencode (compute_110f/sm_110). Restricting the build to arch 110 matches the rest of the image (torch + ORT are sm_110-only) and speeds up the build.
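The runtime failure follows directly from the arch lists. A minimal sketch of that check, assuming exact sm_XX matching only (it ignores PTX forward compatibility); `sm_name` and `has_kernel_for` are hypothetical helpers, and the arch lists are the ones quoted above. On a real device the capability tuple could come from `torch.cuda.get_device_capability()`.

```python
def sm_name(capability: tuple) -> str:
    """Turn a compute capability such as (11, 0) into 'sm_110'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def has_kernel_for(capability: tuple, compiled_archs: list) -> bool:
    """True if a binary built for compiled_archs has a kernel image for the device."""
    return sm_name(capability) in set(compiled_archs)


THOR = (11, 0)  # compute capability reported for Thor in this PR

tagged_release_archs = ["sm_80", "sm_90", "sm_100", "sm_120"]  # <= v2.8.3.post1
pinned_main_archs = ["sm_110"]  # FLASH_ATTN_CUDA_ARCHS=110

print("tagged release:", has_kernel_for(THOR, tagged_release_archs))
print("pinned main:   ", has_kernel_for(THOR, pinned_main_archs))
```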

Changes

- docker/dockerfiles/Dockerfile.onnx.jetson.7.2.0 (new): re-derived from the current 7.1.0. Uses the r39.2 repo, CUDA 13.2 (builder + runtime), L4T_VERSION=39.2.0, the ORT/flash-attn pins above, and Triton 3.6.0 built from source (+ cuda-nvcc-13-2 and the runtime ptxas/build-essential deps it needs for sm_110 JIT). Build parallelism is exposed as args (PYTORCH_MAX_JOBS/ORT_BUILD_PARALLEL/FLASH_ATTN_MAX_JOBS, Depot-safe defaults); BuildKit cache ids are version-scoped (jp72/cu132).
- .github/workflows/docker.jetson.7.2.0.yml (new): mirrors the 7.1.0 Depot linux/arm64 build.
- inference_cli/lib/container_adapter.py: auto-detects L4T 39 / JetPack 7.2 and selects the 7.2.0 image (sorted-descending _JetsonImage entry; "7.2" is matched before the "7" catch-all).
- tests/inference_cli/unit_tests/lib/test_container_adapter.py: cases for L4T 39 → 7.2.0 and jetpack 7.2 / 7.2.0 / 7.2-b187 → 7.2.0 (61 tests pass).
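The "most specific prefix first" matching described for container_adapter.py can be sketched as follows. This is an illustration under assumptions, not the code in the PR: `_IMAGES`, `_prefix_key`, `_matches`, and `select_image` are hypothetical names, and the real `_JetsonImage` entries may be structured differently.

```python
from typing import Optional

# Hypothetical table for illustration: (JetPack version prefix, image release).
_IMAGES = [
    ("7", "7.1.0"),    # catch-all for other JetPack 7.x releases
    ("7.2", "7.2.0"),  # more specific, must win over "7"
]


def _prefix_key(prefix: str) -> tuple:
    # Numeric key so that (7, 2) sorts above (7,).
    return tuple(int(part) for part in prefix.split("."))


def _matches(version: str, prefix: str) -> bool:
    # Match whole components only, so "7.20" does not match "7.2".
    if not version.startswith(prefix):
        return False
    rest = version[len(prefix):]
    return rest == "" or not rest[0].isdigit()


def select_image(jetpack_version: str) -> Optional[str]:
    # Try the most specific prefix first (sorted descending).
    ordered = sorted(_IMAGES, key=lambda entry: _prefix_key(entry[0]), reverse=True)
    for prefix, image in ordered:
        if _matches(jetpack_version, prefix):
            return image
    return None


for version in ("7.2", "7.2.0", "7.2-b187", "7.1.0", "5.1"):
    print(version, "->", select_image(version))
```

Sorting on a numeric tuple rather than on the raw string avoids surprises such as "10" sorting below "7".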

Validation & benchmarks

Built natively on a Thor device (aarch64, sm_110, JetPack 7.2) and validated in-container, key-free (public COCO weights). All FPS are batch-1, single-stream.

⚠️ Measurement prerequisite — MAXN + jetson_clocks. The Thor ships in the 120 W power mode with the GPU on the dynamic governor, which idles at its 315 MHz minimum. Inference is bursty (a short GPU forward between ~15 ms of CPU preprocessing), so the governor never sees sustained GPU load and never ramps — the whole pipeline runs at ~20 % of GPU clock, making a Thor look slower than an Orin Nano. Every number below is measured after:
sudo nvpmodel -m 0     # MAXN
sudo jetson_clocks     # lock GPU 1575MHz / CPU 2601MHz / EMC 4266MHz
This alone is worth ~1.6–1.7× end-to-end (e.g. seg-nano native-res fast path 62 → 99 FPS). nvpmodel persists across reboot; jetson_clocks does not — deployments should re-apply it at boot or the numbers silently regress.
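Because jetson_clocks does not persist, a deployment needs some boot-time hook to re-apply it. A minimal sketch of such a helper, assuming it is run as root from whatever boot mechanism the deployment already uses; `CLOCK_COMMANDS` and `apply_max_clocks` are hypothetical names, and the commands are the two shown above.

```python
import subprocess

# Commands taken from the text above; both need root on the device.
CLOCK_COMMANDS = [
    ["nvpmodel", "-m", "0"],  # MAXN power mode; persists across reboot
    ["jetson_clocks"],        # locks clocks; lost on reboot
]


def apply_max_clocks(dry_run: bool = True) -> list:
    """Run (or, with dry_run, only list) the clock-locking commands."""
    for command in CLOCK_COMMANDS:
        if not dry_run:
            subprocess.run(command, check=True)
    return [" ".join(command) for command in CLOCK_COMMANDS]


# Fraction of the locked GPU clock available at the governor's idle minimum.
idle_fraction = 315 / 1575

print(apply_max_clocks(dry_run=True))
print(f"idle GPU clock fraction: {idle_fraction:.0%}")
```

The last line reproduces the ~20 % figure quoted above: 315 MHz idle against the 1575 MHz locked clock.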

✅ PyTorch 2.10.0 CUDA on NVIDIA Thor cc (11,0); torchvision 0.25.0 NMS; flash-attn (sm_110) kernel.
✅ ONNX Runtime 1.24.2 CUDA and TensorRT EPs — whole RF-DETR graph runs on TRT (single subgraph, no CPU fallback).
✅ Triton 3.6.0 (built from source in this image) imports and JIT-compiles kernels on Thor sm_110 — enables the RF-DETR seg fast path below.
✅ YOLOv8n (640²) runs via ORT TensorRT EP (fp16).

RF-DETR is reported in tiers, because the model forward is a small slice of wall-clock and CPU preprocessing dominates.

Tier 1 — model-forward throughput (what the Thor GPU delivers)

ORT TensorRT EP (fp16), synthetic fixed input, forward only:

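| OD model | input | TRT-EP (FPS) | CUDA-EP (FPS) | IS model | input | TRT-EP (FPS) | CUDA-EP (FPS) |
|---|---|---|---|---|---|---|---|
| nano | 384² | 700 | 189 | seg-nano | 312² | 513 | 106 |
| small | 512² | 490 | 107 | seg-small | 384² | 403 | 83 |
| medium | 576² | 376 | 79 | seg-medium | 432² | 305 | 58 |
| base | 560² | 410 | 78 | seg-large | 504² | 244 | 42 |
| large | 704² | 283 | 50 | seg-xlarge | 624² | 150 | 20 |
| xlarge | 700² | 205 | 28 | seg-2xlarge | 768² | 83 | 12 |
| 2xlarge | 880² | 129 | 23 | | | | |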

Tier 2 — deployed single-model end-to-end (`inference_models`, real 1080p input)

Full pre_process → forward → post_process on a 1920×1080 frame, deployed ONNX backend. TensorRT EP is the faster execution provider end-to-end for every model; CUDA EP remains the shipped default (matches 7.1.0), and TRT EP is opt-in via ONNXRUNTIME_EXECUTION_PROVIDERS:

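| OD model | e2e TRT-EP (FPS) | e2e CUDA-EP (FPS) | IS model | e2e TRT-EP (FPS) | e2e CUDA-EP (FPS) |
|---|---|---|---|---|---|
| nano | 59 | 48 | seg-nano | 52 | 38 |
| small | 52 | 38 | seg-small | 51 | 34 |
| medium | 50 | 33 | seg-medium | 48 | 29 |
| base | 50 | 33 | seg-large | 46 | 24 |
| large | 45 | 25 | seg-xlarge | 38 | 14 |
| xlarge | 43 | 19 | seg-2xlarge | 30 | 9 |
| 2xlarge | 36 | 16 | | | |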

Even so, e2e (nano 59) is far below the forward (700) — the ~15 ms CPU resize/normalize of a full 1080p frame dominates. That preprocessing wall is what the fast path and native-resolution input remove (below).
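The ~15 ms figure can be recovered from the two nano numbers by converting FPS to per-frame time. A minimal sketch, assuming the stages run strictly one after another (no CPU/GPU overlap); `frame_ms` and `non_forward_ms` are hypothetical helper names.

```python
def frame_ms(fps: float) -> float:
    """Per-frame wall-clock time in milliseconds at a given throughput."""
    return 1000.0 / fps


def non_forward_ms(e2e_fps: float, forward_fps: float) -> float:
    """Time per frame spent outside the model forward, assuming serial stages."""
    return frame_ms(e2e_fps) - frame_ms(forward_fps)


# RF-DETR nano, TRT EP: Tier 2 e2e vs Tier 1 forward-only.
print(f"e2e frame time:     {frame_ms(59):.1f} ms")
print(f"forward frame time: {frame_ms(700):.1f} ms")
print(f"outside forward:    {non_forward_ms(59, 700):.1f} ms")
```

At 59 FPS a frame takes about 16.9 ms, of which the forward pass accounts for about 1.4 ms, leaving roughly 15.5 ms for pre- and post-processing.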

Tier 3 — RF-DETR instance-segmentation fast path (Triton + CUDA graphs + pipelining)

The merged CodeFlash optimization (#2464, inference ≥1.3.2): fused Triton GPU pre/post-processing + TensorRT CUDA graphs + depth-2 CPU/GPU pipeline scheduling, for the trt backend via InferencePipeline. This is why the rebase + Triton addition are in this PR. Measured on Thor across all six instance-segmentation sizes (TRT backend, @v3 block, enforce_dense_masks_in_inference_models: False, through InferencePipeline; 875-frame clips), baseline vs fast path on a 1080p stream and on each model's native-resolution stream:

| seg model | native | baseline 1080p (FPS) | fast path 1080p (FPS) | fast path native-res (FPS) |
|---|---|---|---|---|
| seg-nano | 312² | 12.8 | 31.1 | 99.9 |
| seg-small | 384² | 12.3 | 28.5 | 83.1 |
| seg-medium | 432² | 11.5 | 27.8 | 77.1 |
| seg-large | 504² | 10.9 | 25.2 | 68.9 |
| seg-xlarge | 624² | 9.0 | 21.2 | 47.9 |
| seg-2xlarge | 768² | 8.5 | 19.4 | 35.9 |
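The per-model speedup of the fast path over the baseline at 1080p can be recomputed from the table. A small sketch using only the figures above; `RESULTS` and `speedups` are hypothetical names.

```python
# (baseline 1080p FPS, fast path 1080p FPS) per model, copied from the table.
RESULTS = {
    "seg-nano": (12.8, 31.1),
    "seg-small": (12.3, 28.5),
    "seg-medium": (11.5, 27.8),
    "seg-large": (10.9, 25.2),
    "seg-xlarge": (9.0, 21.2),
    "seg-2xlarge": (8.5, 19.4),
}

speedups = {name: fast / base for name, (base, fast) in RESULTS.items()}

for name, ratio in speedups.items():
    print(f"{name:12s} {ratio:.2f}x")

print(f"range: {min(speedups.values()):.1f}x to {max(speedups.values()):.1f}x")
```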

The fast path is a consistent ~2.3–2.4× over baseline across the whole range. seg-nano's 99.9 FPS at native resolution matches the CodeFlash team's ~105 FPS on an Orin Nano — i.e. the Thor now performs as it should, and even seg-2xlarge stays real-time-capable. Enabled with:

INFERENCE_MODELS_RFDETR_TRITON_PREPROC_ENABLED=true \
INFERENCE_MODELS_RFDETR_TRITON_POSTPROC_ENABLED=true \
RFDETR_PIPELINE_DEPTH=2 \
ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=true   # see note

(Constraints, per the optimization's authors: trt backend, @v3 seg block with non-dense masks, static batch 1, STRETCH_TO resize, via InferencePipeline.)
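When the pipeline is launched from Python rather than from a shell, the same flags can be assembled as an environment mapping. A minimal sketch: `fast_path_env` is a hypothetical helper, the variable names are the ones listed above, and it assumes the flags are read when the inference process starts, so they are set before launching it.

```python
import os
from typing import Optional


def fast_path_env(cuda_graphs: bool = False, base: Optional[dict] = None) -> dict:
    """Return a copy of the environment with the RF-DETR fast-path flags set."""
    env = dict(os.environ if base is None else base)
    env.update(
        {
            "INFERENCE_MODELS_RFDETR_TRITON_PREPROC_ENABLED": "true",
            "INFERENCE_MODELS_RFDETR_TRITON_POSTPROC_ENABLED": "true",
            "RFDETR_PIPELINE_DEPTH": "2",
        }
    )
    if cuda_graphs:
        # Only for TRT engines that are graph-capturable (see the note on CUDA graphs).
        env["ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND"] = "true"
    return env


env = fast_path_env(cuda_graphs=False, base={})
for key in sorted(env):
    print(f"{key}={env[key]}")
```

The resulting mapping can be passed as `env=` to `subprocess.run` or `subprocess.Popen` for the process that starts InferencePipeline. CUDA graphs default to off here because the benchmark numbers in this section were measured without them.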

Note on CUDA graphs: the numbers above are without ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND. With it, the TRT engine I built locally for this validation (a plain OnnxParser FP16 build) fails CUDA-graph capture (cudaErrorStreamCaptureInvalidated) — it isn't graph-capturable. Platform-served TRT packages (built by inference_models' own compile path, which the authors validated with CUDA graphs) support it for additional gain on top.

Where the rest of the time goes — input resolution & stream decode

Feeding each model an input already at its exact size (no 1080p→model resize) collapses preprocessing, and the TRT-EP forward advantage carries straight through:

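| OD model | 1080p (FPS) | exact-size (FPS) | IS model | 1080p (FPS) | exact-size (FPS) |
|---|---|---|---|---|---|
| nano | 59 | 243 | seg-nano | 52 | 230 |
| small | 52 | 184 | seg-small | 51 | 190 |
| medium | 50 | 145 | seg-medium | 48 | 164 |
| base | 50 | 153 | seg-large | 46 | 132 |
| large | 45 | 109 | seg-xlarge | 38 | 85 |
| xlarge | 43 | 95 | seg-2xlarge | 30 | 52 |
| 2xlarge | 36 | 63 | | | |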

Full stream end-to-end through InferencePipeline over real RTSP — the live geary SF-cam (30 fps) and a local 120 fps source (so configs above 30 fps aren't capped), seg-nano:

| source | resolution | fast path |
|---|---|---|
| geary live SF cam (30 fps) | 1080p | ~28 FPS |
| local 120 fps RTSP | 1080p | 31 FPS |
| local 120 fps RTSP | 312² (native) | 99 FPS |

Takeaway: the 7.2 image's compute is not the limiter. Forward throughput is 83–700 FPS, and e2e with exact-size (native-resolution) input is 52–243 FPS per the exact-size table above. Deployed FPS at 1080p is spent on whole-frame resize (removed by the Triton fast path this image enables) and on stream decode + the synchronous execution engine (a separate platform-wide pipelining effort). The one thing that must not be forgotten: set MAXN + jetson_clocks, without which the GPU idles at ~20 % of its locked clock.

🤖 Generated with Claude Code

CLAassistant · 2026-06-29T18:27:21Z

All committers have signed the CLA.

Add a Jetson 7.2.0 image targeting JetPack 7.2 (L4T r39.2), derived from the current 7.1.0 image.

docker/dockerfiles/Dockerfile.onnx.jetson.7.2.0:
- NVIDIA Jetson repo r39.2, CUDA 13.2 (cuda-toolkit-13-2), cuDNN 9.20, TensorRT 10.16, L4T_VERSION=39.2.0.
- PyTorch 2.10.0 + torchvision 0.25.0 compiled from source for Thor (sm_110).
- ONNX Runtime kept at 1.24.2: 1.25.0+ (incl. 1.26.0) fail to compile against CUDA 13.2's bundled CCCL (cub/device/device_transform.cuh -> gcc 'global qualification of class name is invalid'; NVIDIA/cccl#8833, onnxruntime#28023).
- flash-attn built from a pinned main commit with FLASH_ATTN_CUDA_ARCHS=110: tagged releases (<=2.8.3.post1) emit no sm_110 kernels, so they fail on Thor at runtime ('no kernel image is available for execution on the device').
- Triton 3.6.0 built from source (mirrors current 7.1.0): required by the RF-DETR instance-segmentation Triton fast path (GPU pre/post-processing).
- Build parallelism exposed as ARGs (Depot-safe defaults); version-scoped BuildKit cache ids (jp72/cu132).

Also: CI workflow docker.jetson.7.2.0.yml (mirrors 7.1.0); container_adapter auto-detect L4T 39 / JetPack 7.2 -> 7.2.0 image; unit tests for the mapping.

Built natively on a Thor device and validated key-free (public COCO weights): torch CUDA, torchvision, flash-attn sm_110, ONNX Runtime CUDA + TensorRT EPs, Triton JIT on sm_110. YOLOv8n 256 FPS (ORT TensorRT EP). RF-DETR forward throughput up to 563 FPS (nano, TRT EP). RF-DETR-seg-nano end-to-end through InferencePipeline: 7.6 -> 19.0 FPS (2.5x) with the Triton fast path enabled.

Co-Authored-By: Claude Opus 4.8 <[email protected]>

AadhavSivakumar self-assigned this Jun 29, 2026

AadhavSivakumar force-pushed the add-jetson-jetpack-7.2.0 branch from 82b48b6 to 44a8389 Compare June 30, 2026 00:39

AadhavSivakumar requested a review from alexnorell June 30, 2026 19:01

AadhavSivakumar assigned alexnorell Jun 30, 2026

AadhavSivakumar wants to merge 1 commit into main from add-jetson-jetpack-7.2.0
