feat(jetson): add JetPack 7.2.0 inference image for Thor #2531
Draft
AadhavSivakumar wants to merge 1 commit into
Add a Jetson 7.2.0 image targeting JetPack 7.2 (L4T r39.2), derived from the current 7.1.0 image.

docker/dockerfiles/Dockerfile.onnx.jetson.7.2.0:
- NVIDIA Jetson repo r39.2, CUDA 13.2 (cuda-toolkit-13-2), cuDNN 9.20, TensorRT 10.16, L4T_VERSION=39.2.0. PyTorch 2.10.0 + torchvision 0.25.0 compiled from source for Thor (sm_110).
- ONNX Runtime kept at 1.24.2: 1.25.0+ (incl. 1.26.0) fail to compile against CUDA 13.2's bundled CCCL (cub/device/device_transform.cuh -> gcc 'global qualification of class name is invalid'; NVIDIA/cccl#8833, onnxruntime#28023).
- flash-attn built from a pinned main commit with FLASH_ATTN_CUDA_ARCHS=110: tagged releases (<=2.8.3.post1) emit no sm_110 kernels, so they fail on Thor at runtime ('no kernel image is available for execution on the device').
- Triton 3.6.0 built from source (mirrors current 7.1.0): required by the RF-DETR instance-segmentation Triton fast path (GPU pre/post-processing).
- Build parallelism exposed as ARGs (Depot-safe defaults); version-scoped BuildKit cache ids (jp72/cu132).

Also: CI workflow docker.jetson.7.2.0.yml (mirrors 7.1.0); container_adapter auto-detect L4T 39 / JetPack 7.2 -> 7.2.0 image; unit tests for the mapping.

Built natively on a Thor device and validated key-free (public COCO weights): torch CUDA, torchvision, flash-attn sm_110, ONNX Runtime CUDA + TensorRT EPs, Triton JIT on sm_110. YOLOv8n 256 FPS (ORT TensorRT EP). RF-DETR forward throughput up to 563 FPS (nano, TRT EP). RF-DETR-seg-nano end-to-end through InferencePipeline: 7.6 -> 19.0 FPS (2.5x) with the Triton fast path enabled.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
82b48b6 to 44a8389
What & why
Adds a JetPack 7.2.0 inference-server image for NVIDIA Thor (sm_110), derived from the current 7.1.0 image. JetPack 7.2 ships a newer BSP than 7.1.0 targets, so the image needs the updated component stack.

Component stack (confirmed by reading a Thor device flashed with JetPack 7.2), 7.1.0 → 7.2.0:

- L4T: r38.4 → r39.2
- CUDA: 13.0 (cuda-toolkit-13-0) → 13.2 (cuda-toolkit-13-2)
- flash-attn: main @ d16e381 (sm_110)

Approach: compile from source (not prebuilt wheels)
7.1.0 compiles PyTorch / torchvision / ONNX Runtime / Triton / flash-attn from source because no prebuilt CUDA-13 aarch64 wheels existed. Prebuilt wheels exist now, but I evaluated both on the Thor device and kept the source build, with evidence:
- sbsa/cu130 torch wheels need undeclared system libs (NVPL libnvpl_lapack…, cuDSS libcudss.so.0) absent from JetPack's apt repo. Satisfying them via pip drags in cuda-toolkit 13.3 + nvidia-cublas 13.5, which skew against the system CUDA 13.2 that TensorRT and our source-built ONNX Runtime link → fragile.
- download.pytorch.org/whl/cu132 torch 2.12 bundles pip nvidia-cudnn-cu13 9.20.0.48 (skew vs system 9.20.0.46), Thor-broken Triton, and ~3 GB of bloat.
- Source build (USE_MKLDNN=0 USE_OPENMP=0) + system CUDA → exactly one clean CUDA stack in the image.

Two version decisions that needed care
ONNX Runtime stays at 1.24.2 (not bumped to 1.26.0). 1.26.0 (and any ≥1.25.0) fails to compile against JetPack 7.2's CUDA 13.2: ORT 1.25.0 added a <cub/cub.cuh> umbrella include (cu_inc/cub.cuh → common.cuh) that pulls cub/device/device_transform.cuh, where CUDA 13.2's bundled CCCL emits ill-formed C++ (struct ::cuda::proclaims_copyable_arguments<…> : ::cuda::std::true_type → gcc 13/14: "global qualification of class name is invalid before ':' token"). Tracked upstream: NVIDIA/cccl#8833, onnxruntime#28023; fixed by cccl#8771 but not yet in a shipped CUDA toolkit. 1.24.2 predates the include and builds clean (it's also what 7.1.0 ships; its TensorRT-EP still auto-detects TRT 10.16).

flash-attn built from a pinned main commit (d16e381) with FLASH_ATTN_CUDA_ARCHS=110. Tagged releases up to v2.8.3.post1 only emit sm_80/90/100/120 (their setup.py has no sm_110 branch), so on Thor they fail at runtime with 'no kernel image is available for execution on the device'. main adds the CUDA-13 "Thor rename" gencode (compute_110f/sm_110). Restricting to arch 110 matches the rest of the image (torch + ORT are sm_110-only) and speeds the build.

Changes
- docker/dockerfiles/Dockerfile.onnx.jetson.7.2.0: new; re-derived from the current 7.1.0. r39.2 repo, CUDA 13.2 (builder + runtime), L4T_VERSION=39.2.0, the ORT/flash-attn pins above, Triton 3.6.0 built from source (+ cuda-nvcc-13-2 and the runtime ptxas/build-essential deps it needs for sm_110 JIT). Build parallelism exposed as args (PYTORCH_MAX_JOBS / ORT_BUILD_PARALLEL / FLASH_ATTN_MAX_JOBS, Depot-safe defaults); BuildKit cache ids version-scoped (jp72 / cu132).
- .github/workflows/docker.jetson.7.2.0.yml: new; mirrors the 7.1.0 Depot linux/arm64 build.
- inference_cli/lib/container_adapter.py: auto-detect L4T 39 / JetPack 7.2 → the 7.2.0 image (sorted-descending _JetsonImage entry; "7.2" matched before the "7" catch-all).
- tests/inference_cli/unit_tests/lib/test_container_adapter.py: cases for L4T 39 → 7.2.0 and jetpack 7.2 / 7.2.0 / 7.2-b187 → 7.2.0 (61 tests pass).

Validation & benchmarks
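The image auto-detection is covered by the unit tests noted in the change list; the ordering they exercise can be sketched roughly as below. This is a minimal illustration, not the code in container_adapter.py: the JetsonImage class, the IMAGES table, the select_image function, and the assumption that the "7" catch-all resolves to 7.1.0 are all hypothetical stand-ins for the adapter's _JetsonImage entries.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class JetsonImage:
    """Hypothetical stand-in for the adapter's _JetsonImage entries."""

    jetpack_prefix: str  # JetPack version prefix this entry matches
    image_version: str   # image version the prefix resolves to


# Illustrative table only. Mapping the "7" catch-all to 7.1.0 is an assumption.
IMAGES: List[JetsonImage] = [
    JetsonImage("7", "7.1.0"),
    JetsonImage("7.2", "7.2.0"),
]


def select_image(jetpack_version: str) -> Optional[str]:
    """Return the image version for a JetPack version string, or None."""
    # Descending lexicographic order puts any string ahead of its own prefixes,
    # so "7.2" is tried before the "7" catch-all.
    ordered = sorted(IMAGES, key=lambda entry: entry.jetpack_prefix, reverse=True)
    for entry in ordered:
        if jetpack_version.startswith(entry.jetpack_prefix):
            return entry.image_version
    return None


for version in ("7.2", "7.2.0", "7.2-b187", "7.1"):
    print(version, "->", select_image(version))
```

If the table were checked in insertion order instead, "7.2.0" would hit the "7" entry first and silently resolve to the older image, which is the case the ordering guards against.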
Built natively on a Thor device (aarch64, sm_110, JetPack 7.2) and validated in-container, key-free (public COCO weights). All FPS are batch-1, single-stream.
NVIDIA Thor cc (11,0); torchvision 0.25.0 NMS; flash-attn (sm_110) kernel.

RF-DETR is reported in tiers, because the model forward is a small slice of wall-clock and CPU preprocessing dominates.
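A tiered report of this kind comes from timing each stage separately rather than only the whole call. The sketch below shows one way to do that with the standard library; it is not the harness used for the numbers in this PR, and time_stages plus the three fake_* stage functions are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Dict, Sequence, Tuple

Stage = Tuple[str, Callable[[object], object]]


def time_stages(stages: Sequence[Stage], sample: object, iterations: int = 20) -> Dict[str, float]:
    """Return mean seconds per call for each stage, chaining stage outputs."""
    totals = {name: 0.0 for name, _ in stages}
    for _ in range(iterations):
        value = sample
        for name, fn in stages:
            start = time.perf_counter()
            value = fn(value)
            # GPU kernels launch asynchronously, so a real GPU stage needs a
            # synchronize (e.g. torch.cuda.synchronize()) before this line.
            totals[name] += time.perf_counter() - start
    return {name: total / iterations for name, total in totals.items()}


# Hypothetical CPU-only stand-ins; a real run would pass the model's own stages.
def fake_pre_process(frame):
    return [pixel / 255.0 for pixel in frame]


def fake_forward(tensor):
    return sum(tensor)


def fake_post_process(output):
    return {"score": output}


timings = time_stages(
    [
        ("pre_process", fake_pre_process),
        ("forward", fake_forward),
        ("post_process", fake_post_process),
    ],
    sample=list(range(10_000)),
)
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms")
print(f"end-to-end: {1.0 / sum(timings.values()):.1f} FPS")
```

Reporting the per-stage means next to the end-to-end figure makes it visible when a fast forward pass is hidden behind a slow preprocessing step.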
Tier 1 — model-forward throughput (what the Thor GPU delivers)
ORT TensorRT EP (fp16), synthetic fixed input, forward only:
Tier 2 — deployed single-model end-to-end (inference_models, real 1080p input)

Full pre_process → forward → post_process on a 1920×1080 frame, deployed ONNX backend. TensorRT EP is the faster execution provider end-to-end for every model; CUDA EP remains the shipped default (matches 7.1.0), and TRT EP is opt-in via ONNXRUNTIME_EXECUTION_PROVIDERS:

Even so, e2e (nano 59) is far below the forward (700): the ~15 ms CPU resize/normalize of a full 1080p frame dominates. That preprocessing wall is what the fast path and native-resolution input remove (below).
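The gap between the two figures can be checked with a quick per-frame budget. This sketch only converts the FPS numbers quoted above into milliseconds and assumes the stages run back to back on a single stream, so that end-to-end time is roughly the sum of forward and everything else; frame_time_ms is a helper name introduced here.

```python
def frame_time_ms(fps: float) -> float:
    """Convert a throughput in frames per second to milliseconds per frame."""
    return 1000.0 / fps


FORWARD_FPS = 700.0  # RF-DETR nano, forward only (figure quoted above)
E2E_FPS = 59.0       # RF-DETR nano, end-to-end at 1080p (figure quoted above)

forward_ms = frame_time_ms(FORWARD_FPS)  # about 1.43 ms
e2e_ms = frame_time_ms(E2E_FPS)          # about 16.95 ms

# Assumption: stages are serial, so the remainder is pre/post-processing.
non_forward_ms = e2e_ms - forward_ms     # about 15.5 ms
forward_share = forward_ms / e2e_ms      # about 0.084, i.e. roughly 8 %

print(f"forward:     {forward_ms:.2f} ms/frame")
print(f"end-to-end:  {e2e_ms:.2f} ms/frame")
print(f"non-forward: {non_forward_ms:.2f} ms/frame")
print(f"forward share of wall-clock: {forward_share:.1%}")
```

The roughly 15.5 ms remainder lines up with the ~15 ms resize/normalize cost stated above, and the forward pass accounts for under a tenth of each frame's wall-clock time.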
Tier 3 — RF-DETR instance-segmentation fast path (Triton + CUDA graphs + pipelining)
The merged CodeFlash optimization (#2464, inference ≥1.3.2): fused Triton GPU pre/post-processing + TensorRT CUDA graphs + depth-2 CPU/GPU pipeline scheduling, for the trt backend via InferencePipeline. This is why the rebase + Triton addition are in this PR. Measured on Thor across all six instance-segmentation sizes (TRT backend, @v3 block, enforce_dense_masks_in_inference_models: False, through InferencePipeline; 875-frame clips), baseline vs fast path on a 1080p stream and on each model's native-resolution stream:

The fast path is a consistent ~2.3–2.4× over baseline across the whole range. seg-nano's 99.9 FPS at native resolution matches the CodeFlash team's ~105 FPS on an Orin Nano, i.e. the Thor now performs as it should, and even seg-2xlarge stays real-time-capable. Enabled with:
(Constraints, per the optimization's authors: trt backend, @v3 seg block with non-dense masks, static batch 1, STRETCH_TO resize, via InferencePipeline.)

Note on CUDA graphs: the numbers above are without ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND. With it, the TRT engine I built locally for this validation (a plain OnnxParser FP16 build) fails CUDA-graph capture (cudaErrorStreamCaptureInvalidated); it isn't graph-capturable. Platform-served TRT packages (built by inference_models' own compile path, which the authors validated with CUDA graphs) support it for additional gain on top.

Where the rest of the time goes — input resolution & stream decode
Feeding each model an input already at its exact size (no 1080p→model resize) collapses preprocessing, and the TRT-EP forward advantage carries straight through:
Full stream end-to-end through InferencePipeline over real RTSP, using the live geary SF-cam (30 fps) and a local 120 fps source (so configs above 30 fps aren't capped), seg-nano:

Takeaway: the 7.2 image's compute is not the limiter; forward is 83–700 FPS and native-resolution e2e is 30–243 FPS. Deployed FPS at 1080p is spent on whole-frame resize (removed by the Triton fast path this image enables) and stream decode + the synchronous execution engine (a separate platform-wide pipelining effort). The one thing that must not be forgotten: set MAXN + jetson_clocks, without which the device runs at ~20 %.

🤖 Generated with Claude Code