YOLOv9-QAT TensorRT Q/DQ: Improved Speed and Zero Loss Accuracy #253

Closed
@levipereira

Description

This issue is outdated. Please follow the new repository:
https://github.com/levipereira/yolov9-qat

Please follow the original implementation in #327.

@WongKinYiu

I have developed the initial version of YOLOv9-QAT using the Q/DQ method, tailored specifically for YOLOv9 models intended for execution solely on TensorRT.

This implementation currently supports only the Inference Models (Converted and Gelan models).

The source code is available in the yolov9-qat branch.

Challenges

Quantizing all layers can, in some cases, decrease accuracy and increase latency, primarily due to the complexity of the last layer. To mitigate this, pass the --no-last-layer flag to qat.py quantize to exclude the last layer from quantization.

In this version, the scaling of the Quantize/Dequantize (Q/DQ) nodes is not yet optimized, which can lead to unnecessary data formats being generated. Restricting the Q/DQ scales in models/quantize.py so that the data formats match is essential to reduce latency.
Contributions from the community are welcome, as their knowledge is essential for the correct implementation of this functionality.
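
For readers unfamiliar with Q/DQ, the round trip performed by a Quantize/Dequantize pair can be sketched in plain Python. This is a minimal illustration of symmetric INT8 quantization, not the code in models/quantize.py; the helper names (scale_from_amax, quantize, dequantize, shared_scale) are hypothetical. The last helper shows one possible scale restriction, under the assumption that branches feeding the same operation should use a single common scale.

```python
def scale_from_amax(amax, num_bits=8):
    """Symmetric scale: map the largest magnitude to the top of the integer range."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    return amax / qmax


def quantize(x, scale, num_bits=8):
    """Q: divide by the scale, round, and clamp to the signed integer range."""
    qmax = 2 ** (num_bits - 1) - 1
    q = round(x / scale)
    return max(-qmax - 1, min(qmax, q))


def dequantize(q, scale):
    """DQ: map the integer back to a real value."""
    return q * scale


def shared_scale(amax_values, num_bits=8):
    """Assumed restriction: branches joined by one op share the largest scale."""
    return scale_from_amax(max(amax_values), num_bits)


# Two hypothetical branches with different calibrated ranges (invented values).
scale_a = scale_from_amax(2.0)
scale_b = scale_from_amax(6.0)
common = shared_scale([2.0, 6.0])

x = 1.2345
for name, s in [("branch a", scale_a), ("branch b", scale_b), ("shared", common)]:
    x_hat = dequantize(quantize(x, s), s)
    print(f"{name}: scale={s:.5f} reconstructed={x_hat:.5f} error={abs(x - x_hat):.5f}")
```

A shared scale costs some resolution on the branch with the smaller range, which is the trade-off such a restriction would have to weigh against the latency gain.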

Files Added / Modified

qat.py - Main

usage: qat.py [-h] {quantize,sensitive,eval} ...
positional arguments:
  {quantize,sensitive,eval}
    quantize            PTQ/QAT finetune ...
    sensitive           Sensitive layer analysis
    eval                Do evaluate

models/quantize.py - Quantize Module
models/quantize_rules.py - Quantize Rules
export.py - Changed to automatically detect QAT models and export them when using the flag --include onnx / onnx_end2end
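
The subcommand layout shown in the qat.py usage text above can be sketched with argparse. This is only an assumption about how such a CLI could be declared, not the actual contents of qat.py: the subcommand names and help strings come from the usage text, --no-last-layer is the flag mentioned under Challenges, and every other option of the real script is omitted.

```python
import argparse


def build_parser():
    """Build a parser with the three subcommands listed in the usage text."""
    parser = argparse.ArgumentParser(prog="qat.py")
    subparsers = parser.add_subparsers(dest="cmd", required=True)

    quantize = subparsers.add_parser("quantize", help="PTQ/QAT finetune ...")
    # Only the flag mentioned in this issue is declared here.
    quantize.add_argument("--no-last-layer", action="store_true",
                          help="exclude the last layer from quantization")

    subparsers.add_parser("sensitive", help="Sensitive layer analysis")
    subparsers.add_parser("eval", help="Do evaluate")
    return parser


args = build_parser().parse_args(["quantize", "--no-last-layer"])
print(args.cmd, args.no_last_layer)
```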

Accuracy Report

QAT - YOLOV9-C - ALL LAYERS
Eval Model | AP       | AP50     | Precision  | Recall
-------------------------------------------------------
Origin     | 0.5297   | 0.699    | 0.7432     | 0.634
PTQ        | 0.5295   | 0.6978   | 0.7455     | 0.6306
QAT - Best | 0.5291   | 0.6978   | 0.7449     | 0.632

QAT - YOLOV9-C - NO QAT LAST LAYER
Eval Model | AP       | AP50     | Precision  | Recall
-------------------------------------------------------
Origin     | 0.5297   | 0.699    | 0.7432     | 0.634
PTQ        | 0.529    | 0.698    | 0.7459     | 0.6297
QAT - Best | 0.5299   | 0.6984   | 0.7469     | 0.6305

QAT - YOLOV9-E - ALL LAYERS
Eval Model | AP       | AP50     | Precision  | Recall
-------------------------------------------------------
Origin     | 0.5576   | 0.7246   | 0.7547     | 0.6649
PTQ        | 0.5565   | 0.7241   | 0.7499     | 0.6649
QAT - Best | 0.5566   | 0.7232   | 0.7538     | 0.6637

QAT - YOLOV9-E - NO QAT LAST LAYER
Eval Model | AP       | AP50     | Precision  | Recall
-------------------------------------------------------
Origin     | 0.5576   | 0.7246   | 0.7547     | 0.6649
PTQ        | 0.5569   | 0.7242   | 0.7497     | 0.6646
QAT - Best | 0.5569   | 0.7239   | 0.7486     | 0.6657
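
To put the differences in the tables above in perspective, the AP change relative to the original model can be computed directly from the reported values. The snippet below is a small sketch using numbers copied from the tables; the dictionary layout and the ap_change helper are illustrative only and are not part of qat.py.

```python
# AP values copied from the tables above: (Origin, PTQ, QAT best).
AP = {
    "YOLOv9-C, all layers": (0.5297, 0.5295, 0.5291),
    "YOLOv9-C, last layer not quantized": (0.5297, 0.529, 0.5299),
    "YOLOv9-E, all layers": (0.5576, 0.5565, 0.5566),
    "YOLOv9-E, last layer not quantized": (0.5576, 0.5569, 0.5569),
}


def ap_change(origin, quantized):
    """Return (absolute change, relative change in percent) versus the original."""
    delta = quantized - origin
    return delta, 100.0 * delta / origin


for name, (origin, ptq, qat) in AP.items():
    d_ptq, r_ptq = ap_change(origin, ptq)
    d_qat, r_qat = ap_change(origin, qat)
    print(f"{name}: PTQ {d_ptq:+.4f} ({r_ptq:+.2f}%), QAT {d_qat:+.4f} ({r_qat:+.2f}%)")
```

With these numbers, every quantized variant stays within about 0.2% of the original AP in relative terms.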



Result using TensorRT engine Models on Triton-Server
Tool: https://github.com/levipereira/triton-client-yolo

========================= EVALUATION SUMMARY - YOLOV9-C ========================
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.528
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.701
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.577
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.361
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.582
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.689
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.392
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.652
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.701
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.538
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.759
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.848
================================================================================
mAP@0.5:0.95: 0.528
mAP@0.5:      0.701
mAP@0.75:     0.577
================================================================================


========================= EVALUATION SUMMARY - YOLOV9-C-QAT ========================
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.528
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.699
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.576
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.359
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.581
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.692
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.392
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.651
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.699
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.534
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.758
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.845
================================================================================
mAP@0.5:0.95: 0.528
mAP@0.5:      0.699
mAP@0.75:     0.576
================================================================================

Latency Report

  • Device Properties:
    • Selected Device: NVIDIA GeForce RTX 4090
      • Compute Capability: 8.9
      • SMs: 128
      • Compute Clock Rate: 2.58 GHz
      • Device Global Memory: 24207 MiB
      • Shared Memory per SM: 100 KiB
      • Memory Bus Width: 384 bits
      • Memory Clock Rate: 10.501 GHz

Table Info:

  • "Average time": refers to the sum of the layer latencies, when profiling layers separately.
  • "Throughput": is measured in inferences per second (IPS).

Origin

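Model        | Precision Type | Batch Size | Layers | Weights (MB) | Activations (MB) | Throughput (IPS) | Total Throughput (IPS) | Average time (ms)
-----------------------------------------------------------------------------------------------------------------------------------------------------
yolov9-c     | FP16           | 1          | 271    | 48.2         | 611.7            | 792              | 792                    | 2.1
yolov9-c     | FP16           | 8          | 273    | 48.2         | 4809.1           | 151              | 1209                   | 7.3
yolov9-e     | FP16           | 8          | 477    | 109.3        | 13461.3          | 57               | 457                    | 18.8
yolov9-e     | FP16           | 1          | 487    | 109.3        | 1706.5           | 353              | 353                    | 4.3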

Last Layer not Quantized

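Model        | Precision Type | Batch Size | Layers | Weights (MB) | Activations (MB) | Throughput (IPS) | Total Throughput (IPS) | Average time (ms)
-----------------------------------------------------------------------------------------------------------------------------------------------------
yolov9-c-qat | FP16 INT8      | 1          | 288    | 29.4         | 534.7            | 951              | 951                    | 1.9
yolov9-c-qat | FP16 INT8      | 8          | 287    | 29.4         | 4190.2           | 181              | 1447                   | 6.4
yolov9-e-qat | FP16 INT8      | 1          | 526    | 63.1         | 1757.0           | 405              | 405                    | 4.1
yolov9-e-qat | FP16 INT8      | 8          | 526    | 63.1         | 13407.7          | 60               | 482                    | 18.2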

All Layers Quantized

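Model        | Precision Type | Batch Size | Layers | Weights (MB) | Activations (MB) | Throughput (IPS) | Total Throughput (IPS) | Average time (ms)
-----------------------------------------------------------------------------------------------------------------------------------------------------
yolov9-c-qat | FP16 INT8      | 1          | 295    | 24.2         | 540.1            | 957              | 957                    | 1.9
yolov9-c-qat | FP16 INT8      | 8          | 293    | 24.2         | 4216.7           | 193              | 1547                   | 6.1
yolov9-e-qat | FP16 INT8      | 1          | 532    | 57.8         | 1779.5           | 396              | 396                    | 4.1
yolov9-e-qat | FP16 INT8      | 8          | 532    | 57.8         | 13431.8          | 62               | 493                    | 17.8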
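
The latency tables above can be summarized as speedup ratios. The sketch below uses the batch-size-1 throughput values copied from the tables; the helper names (speedup, total_throughput) are illustrative. It also assumes that "Total Throughput" is the per-inference throughput multiplied by the batch size, which matches the reported values only approximately (for example 151 x 8 = 1208 against a reported 1209), presumably because the table entries are rounded.

```python
# Throughput (IPS) at batch size 1, copied from the tables above.
FP16 = {"yolov9-c": 792, "yolov9-e": 353}
QAT_NO_LAST = {"yolov9-c": 951, "yolov9-e": 405}
QAT_ALL = {"yolov9-c": 957, "yolov9-e": 396}


def speedup(baseline_ips, new_ips):
    """Ratio of throughputs; values above 1.0 mean the new engine is faster."""
    return new_ips / baseline_ips


def total_throughput(inferences_per_second, batch_size):
    """Images per second when each inference processes a whole batch (assumed relation)."""
    return inferences_per_second * batch_size


for model in FP16:
    print(model,
          f"last layer not quantized: x{speedup(FP16[model], QAT_NO_LAST[model]):.2f},",
          f"all layers quantized: x{speedup(FP16[model], QAT_ALL[model]):.2f}")

# FP16 yolov9-c at batch size 8: 151 inferences per second, 8 images each.
print(total_throughput(151, 8))
```

At batch size 1 this gives roughly a 1.2x gain for yolov9-c and between 1.12x and 1.15x for yolov9-e.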
