# DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, Heung-Yeung Shum
- DINO with modified training engine
- Main Results with Pretrained Models
- Training DINO
- Evaluate DINO
- Citation
## DINO with modified training engine

We provide a hacked `train_net.py` that aligns the optimizer parameters with Deformable-DETR and can achieve better results on DINO models.
| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| --- | --- | --- | --- | --- | --- | --- |
| DINO-R50-4scale (hacked trainer) | R-50 | IN1k | 12 | 100 | 49.4 | model |
| DINO-R50-4scale (hacked trainer) | R-50 | IN1k | 12 | 100 | 49.8 | model |
| DINO-R50-4scale (hacked trainer) | R-50 | IN1k | 12 | 300 | 50.0 | model |
Training with the hacked trainer:

```bash
python projects/dino/train_net.py --config-file /path/to/config.py --num-gpus 8
```
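For context, Deformable-DETR's optimizer setup differs mainly in per-module learning rates: a smaller lr for the backbone and for the deformable-attention linear projections. The sketch below shows what such parameter grouping could look like; the module-name filters and multiplier values are illustrative assumptions, not the exact contents of `train_net.py`.

```python
# Sketch of Deformable-DETR-style optimizer parameter groups.
# The name filters and multipliers are assumptions for illustration;
# see projects/dino/train_net.py for the actual values used.
import torch

def build_optimizer(model, base_lr=1e-4, backbone_mult=0.1, proj_mult=0.1):
    backbone_params, proj_params, other_params = [], [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if "backbone" in name:
            backbone_params.append(param)   # backbone trained with a smaller lr
        elif "sampling_offsets" in name or "reference_points" in name:
            proj_params.append(param)       # deformable-attention projections
        else:
            other_params.append(param)
    return torch.optim.AdamW(
        [
            {"params": other_params, "lr": base_lr},
            {"params": backbone_params, "lr": base_lr * backbone_mult},
            {"params": proj_params, "lr": base_lr * proj_mult},
        ],
        lr=base_lr,
        weight_decay=1e-4,
    )
```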
## Main Results with Pretrained Models

### DINO with ResNet Backbone

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| --- | --- | --- | --- | --- | --- | --- |
| DINO-R50-4scale | R-50 | IN1k | 12 | 100 | 49.2 | model |
| DINO-R50-4scale with AMP | R-50 | IN1k | 12 | 100 | 49.1 | - |
| DINO-R50-4scale with EMA | R-50 | IN1k | 12 | 100 | 49.4 | model |
| DINO-R50-5scale | R-50 | IN1k | 12 | 100 | 49.6 | model |
| DINO-R50-4scale | R-50 | IN1k | 12 | 300 | 49.5 | model |
| DINO-R50-4scale | R-50 | IN1k | 24 | 100 | 50.6 | model |
| DINO-R101-4scale | R-101 | IN1k | 12 | 100 | 50.0 | model |
### DINO with Swin-Transformer Backbone

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| --- | --- | --- | --- | --- | --- | --- |
| DINO-Swin-T-224-4scale | Swin-Tiny-224 | IN1k | 12 | 100 | 51.3 | model |
| DINO-Swin-T-224-4scale | Swin-Tiny-224 | IN22k to IN1k | 12 | 100 | 52.5 | model |
| DINO-Swin-S-224-4scale | Swin-Small-224 | IN1k | 12 | 100 | 53.0 | model |
| DINO-Swin-S-224-4scale | Swin-Small-224 | IN22k to IN1k | 12 | 100 | 54.5 | model |
| DINO-Swin-B-384-4scale | Swin-Base-384 | IN22k to IN1k | 12 | 100 | 55.8 | model |
| DINO-Swin-L-224-4scale | Swin-Large-224 | IN22k to IN1k | 12 | 100 | 56.9 | model |
| DINO-Swin-L-384-4scale | Swin-Large-384 | IN22k to IN1k | 12 | 100 | 56.9 | model |
| DINO-Swin-L-384-5scale | Swin-Large-384 | IN22k to IN1k | 12 | 100 | 57.5 | model |
| DINO-Swin-L-384-4scale | Swin-Large-384 | IN22k to IN1k | 36 | 100 | 58.1 | model |
| DINO-Swin-L-384-5scale | Swin-Large-384 | IN22k to IN1k | 36 | 100 | 58.5 | model |
### DINO with FocalNet Backbone

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| --- | --- | --- | --- | --- | --- | --- |
| DINO-Focal-Large-4scale | FocalNet-384-LRF-3Level | IN22k | 12 | 100 | 57.5 | model |
| DINO-Focal-Large-4scale | FocalNet-384-LRF-3Level | IN22k | 36 | 100 | 58.3 | model |
| DINO-Focal-Large-4scale | FocalNet-384-LRF-4Level | IN22k | 12 | 100 | 58.0 | model |
| DINO-Focal-Large-5scale | FocalNet-384-LRF-4Level | IN22k | 12 | 100 | 58.5 | model |
### DINO with ViT (ViTDet) Backbone

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| --- | --- | --- | --- | --- | --- | --- |
| DINO-ViTDet-Base-4scale | ViT | IN1k, MAE | 12 | 100 | 50.2 | model |
| DINO-ViTDet-Base-4scale | ViT | IN1k, MAE | 50 | 100 | 55.0 | model |
| DINO-ViTDet-Large-4scale | ViT | IN1k, MAE | 12 | 100 | 52.9 | model |
| DINO-ViTDet-Large-4scale | ViT | IN1k, MAE | 50 | 100 | 57.5 | model |
### DINO with ConvNeXt Backbone

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| --- | --- | --- | --- | --- | --- | --- |
| DINO-ConvNeXt-Tiny-384-4scale | ConvNeXt-Tiny-384 | IN1k | 12 | 100 | 51.4 | model |
| DINO-ConvNeXt-Tiny-384-4scale | ConvNeXt-Tiny-384 | IN22k | 12 | 100 | 52.4 | model |
| DINO-ConvNeXt-Small-384-4scale | ConvNeXt-Small-384 | IN1k | 12 | 100 | 52.0 | model |
| DINO-ConvNeXt-Small-384-4scale | ConvNeXt-Small-384 | IN22k | 12 | 100 | 54.2 | model |
| DINO-ConvNeXt-Base-384-4scale | ConvNeXt-Base-384 | IN1k | 12 | 100 | 52.6 | model |
| DINO-ConvNeXt-Base-384-4scale | ConvNeXt-Base-384 | IN22k | 12 | 100 | 55.1 | model |
| DINO-ConvNeXt-Large-384-4scale | ConvNeXt-Large-384 | IN1k | 12 | 100 | 53.4 | model |
| DINO-ConvNeXt-Large-384-4scale | ConvNeXt-Large-384 | IN22k | 12 | 100 | 55.5 | model |
### DINO with InternImage Backbone

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| --- | --- | --- | --- | --- | --- | --- |
| DINO-InternImage-Tiny-4scale | InternImage-Tiny | IN1k | 12 | 100 | 52.3 | model |
| DINO-InternImage-Small-4scale | InternImage-Small | IN1k | 12 | 100 | 53.6 | model |
| DINO-InternImage-Base-4scale | InternImage-Base | IN1k | 12 | 100 | 54.7 | model |
| DINO-InternImage-Large-4scale | InternImage-Large | IN22k | 12 | 100 | 57.0 | model |
### DINO with EVA Backbone

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| --- | --- | --- | --- | --- | --- | --- |
| DINO-EVA-01 | EVA-01 | o365 | 12 | 100 | 59.1 | huggingface |
Note:

- `Swin-X-384` means the backbone's pretraining resolution is `384 x 384`, and `IN22k to IN1k` means the model is pretrained on `ImageNet-22k` and fine-tuned on `ImageNet-1k`.
- The ViT backbones use MAE pretraining weights, following ViTDet, which can be downloaded from MAE. Note that training ViTDet-DINO is unstable without a warmup lr-scheduler.
- `Focal-LRF-3Level` means using the Large Receptive Field (LRF) setting with `Focal-Level` set to `3`; please refer to FocalNet for more details about the backbone settings.
- `with AMP` means training with mixed precision.
- `with EMA` means training with a model Exponential Moving Average (EMA); a possible command-line toggle for AMP and EMA is sketched after these notes.
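If AMP and EMA are enabled through command-line config overrides rather than dedicated config files, the toggles might look like the following. The keys `train.amp.enabled` and `train.model_ema.enabled` are assumptions based on detectron2's LazyConfig defaults and detrex's EMA support, not confirmed by this README; verify them against the actual DINO configs.

```bash
# Assumed config keys; check the actual projects/dino configs before use.
# Mixed-precision (AMP) training:
python tools/train_net.py --config-file projects/dino/configs/path/to/config.py \
    --num-gpus 8 train.amp.enabled=True

# Training with model EMA:
python tools/train_net.py --config-file projects/dino/configs/path/to/config.py \
    --num-gpus 8 train.model_ema.enabled=True
```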
Notable facts and caveats: the position embedding of DINO in detrex differs from the original repo. We set the temperature and offset in `PositionEmbeddingSine` to `10000` and `-0.5`, which may make the model converge a little faster in the early stage and yield slightly better results (about 0.1 mAP) under the 12-epoch setting.
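In LazyConfig form, these settings might look like the sketch below. The import path, `num_pos_feats` value, and `normalize` flag are assumptions based on detrex's layout, not a verbatim copy of the project config.

```python
# Sketch only: temperature and offset follow the note above; the other
# fields are assumptions. See projects/dino/configs for the real config.
from detectron2.config import LazyCall as L
from detrex.layers import PositionEmbeddingSine

position_embedding = L(PositionEmbeddingSine)(
    num_pos_feats=128,   # assumed: half of a 256-d hidden dim
    temperature=10000,   # detrex setting described in the note above
    normalize=True,      # assumed normalization flag
    offset=-0.5,         # offset described in the note above
)
```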
## Training DINO

All configs can be trained with:

```bash
cd detrex
python tools/train_net.py --config-file projects/dino/configs/path/to/config.py --num-gpus 8
```

By default, we use 8 GPUs with a total batch size of 16 for training.
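If you train with fewer GPUs, you may want to keep the effective batch size at 16 via a config override. A hedged example, assuming the dataloader config exposes `dataloader.train.total_batch_size` as in detrex's common configs:

```bash
# Assumed override key; verify against the config you are using.
python tools/train_net.py --config-file projects/dino/configs/path/to/config.py \
    --num-gpus 4 dataloader.train.total_batch_size=16
```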
## Evaluate DINO

Model evaluation can be done as follows:

```bash
cd detrex
python tools/train_net.py --config-file projects/dino/configs/path/to/config.py --eval-only train.init_checkpoint=/path/to/model_checkpoint
```
## Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

```BibTeX
@misc{zhang2022dino,
      title={DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection},
      author={Hao Zhang and Feng Li and Shilong Liu and Lei Zhang and Hang Su and Jun Zhu and Lionel M. Ni and Heung-Yeung Shum},
      year={2022},
      eprint={2203.03605},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```