# DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, Heung-Yeung Shum

[arXiv] [BibTeX]



## DINO with modified training engine

We provide a hacked `train_net.py` that aligns the optimizer parameters with those of Deformable-DETR, which achieves better results on DINO models.

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| ---- | -------- | -------- | ------ | ----------------- | ------ | -------- |
| DINO-R50-4scale (hacked trainer) | R-50 | IN1k | 12 | 100 | 49.4 | model |
| DINO-R50-4scale (hacked trainer) | R-50 | IN1k | 12 | 100 | 49.8 | model |
| DINO-R50-4scale (hacked trainer) | R-50 | IN1k | 12 | 300 | 50.0 | model |

- Training the model with the hacked trainer:

```shell
python projects/dino/train_net.py --config-file /path/to/config.py --num-gpus 8
```

## Main Results with Pretrained Models

### Pretrained DINO with ResNet Backbone

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| ---- | -------- | -------- | ------ | ----------------- | ------ | -------- |
| DINO-R50-4scale | R-50 | IN1k | 12 | 100 | 49.2 | model |
| DINO-R50-4scale with AMP | R-50 | IN1k | 12 | 100 | 49.1 | - |
| DINO-R50-4scale with EMA | R-50 | IN1k | 12 | 100 | 49.4 | model |
| DINO-R50-5scale | R-50 | IN1k | 12 | 100 | 49.6 | model |
| DINO-R50-4scale | R-50 | IN1k | 12 | 300 | 49.5 | model |
| DINO-R50-4scale | R-50 | IN1k | 24 | 100 | 50.6 | model |
| DINO-R101-4scale | R-101 | IN1k | 12 | 100 | 50.0 | model |
### Pretrained DINO with Swin-Transformer Backbone

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| ---- | -------- | -------- | ------ | ----------------- | ------ | -------- |
| DINO-Swin-T-224-4scale | Swin-Tiny-224 | IN1k | 12 | 100 | 51.3 | model |
| DINO-Swin-T-224-4scale | Swin-Tiny-224 | IN22k to IN1k | 12 | 100 | 52.5 | model |
| DINO-Swin-S-224-4scale | Swin-Small-224 | IN1k | 12 | 100 | 53.0 | model |
| DINO-Swin-S-224-4scale | Swin-Small-224 | IN22k to IN1k | 12 | 100 | 54.5 | model |
| DINO-Swin-B-384-4scale | Swin-Base-384 | IN22k to IN1k | 12 | 100 | 55.8 | model |
| DINO-Swin-L-224-4scale | Swin-Large-224 | IN22k to IN1k | 12 | 100 | 56.9 | model |
| DINO-Swin-L-384-4scale | Swin-Large-384 | IN22k to IN1k | 12 | 100 | 56.9 | model |
| DINO-Swin-L-384-5scale | Swin-Large-384 | IN22k to IN1k | 12 | 100 | 57.5 | model |
| DINO-Swin-L-384-4scale | Swin-Large-384 | IN22k to IN1k | 36 | 100 | 58.1 | model |
| DINO-Swin-L-384-5scale | Swin-Large-384 | IN22k to IN1k | 36 | 100 | 58.5 | model |
### Pretrained DINO with FocalNet Backbone

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| ---- | -------- | -------- | ------ | ----------------- | ------ | -------- |
| DINO-Focal-Large-4scale | FocalNet-384-LRF-3Level | IN22k | 12 | 100 | 57.5 | model |
| DINO-Focal-Large-4scale | FocalNet-384-LRF-3Level | IN22k | 36 | 100 | 58.3 | model |
| DINO-Focal-Large-4scale | FocalNet-384-LRF-4Level | IN22k | 12 | 100 | 58.0 | model |
| DINO-Focal-Large-5scale | FocalNet-384-LRF-4Level | IN22k | 12 | 100 | 58.5 | model |
### Pretrained DINO with ViT Backbone

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| ---- | -------- | -------- | ------ | ----------------- | ------ | -------- |
| DINO-ViTDet-Base-4scale | ViT | IN1k, MAE | 12 | 100 | 50.2 | model |
| DINO-ViTDet-Base-4scale | ViT | IN1k, MAE | 50 | 100 | 55.0 | model |
| DINO-ViTDet-Large-4scale | ViT | IN1k, MAE | 12 | 100 | 52.9 | model |
| DINO-ViTDet-Large-4scale | ViT | IN1k, MAE | 50 | 100 | 57.5 | model |
### Pretrained DINO with ConvNeXt Backbone

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| ---- | -------- | -------- | ------ | ----------------- | ------ | -------- |
| DINO-ConvNeXt-Tiny-384-4scale | ConvNeXt-Tiny-384 | IN1k | 12 | 100 | 51.4 | model |
| DINO-ConvNeXt-Tiny-384-4scale | ConvNeXt-Tiny-384 | IN22k | 12 | 100 | 52.4 | model |
| DINO-ConvNeXt-Small-384-4scale | ConvNeXt-Small-384 | IN1k | 12 | 100 | 52.0 | model |
| DINO-ConvNeXt-Small-384-4scale | ConvNeXt-Small-384 | IN22k | 12 | 100 | 54.2 | model |
| DINO-ConvNeXt-Base-384-4scale | ConvNeXt-Base-384 | IN1k | 12 | 100 | 52.6 | model |
| DINO-ConvNeXt-Base-384-4scale | ConvNeXt-Base-384 | IN22k | 12 | 100 | 55.1 | model |
| DINO-ConvNeXt-Large-384-4scale | ConvNeXt-Large-384 | IN1k | 12 | 100 | 53.4 | model |
| DINO-ConvNeXt-Large-384-4scale | ConvNeXt-Large-384 | IN22k | 12 | 100 | 55.5 | model |
### Pretrained DINO with InternImage Backbone

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| ---- | -------- | -------- | ------ | ----------------- | ------ | -------- |
| DINO-InternImage-Tiny-4scale | InternImage-Tiny | IN1k | 12 | 100 | 52.3 | model |
| DINO-InternImage-Small-4scale | InternImage-Small | IN1k | 12 | 100 | 53.6 | model |
| DINO-InternImage-Base-4scale | InternImage-Base | IN1k | 12 | 100 | 54.7 | model |
| DINO-InternImage-Large-4scale | InternImage-Large | IN22k | 12 | 100 | 57.0 | model |
### Pretrained DINO with EVA Backbone

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| ---- | -------- | -------- | ------ | ----------------- | ------ | -------- |
| DINO-EVA-01 | EVA-01 | o365 | 12 | 100 | 59.1 | huggingface |

Note:

- Swin-X-384 means the backbone was pretrained at 384 x 384 resolution, and "IN22k to IN1k" means the model was pretrained on ImageNet-22k and finetuned on ImageNet-1k.
- The ViT backbones use MAE pretraining weights following ViTDet, which can be downloaded from the MAE repo. Training ViTDet-DINO is unstable without a warmup LR scheduler.
- Focal-LRF-3Level means using a Large Receptive Field (LRF) with the focal level set to 3; please refer to FocalNet for more details on the backbone settings.
- "with AMP" means mixed-precision training.
- "with EMA" means training with an Exponential Moving Average (EMA) of the model weights.
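The EMA trick can be sketched in a few lines: a shadow copy of the weights is updated as a running average after every optimizer step, and the smoothed weights are used for evaluation. Below is a minimal framework-agnostic sketch; the decay value is illustrative, not the exact value detrex uses.

```python
class ModelEMA:
    """Keep an exponential moving average (EMA) of parameter values.

    After each training step, call update() with the current parameters;
    the shadow values track a smoothed version of the weights.
    """

    def __init__(self, params, decay=0.999):
        self.decay = decay
        # Initialize the shadow copy from the starting parameter values.
        self.shadow = {name: float(value) for name, value in params.items()}

    def update(self, params):
        for name, value in params.items():
            # shadow = decay * shadow + (1 - decay) * current
            self.shadow[name] = (
                self.decay * self.shadow[name] + (1.0 - self.decay) * float(value)
            )

# Toy usage: a single scalar "weight" that jumps to 1.0 during training.
ema = ModelEMA({"w": 0.0}, decay=0.9)
for step_value in [1.0, 1.0, 1.0]:
    ema.update({"w": step_value})
print(round(ema.shadow["w"], 3))  # 0.271 -- smoothed toward 1.0
```

With a decay close to 1 (e.g. 0.999), the shadow weights average over many recent steps, which typically reduces the variance of the final checkpoint.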

Notable facts and caveats: the position embedding of DINO in detrex differs from the original repo. We set the temperature and offset in `PositionEmbeddingSine` to 10000 and -0.5, which may make the model converge slightly faster in the early stage and yields slightly better results (about 0.1 mAP) in the 12-epoch setting.
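For context, a DETR-style sine position embedding maps a normalized coordinate to interleaved sin/cos features whose frequencies follow a geometric schedule controlled by the temperature. The pure-Python sketch below only illustrates the roles of the temperature and the -0.5 offset; the exact `(index + offset) / length * scale` normalization is an assumption about the detrex implementation, not a verbatim copy of it.

```python
import math

def sine_position_embedding(index, length, num_pos_feats=8,
                            temperature=10000.0, offset=-0.5):
    """1-D DETR-style sine embedding for a pixel index along one axis.

    The index is shifted by `offset`, normalized by the axis length, and
    scaled to [0, 2*pi) before the sin/cos expansion.
    """
    scale = 2.0 * math.pi
    pos = (index + offset) / length * scale
    embedding = []
    for i in range(num_pos_feats):
        # Geometric frequency schedule controlled by the temperature:
        # consecutive (sin, cos) pairs share the same frequency.
        dim_t = temperature ** (2 * (i // 2) / num_pos_feats)
        value = pos / dim_t
        embedding.append(math.sin(value) if i % 2 == 0 else math.cos(value))
    return embedding

emb = sine_position_embedding(index=1, length=4)
print(len(emb))  # 8
```

A larger temperature stretches the low-frequency end of the schedule, while the -0.5 offset centers each pixel's coordinate on the middle of its cell rather than its edge.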

## Training

All configs can be trained with:

```shell
cd detrex
python tools/train_net.py --config-file projects/dino/configs/path/to/config.py --num-gpus 8
```

By default, we use 8 GPUs with a total batch size of 16 for training.
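If you train with a different number of GPUs or a different total batch size, the learning rate usually needs adjusting as well. A common heuristic is linear scaling; this is a general convention, not an official detrex recommendation, and the base learning rate below is illustrative.

```python
def scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Linear learning-rate scaling: lr grows in proportion to the
    total batch size relative to the reference setup."""
    return base_lr * new_batch_size / base_batch_size

# Example: halving the reference batch size of 16 halves the lr.
# The base_lr value 1e-4 is a placeholder, not the exact config value.
print(scaled_lr(base_lr=1e-4, base_batch_size=16, new_batch_size=8))  # 5e-05
```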

## Evaluation

Model evaluation can be done as follows:

```shell
cd detrex
python tools/train_net.py --config-file projects/dino/configs/path/to/config.py --eval-only train.init_checkpoint=/path/to/model_checkpoint
```

## Citing DINO

If you find our work helpful for your research, please consider citing the following BibTeX entry.

```BibTeX
@misc{zhang2022dino,
      title={DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection},
      author={Hao Zhang and Feng Li and Shilong Liu and Lei Zhang and Hang Su and Jun Zhu and Lionel M. Ni and Heung-Yeung Shum},
      year={2022},
      eprint={2203.03605},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```