# DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, Heung-Yeung Shum

[arXiv] [BibTeX]



## DINO with modified training engine

We provide a hacked `train_net.py` that aligns the optimizer parameters with those of Deformable-DETR, which achieves better results on DINO models.

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| ---- | -------- | -------- | ------ | ----------------- | ------ | -------- |
| DINO-R50-4scale (hacked trainer) | R-50 | IN1k | 12 | 100 | 49.4 | model |
| DINO-R50-4scale (hacked trainer) | R-50 | IN1k | 12 | 100 | 49.8 | model |
| DINO-R50-4scale (hacked trainer) | R-50 | IN1k | 12 | 300 | 50.0 | model |

- Training the model with the hacked trainer:

```shell
python projects/dino/train_net.py --config-file /path/to/config.py --num-gpus 8
```

## Main Results with Pretrained Models

### Pretrained DINO with ResNet Backbone

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| ---- | -------- | -------- | ------ | ----------------- | ------ | -------- |
| DINO-R50-4scale | R-50 | IN1k | 12 | 100 | 49.2 | model |
| DINO-R50-4scale with AMP | R-50 | IN1k | 12 | 100 | 49.1 | - |
| DINO-R50-4scale with EMA | R-50 | IN1k | 12 | 100 | 49.4 | model |
| DINO-R50-5scale | R-50 | IN1k | 12 | 100 | 49.6 | model |
| DINO-R50-4scale | R-50 | IN1k | 12 | 300 | 49.5 | model |
| DINO-R50-4scale | R-50 | IN1k | 24 | 100 | 50.6 | model |
| DINO-R101-4scale | R-101 | IN1k | 12 | 100 | 50.0 | model |
### Pretrained DINO with Swin-Transformer Backbone

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| ---- | -------- | -------- | ------ | ----------------- | ------ | -------- |
| DINO-Swin-T-224-4scale | Swin-Tiny-224 | IN1k | 12 | 100 | 51.3 | model |
| DINO-Swin-T-224-4scale | Swin-Tiny-224 | IN22k to IN1k | 12 | 100 | 52.5 | model |
| DINO-Swin-S-224-4scale | Swin-Small-224 | IN1k | 12 | 100 | 53.0 | model |
| DINO-Swin-S-224-4scale | Swin-Small-224 | IN22k to IN1k | 12 | 100 | 54.5 | model |
| DINO-Swin-B-384-4scale | Swin-Base-384 | IN22k to IN1k | 12 | 100 | 55.8 | model |
| DINO-Swin-L-224-4scale | Swin-Large-224 | IN22k to IN1k | 12 | 100 | 56.9 | model |
| DINO-Swin-L-384-4scale | Swin-Large-384 | IN22k to IN1k | 12 | 100 | 56.9 | model |
| DINO-Swin-L-384-5scale | Swin-Large-384 | IN22k to IN1k | 12 | 100 | 57.5 | model |
| DINO-Swin-L-384-4scale | Swin-Large-384 | IN22k to IN1k | 36 | 100 | 58.1 | model |
| DINO-Swin-L-384-5scale | Swin-Large-384 | IN22k to IN1k | 36 | 100 | 58.5 | model |
### Pretrained DINO with FocalNet Backbone

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| ---- | -------- | -------- | ------ | ----------------- | ------ | -------- |
| DINO-Focal-Large-4scale | FocalNet-384-LRF-3Level | IN22k | 12 | 100 | 57.5 | model |
| DINO-Focal-Large-4scale | FocalNet-384-LRF-3Level | IN22k | 36 | 100 | 58.3 | model |
| DINO-Focal-Large-4scale | FocalNet-384-LRF-4Level | IN22k | 12 | 100 | 58.0 | model |
| DINO-Focal-Large-5scale | FocalNet-384-LRF-4Level | IN22k | 12 | 100 | 58.5 | model |
### Pretrained DINO with ViT Backbone

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| ---- | -------- | -------- | ------ | ----------------- | ------ | -------- |
| DINO-ViTDet-Base-4scale | ViT | IN1k, MAE | 12 | 100 | 50.2 | model |
| DINO-ViTDet-Base-4scale | ViT | IN1k, MAE | 50 | 100 | 55.0 | model |
| DINO-ViTDet-Large-4scale | ViT | IN1k, MAE | 12 | 100 | 52.9 | model |
| DINO-ViTDet-Large-4scale | ViT | IN1k, MAE | 50 | 100 | 57.5 | model |
### Pretrained DINO with ConvNeXt Backbone

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| ---- | -------- | -------- | ------ | ----------------- | ------ | -------- |
| DINO-ConvNeXt-Tiny-384-4scale | ConvNeXt-Tiny-384 | IN1k | 12 | 100 | 51.4 | model |
| DINO-ConvNeXt-Tiny-384-4scale | ConvNeXt-Tiny-384 | IN22k | 12 | 100 | 52.4 | model |
| DINO-ConvNeXt-Small-384-4scale | ConvNeXt-Small-384 | IN1k | 12 | 100 | 52.0 | model |
| DINO-ConvNeXt-Small-384-4scale | ConvNeXt-Small-384 | IN22k | 12 | 100 | 54.2 | model |
| DINO-ConvNeXt-Base-384-4scale | ConvNeXt-Base-384 | IN1k | 12 | 100 | 52.6 | model |
| DINO-ConvNeXt-Base-384-4scale | ConvNeXt-Base-384 | IN22k | 12 | 100 | 55.1 | model |
| DINO-ConvNeXt-Large-384-4scale | ConvNeXt-Large-384 | IN1k | 12 | 100 | 53.4 | model |
| DINO-ConvNeXt-Large-384-4scale | ConvNeXt-Large-384 | IN22k | 12 | 100 | 55.5 | model |
### Pretrained DINO with InternImage Backbone

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| ---- | -------- | -------- | ------ | ----------------- | ------ | -------- |
| DINO-InternImage-Tiny-4scale | InternImage-Tiny | IN1k | 12 | 100 | 52.3 | model |
| DINO-InternImage-Small-4scale | InternImage-Small | IN1k | 12 | 100 | 53.6 | model |
| DINO-InternImage-Base-4scale | InternImage-Base | IN1k | 12 | 100 | 54.7 | model |
| DINO-InternImage-Large-4scale | InternImage-Large | IN22k | 12 | 100 | 57.0 | model |
### Pretrained DINO with EVA Backbone

| Name | Backbone | Pretrain | Epochs | Denoising Queries | box AP | download |
| ---- | -------- | -------- | ------ | ----------------- | ------ | -------- |
| DINO-EVA-01 | EVA-01 | o365 | 12 | 100 | 59.1 | huggingface |

Note:

- Swin-X-384 means the backbone was pretrained at 384 x 384 resolution, and "IN22k to IN1k" means the model was pretrained on ImageNet-22k and finetuned on ImageNet-1k.
- The ViT backbones use MAE pretraining weights following ViTDet, which can be downloaded from the MAE repo. Training ViTDet-DINO is unstable without a warmup LR scheduler.
- Focal-LRF-3Level means using a Large Receptive Field (LRF) with the focal level set to 3; please refer to FocalNet for more details on the backbone settings.
- "with AMP" means mixed-precision training.
- "with EMA" means training with an Exponential Moving Average (EMA) of the model weights.
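The EMA trick can be sketched in a few lines: a shadow copy of the weights is updated as a running average after every optimizer step, and the smoothed weights are used for evaluation. Below is a minimal framework-agnostic sketch; the decay value is illustrative, not the exact value detrex uses.

```python
class ModelEMA:
    """Keep an exponential moving average (EMA) of parameter values.

    After each training step, call update() with the current parameters;
    the shadow values track a smoothed version of the weights.
    """

    def __init__(self, params, decay=0.999):
        self.decay = decay
        # Initialize the shadow copy from the starting parameter values.
        self.shadow = {name: float(value) for name, value in params.items()}

    def update(self, params):
        for name, value in params.items():
            # shadow = decay * shadow + (1 - decay) * current
            self.shadow[name] = (
                self.decay * self.shadow[name] + (1.0 - self.decay) * float(value)
            )

# Toy usage: a single scalar "weight" that jumps to 1.0 during training.
ema = ModelEMA({"w": 0.0}, decay=0.9)
for step_value in [1.0, 1.0, 1.0]:
    ema.update({"w": step_value})
print(round(ema.shadow["w"], 3))  # 0.271 -- smoothed toward 1.0
```

With a decay close to 1 (e.g. 0.999), the shadow weights average over many recent steps, which typically reduces the variance of the final checkpoint.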

Notable facts and caveats: the position embedding of DINO in detrex differs from the original repo. We set the temperature and offset in `PositionEmbeddingSine` to 10000 and -0.5, which may make the model converge slightly faster in the early stage and yields slightly better results (about 0.1 mAP) in the 12-epoch setting.
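For context, a DETR-style sine position embedding maps a normalized coordinate to interleaved sin/cos features whose frequencies follow a geometric schedule controlled by the temperature. The pure-Python sketch below only illustrates the roles of the temperature and the -0.5 offset; the exact `(index + offset) / length * scale` normalization is an assumption about the detrex implementation, not a verbatim copy of it.

```python
import math

def sine_position_embedding(index, length, num_pos_feats=8,
                            temperature=10000.0, offset=-0.5):
    """1-D DETR-style sine embedding for a pixel index along one axis.

    The index is shifted by `offset`, normalized by the axis length, and
    scaled to [0, 2*pi) before the sin/cos expansion.
    """
    scale = 2.0 * math.pi
    pos = (index + offset) / length * scale
    embedding = []
    for i in range(num_pos_feats):
        # Geometric frequency schedule controlled by the temperature:
        # consecutive (sin, cos) pairs share the same frequency.
        dim_t = temperature ** (2 * (i // 2) / num_pos_feats)
        value = pos / dim_t
        embedding.append(math.sin(value) if i % 2 == 0 else math.cos(value))
    return embedding

emb = sine_position_embedding(index=1, length=4)
print(len(emb))  # 8
```

A larger temperature stretches the low-frequency end of the schedule, while the -0.5 offset centers each pixel's coordinate on the middle of its cell rather than its edge.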

## Training

All configs can be trained with:

```shell
cd detrex
python tools/train_net.py --config-file projects/dino/configs/path/to/config.py --num-gpus 8
```

By default, we use 8 GPUs with a total batch size of 16 for training.
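If you train with a different number of GPUs or a different total batch size, the learning rate usually needs adjusting as well. A common heuristic is linear scaling; this is a general convention, not an official detrex recommendation, and the base learning rate below is illustrative.

```python
def scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Linear learning-rate scaling: lr grows in proportion to the
    total batch size relative to the reference setup."""
    return base_lr * new_batch_size / base_batch_size

# Example: halving the reference batch size of 16 halves the lr.
# The base_lr value 1e-4 is a placeholder, not the exact config value.
print(scaled_lr(base_lr=1e-4, base_batch_size=16, new_batch_size=8))  # 5e-05
```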

## Evaluation

Model evaluation can be done as follows:

```shell
cd detrex
python tools/train_net.py --config-file projects/dino/configs/path/to/config.py --eval-only train.init_checkpoint=/path/to/model_checkpoint
```

## Citing DINO

If you find our work helpful for your research, please consider citing the following BibTeX entry.

```BibTeX
@misc{zhang2022dino,
      title={DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection},
      author={Hao Zhang and Feng Li and Shilong Liu and Lei Zhang and Hang Su and Jun Zhu and Lionel M. Ni and Heung-Yeung Shum},
      year={2022},
      eprint={2203.03605},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```