We provide four BEiT v2 weights pretrained on ImageNet-1k. All models were pretrained at 224x224 resolution.
- BEiT-base: #layer=12; hidden=768; FFN factor=4x; #head=12; patch=16x16 (#parameters: 86M)
- BEiT-large: #layer=24; hidden=1024; FFN factor=4x; #head=16; patch=16x16 (#parameters: 304M)
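These parameter counts follow from standard Transformer arithmetic. Below is a minimal sketch (ignoring biases, layer norms, the [CLS] token, and BEiT-specific relative-position tables) that roughly reproduces the 86M and 304M figures:

```python
# Rough parameter-count estimate for a ViT/BEiT-style backbone.
# This is an approximation intended only to show where the
# ~86M / ~304M figures come from, not an exact count.

def approx_params(layers, hidden, ffn_factor=4, patch=16, channels=3, classes=1000):
    patch_embed = channels * patch * patch * hidden   # patch projection
    attn = 4 * hidden * hidden                        # q, k, v, and output projections
    ffn = 2 * hidden * (ffn_factor * hidden)          # two FFN weight matrices
    head = hidden * classes                           # classification head
    return patch_embed + layers * (attn + ffn) + head

print(f"base : ~{approx_params(12, 768) / 1e6:.0f}M")   # ~86M
print(f"large: ~{approx_params(24, 1024) / 1e6:.0f}M")  # ~304M
```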
Download checkpoints that were self-supervised pretrained on ImageNet-1k and then intermediate fine-tuned on ImageNet-21k (recommended):
- BEiT-base: beitv2_base_patch16_224_pt1k_ft21k
- BEiT-large: beitv2_large_patch16_224_pt1k_ft21k
Download checkpoints that were self-supervised pretrained on ImageNet-1k:
- BEiT-base: beitv2_base_patch16_224_pt1k
- BEiT-large: beitv2_large_patch16_224_pt1k
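Once downloaded, a checkpoint can be inspected with plain PyTorch. The sketch below is illustrative only; the local file name is a placeholder, and the `model`/`module` wrapper keys are an assumption based on common PyTorch checkpoint layouts rather than a documented format:

```python
import torch

# Hypothetical local path to one of the downloaded checkpoints.
ckpt_path = "beitv2_base_patch16_224_pt1k_ft21k.pth"

# Checkpoints are often dicts wrapping the state_dict under a "model"
# (or "module") key; adjust if your file differs.
checkpoint = torch.load(ckpt_path, map_location="cpu")
state_dict = checkpoint.get("model", checkpoint.get("module", checkpoint))

print(f"{len(state_dict)} tensors")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```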
To set up the environment, you can start from a PyTorch Docker container, for example:

```bash
alias=`whoami | cut -d'.' -f2`; docker run -it --rm --runtime=nvidia --ipc=host --privileged -v /home/${alias}:/home/${alias} pytorch/pytorch:1.7.1-cuda11.0-cudnn8-devel bash
```
First, clone the repo and install required packages:
```bash
git clone https://github.com/microsoft/unilm.git
cd unilm/beit2
pip install -r requirements.txt
```
The required packages include PyTorch 1.7.1, torchvision 0.8.2, and timm 0.4.12, among others.
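To confirm the environment matches these versions, a quick sanity check can be run (a generic sketch, not part of the repo):

```python
import torch
import torchvision
import timm

# The codebase was tested with PyTorch 1.7.1, torchvision 0.8.2, and timm 0.4.12;
# other versions may work but are not guaranteed.
print("torch      :", torch.__version__)
print("torchvision:", torchvision.__version__)
print("timm       :", timm.__version__)
print("CUDA available:", torch.cuda.is_available())
```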
For mixed-precision training, please install apex:

```bash
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
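To verify that apex built with its CUDA extensions, a small check can be run afterwards (a generic sketch; `amp_C` is the extension module apex compiles when the `--cuda_ext` build succeeds):

```python
# Quick check that apex installed correctly and that the fused CUDA
# extensions from the --cuda_ext build are importable.
try:
    from apex import amp  # mixed-precision API
    import amp_C          # present only when --cuda_ext compiled successfully
    print("apex with CUDA extensions is available")
except ImportError as err:
    print(f"apex not fully available: {err}")
```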
We summarize the validation results as follows. We also provide the fine-tuned weights. The detailed instructions to reproduce the results can be found at get_started_for_image_classification.md.
name | initialized checkpoint | resolution | acc@1 | acc@5 | #params | weight |
---|---|---|---|---|---|---|
BEiTv2-base | beitv2_base_patch16_224_pt1k | 224x224 | 85.5 | 97.5 | 86.5M | link |
BEiTv2-base | beitv2_base_patch16_224_pt1k_ft21k | 224x224 | 86.5 | 98.0 | 86.5M | link |
BEiTv2-large | beitv2_large_patch16_224_pt1k | 224x224 | 87.3 | 98.2 | 304M | link |
BEiTv2-large | beitv2_large_patch16_224_pt1k_ft21k | 224x224 | 88.4 | 98.6 | 304M | link |
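For reference, the acc@1/acc@5 numbers above correspond to standard top-k accuracy on the ImageNet-1k validation set. The sketch below shows only the generic metric computation; it is not the repo's evaluation script (see get_started_for_image_classification.md for that), and `model` and `loader` are placeholders:

```python
import torch

@torch.no_grad()
def evaluate_topk(model, loader, device="cuda", ks=(1, 5)):
    """Generic top-k accuracy loop; `model` and `loader` are placeholders."""
    model.eval().to(device)
    correct = {k: 0 for k in ks}
    total = 0
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        logits = model(images)
        # Indices of the k highest-scoring classes per image.
        topk = logits.topk(max(ks), dim=1).indices
        for k in ks:
            correct[k] += (topk[:, :k] == targets[:, None]).any(dim=1).sum().item()
        total += targets.size(0)
    return {f"acc@{k}": 100.0 * correct[k] / total for k in ks}
```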
We summarize the validation results as follows. We also provide the fine-tuned weights. The detailed instructions to reproduce the results can be found at semantic_segmentation/README.md.
name | initialized checkpoint | method | crop size | iterations | mIoU | #params | weight |
---|---|---|---|---|---|---|---|
BEiTv2-base | beitv2_base_patch16_224_pt1k | UPerNet | 512x512 | 160k | 53.1 | 163M | link |
BEiTv2-base | beitv2_base_patch16_224_pt1k_ft21k | UPerNet | 512x512 | 160k | 53.5 | 163M | link |
BEiTv2-large | beitv2_large_patch16_224_pt1k | UPerNet | 512x512 | 160k | 56.7 | 441M | link |
BEiTv2-large | beitv2_large_patch16_224_pt1k_ft21k | UPerNet | 512x512 | 160k | 57.5 | 441M | link |
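For context, mIoU is the mean over classes of intersection-over-union, usually computed from a per-class confusion matrix. The sketch below shows the standard metric definition only; the numbers above come from the pipeline referenced in semantic_segmentation/README.md:

```python
import numpy as np

def mean_iou(confusion: np.ndarray) -> float:
    """confusion[i, j] counts pixels of ground-truth class i predicted as class j."""
    intersection = np.diag(confusion)                   # true positives per class
    union = confusion.sum(0) + confusion.sum(1) - intersection
    iou = intersection / np.maximum(union, 1)           # avoid division by zero
    return float(iou.mean() * 100)                      # reported in percent

# Toy 3-class example.
conf = np.array([[50, 2, 1],
                 [3, 40, 5],
                 [0, 4, 45]])
print(f"mIoU: {mean_iou(conf):.1f}")
```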
Under preparation.
See PRETRAINING.md for detailed instructions.
We provide the VQ-KD tokenizer trained on ImageNet-1k.
- vqkd_encoder_base_decoder_3x768x12: #encoder layer=12; #decoder layer=3; hidden=768; FFN factor=4x; #head=12; patch=16x16;
See TOKENIZER.md for more details.
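As background, the core operation of a vector-quantized tokenizer is mapping each patch embedding to the index of its nearest codebook entry. The sketch below illustrates that lookup step in isolation; it is a simplified sketch with illustrative tensor sizes, not the repo's VQ-KD implementation:

```python
import torch
import torch.nn.functional as F

def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Return the index of the nearest codebook entry for each feature vector.

    features: (num_patches, dim), codebook: (codebook_size, dim).
    VQ-KD-style tokenizers typically l2-normalize both sides before the lookup,
    which makes the nearest code the one with the highest cosine similarity.
    """
    features = F.normalize(features, dim=-1)
    codebook = F.normalize(codebook, dim=-1)
    return (features @ codebook.t()).argmax(dim=-1)

# Illustrative sizes only: 196 patch embeddings, 8192 codes of dimension 32.
tokens = quantize(torch.randn(196, 32), torch.randn(8192, 32))
print(tokens.shape)  # torch.Size([196])
```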
If you find this repository useful, please consider citing our work:
```bibtex
@inproceedings{beit,
  title={{BEiT}: {BERT} Pre-Training of Image Transformers},
  author={Hangbo Bao and Li Dong and Songhao Piao and Furu Wei},
  booktitle={International Conference on Learning Representations},
  year={2022},
  url={https://openreview.net/forum?id=p-BhZSz59o4}
}

@article{beitv2,
  title={{BEiT v2}: Masked Image Modeling with Vector-Quantized Visual Tokenizers},
  author={Zhiliang Peng and Li Dong and Hangbo Bao and Qixiang Ye and Furu Wei},
  year={2022},
  eprint={2208.06366},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
This repository is built using the BEiT, CLIP, DeiT, and DINO repositories and the timm library.
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Microsoft Open Source Code of Conduct
For help or issues using BEiT v2 models, please submit a GitHub issue.
For other communications, please contact Li Dong ([email protected]) and Furu Wei ([email protected]).