VLMo - General-purpose Multimodal Pre-training

Paper: VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts.

Official PyTorch implementation and pre-trained models of VLMo.

  • Dec 2022: released code and pre-trained models.
  • Sep 2022: VLMo was accepted to NeurIPS 2022.
  • May 30th, 2022: new version of the VLMo paper on arXiv.
  • Nov 24th, 2021: VLMo-Large (single model) set a new SOTA on the VQA Challenge.
  • Nov 2021: released the preprint on arXiv.

Pre-trained Models

We provide three VLMo models pre-trained on COCO, VG, SBU, and GCC. All models were pre-trained at 224x224 resolution.

  • VLMo-base: #layer=12; hidden=768; FFN factor=4x; #head=12; patch=16x16; #VL_FFN=2 (#parameters: 175M)
  • VLMo-base_plus: #layer=24; hidden=544; FFN factor=4x; #head=16; patch=16x16; #VL_FFN=3 (#parameters: 167M)
  • VLMo-large: #layer=24; hidden=1024; FFN factor=4x; #head=16; patch=16x16; #VL_FFN=3 (#parameters: 562M)
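For reference, the same hyperparameters expressed as a plain Python mapping. This is an illustrative summary only; the key names below are descriptive and do not correspond to the repo's actual config fields.

# Illustrative summary of the released model configurations.
# Key names are descriptive only and do not match the repo's config system.
VLMO_MODELS = {
    "VLMo-base":      {"layers": 12, "hidden": 768,  "ffn_factor": 4, "heads": 12, "patch": 16, "vl_ffn": 2, "params": "175M"},
    "VLMo-base_plus": {"layers": 24, "hidden": 544,  "ffn_factor": 4, "heads": 16, "patch": 16, "vl_ffn": 3, "params": "167M"},
    "VLMo-large":     {"layers": 24, "hidden": 1024, "ffn_factor": 4, "heads": 16, "patch": 16, "vl_ffn": 3, "params": "562M"},
}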

Setup

alias=`whoami | cut -d'.' -f2`; docker run -it --rm --runtime=nvidia --ipc=host --privileged -v /home/${alias}:/home/${alias} pytorch/pytorch:1.8.0-cuda11.1-cudnn8-devel bash

First, clone the repo and install required packages:

git clone https://github.com/microsoft/unilm.git
cd unilm/vlmo

pip install -r requirements.txt
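As a quick sanity check (not part of the original instructions), you can confirm that the container sees a CUDA-enabled PyTorch build before moving on:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"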

Dataset Preparation

We process the pre-training and fine-tuning data into the same pyarrow (.arrow) format as ViLT.
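The actual conversion scripts live in the ViLT repo; the sketch below only illustrates the general shape of such an .arrow file written with pyarrow. The column names ("image", "caption", "image_id", "split") and the sample layout are assumptions based on ViLT's scripts, so follow ViLT's conversion code for the exact schema expected under data_root=<ARROW_ROOT>.

# Minimal sketch of writing a ViLT-style .arrow file.
# Column names are assumed; use the ViLT conversion scripts for the exact schema.
import pandas as pd
import pyarrow as pa

def write_arrow(samples, out_path):
    # samples: iterable of (image_path, list_of_captions, image_id, split)
    rows = []
    for image_path, captions, image_id, split in samples:
        with open(image_path, "rb") as f:
            rows.append([f.read(), captions, image_id, split])
    df = pd.DataFrame(rows, columns=["image", "caption", "image_id", "split"])
    table = pa.Table.from_pandas(df)
    with pa.OSFile(out_path, "wb") as sink:
        with pa.RecordBatchFileWriter(sink, table.schema) as writer:
            writer.write_table(table)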

Pre-training

Replace <ARROW_ROOT> with your data directory in the following commands.

Step 1: Vision Pre-Training

Download the pre-trained model weights from the BEiT repo.
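For example, the BEiT-base checkpoint used in Step 2 can be fetched with wget (URL taken from the comment in the next step; adjust the target directory to your setup):

wget -P /path/to/save https://conversationhub.blob.core.windows.net/beit-share-public/beit/beit_base_patch16_224_pt22k_ft22kto1k.pth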

Step 2: Language Pre-Training (VLMo-Base)

# download from https://conversationhub.blob.core.windows.net/beit-share-public/beit/beit_base_patch16_224_pt22k_ft22kto1k.pth
export INIT_CKPT=/path/to/save/beit_base_checkpoint

python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_textmlm_base whole_word_masking=True step200k per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=$INIT_CKPT log_dir=<YOUR_OUTPUT_PATH>

Alternatively, you can download our pre-trained checkpoints for this stage.

Step 3: Vision-Language Pre-Training (VLMo-Base)

export INIT_CKPT=/path/to/save/last_stage_ckpt

python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_mlm_itm_itc_base whole_word_masking=True step200k per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=$INIT_CKPT log_dir=<YOUR_OUTPUT_PATH>

Fine-Tuning on Downstream Tasks

Commands

python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> "<CONFIG_NAME>" per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path="<VLMo_WEIGHT>" log_dir=<YOUR_OUTPUT_PATH>

To reduce GPU memory cost, use DeepSpeed and activation checkpointing.
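For example, a VQAv2 fine-tuning run of VLMo-base would look like the following, using the config name listed under Configs below; num_gpus, num_nodes, and per_gpu_batchsize are placeholder values to adjust for your hardware, and the paths are yours to fill in:

python run.py with data_root=<ARROW_ROOT> num_gpus=8 num_nodes=1 task_finetune_vqa_base_image480 per_gpu_batchsize=4 load_path="<VLMo_WEIGHT>" log_dir=<YOUR_OUTPUT_PATH>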

Configs

You can find the <CONFIG_NAME> for each task below:

VQAv2

| <CONFIG_NAME> | initialized checkpoint | finetuned weight | test-dev |
| --- | --- | --- | --- |
| task_finetune_vqa_base_image480 | VLMo-base | weight | 76.6 |
| task_finetune_vqa_base_plus_image480 | VLMo-base_plus | weight | 78.5 |
| task_finetune_vqa_large_image480 | VLMo-large | weight | 79.9 |

NLVR2

| <CONFIG_NAME> | initialized checkpoint | finetuned weight | test-P |
| --- | --- | --- | --- |
| task_finetune_nlvr2_base_image384 | VLMo-base | weight | 83.3 |
| task_finetune_nlvr2_base_plus_image384 | VLMo-base_plus | weight | 85.1 |
| task_finetune_nlvr2_large_image384 | VLMo-large | weight | 86.9 |

COCO

| <CONFIG_NAME> | initialized checkpoint | finetuned weight | TR@1 | IR@1 |
| --- | --- | --- | --- | --- |
| task_finetune_irtr_coco_base_image384 | VLMo-base | weight | 74.8 | 57.2 |
| task_finetune_irtr_coco_base_plus_image384 | VLMo-base_plus | weight | 76.3 | 58.6 |
| task_finetune_irtr_coco_large_image384 | VLMo-large | weight | 78.2 | 60.6 |

F30K

| <CONFIG_NAME> | initialized checkpoint | finetuned weight | TR@1 | IR@1 |
| --- | --- | --- | --- | --- |
| task_finetune_irtr_f30k_base_image384 | VLMo-base_coco_finetuned | weight | 92.3 | 79.3 |
| task_finetune_irtr_f30k_base_plus_image384 | VLMo-base_plus | weight | 93.2 | 81.8 |
| task_finetune_irtr_f30k_large_image384 | VLMo-large_coco_finetuned | weight | 95.3 | 84.5 |

Evaluation

To evaluate a fine-tuned model, append test_only=True and set load_path to the fine-tuned VLMo weight, as follows:

python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=1 "<CONFIG_NAME>" per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path="<Finetuned_VLMo_WEIGHT>" test_only=True
  • For retrieval tasks, also set get_recall_metric=True in the command.
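For example, evaluating a fine-tuned COCO retrieval model with recall metrics (config name taken from the COCO table above; batch size and checkpoint path are yours to fill in):

python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=1 task_finetune_irtr_coco_base_image384 per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path="<Finetuned_VLMo_WEIGHT>" test_only=True get_recall_metric=True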

Acknowledgement

This repository is built using the ViLT and BEiT repositories, ALBEF, and the timm library.

Citation

If you find this repository useful, please consider citing our work:

@inproceedings{vlmo,
      title={{VLMo}: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts},
      author={Hangbo Bao and Wenhui Wang and Li Dong and Qiang Liu and Owais Khan Mohammed and Kriti Aggarwal and Subhojit Som and Songhao Piao and Furu Wei},
      booktitle={Advances in Neural Information Processing Systems},
      year={2022},
      url={https://openreview.net/forum?id=bydKs84JEyw}
}

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using VLMo models, please submit a GitHub issue.