Open-LLaVA-NeXT

An open-source implementation of the LLaVA-NeXT series, built to support the large multi-modal model community.

Resources: [🤗HuggingFace]

💡 Highlights

  • 🔥 All training data and checkpoints for each stage are open-sourced, ready for research use.
  • 🔥 Reproduces the results of LLaVA-NeXT.
  • 🔥 Built on the LLaVA codebase with minimal modifications, making it easy to follow.

🤖 Model Zoo

See more details in ModelZoo.md.

| Name | ViT | LLM | Weights | MME | SEED | SQA | MMB | MMB-CN | TextVQA | GQA |
|---|---|---|---|---|---|---|---|---|---|---|
| llava-next-vicuna-7b | CLIP-L-336 | Vicuna-7B | SFT | 1519 | 70.2 | 70.1 | 67.4 | 60.6 | 64.9 | 64.2 |
| open-llava-next-vicuna-7b | CLIP-L-336 | Vicuna-7B | PT, SFT | 1540 | 71.1 | 70.7 | 68.5 | 60.7 | 67.2 | 64.3 |
| llava-next-llama3-8b | CLIP-L-336 | LLaMA3-8B | SFT | 1591 | 72.7 | 73.4 | 72.6 | 69.0 | 65.0 | 65.5 |
| open-llava-next-llama3-8b | CLIP-L-336 | LLaMA3-8B | PT, SFT | 1552 | 74.4 | 77.3 | 74.4 | 70.4 | 69.8 | 65.9 |

👨‍💻 ToDo

  • Reproduce LLaVA-NeXT-LLaMA3-8B
  • Integrate VLMEvalKit for convenient evaluation

🔧 Install

  1. Clone this repository and navigate to the Open-LLaVA-NeXT folder
git clone https://github.com/xiaoachen98/Open-LLaVA-NeXT.git
cd Open-LLaVA-NeXT
  2. Install the package
conda create -n llava-next python=3.10 -y
conda activate llava-next
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

Data Preparation

Follow the instructions in Data.md to prepare the training datasets.

Training Overview

Open-LLaVA-NeXT training consists of two stages: (1) feature alignment stage: use a 558K subset of the LAION-CC-SBU dataset to connect a frozen pretrained vision encoder to a frozen LLM; (2) visual instruction tuning stage: finetune the entire model on 1M completely open-source data. Detailed data statistics are provided in Visual Instruction Tuning. We take the Vicuna-v1.5-7B variant as an example to present the training and evaluation details.

The Open-LLaVA-NeXT series is trained on A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly; utilizing DeepSpeed ZeRO-3 can further reduce the memory requirements. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
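
For example, assuming the standard Hugging Face trainer flags used by the LLaVA codebase (the per-device values below are purely illustrative), the following two settings both give a global batch size of 128:

# 16 GPUs: 8 x 1 x 16 = 128
--per_device_train_batch_size 8 --gradient_accumulation_steps 1
# 8 GPUs: 8 x 2 x 8 = 128
--per_device_train_batch_size 8 --gradient_accumulation_steps 2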

Hyperparameters

We use the same set of hyperparameters as LLaVA in finetuning. The hyperparameters used in pretraining and finetuning are provided below; a rough sketch of the corresponding training flags follows the tables.

  1. Pretraining
| Model | Global Batch Size | Projector lr | Epochs | Max length | Weight decay |
|---|---|---|---|---|---|
| Open-LLaVA-NeXT-7B | 256 | 1e-3 | 1 | 4096 | 0 |
  2. Finetuning
| Model | Global Batch Size | LLM lr | Projector lr | Vision Tower lr | Epochs | Max length | Weight decay |
|---|---|---|---|---|---|---|---|
| Open-LLaVA-NeXT-7B | 128 | 2e-5 | 2e-5 | 2e-6 | 1 | 4096 | 0 |
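
As a reference, the finetuning learning rates above are passed to the training script through separate flags. The fragment below is a sketch assuming the LLaVA-style argument names (--mm_vision_tower_lr is the option described in the finetuning section below):

--learning_rate 2e-5       # LLM lr
--mm_projector_lr 2e-5     # projector lr
--mm_vision_tower_lr 2e-6  # vision tower lr
--num_train_epochs 1
--model_max_length 4096
--weight_decay 0.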

Pretrain

Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions here.

Pretraining takes around 5 hours for Open-LLaVA-NeXT-7B on 16 x A100 (80G).

Training script with DeepSpeed ZeRO-2: pretrain.sh. A rough sketch of the launch command is shown after the option list below.

  • --mm_projector_type mlp2x_gelu: the two-layer MLP vision-language connector.
  • --vision_tower openai/clip-vit-large-patch14-336: CLIP ViT-L/14 336px.
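
The sketch below is only illustrative and assumes pretrain.sh mirrors LLaVA's original pretraining script; the data paths, per-device batch size (chosen to give a global batch size of 256 on 16 GPUs), and output directory are placeholders to adapt to your setup. Only the projector is trained in this stage, so --learning_rate corresponds to the projector lr (1e-3) in the table above.

deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path lmsys/vicuna-7b-v1.5 \
    --version plain \
    --data_path /path/to/blip_laion_cc_sbu_558k.json \
    --image_folder /path/to/LLaVA-Pretrain/images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --tune_mm_mlp_adapter True \
    --bf16 True \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --learning_rate 1e-3 \
    --weight_decay 0. \
    --model_max_length 4096 \
    --output_dir ./checkpoints/open-llava-next-vicuna-7b-pretrain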

Visual Instruction Tuning

  1. Prepare data. Follow the instructions for data preparation in Data.md.
  2. Prepare MLP projectors. You may download our pretrained projectors from the Model Zoo, or specify your own MLP projector obtained after pretraining.
  3. Start training. Visual instruction tuning takes around 20 hours for Open-LLaVA-NeXT-7B on 16 x A100 (80G).

Training script with DeepSpeed ZeRO-2: finetune.sh.

New options to note (a rough sketch of the full launch command follows this list):

  • --unfreeze_mm_vision_tower True: finetune the vision tower.
  • --mm_vision_tower_lr 2e-6: learning rate of the vision tower.
  • --image_aspect_ratio anyres: process images at variable resolutions.
  • --mm_patch_merge_type spatial_unpad: unpads the padded, resized image feature tensor and inserts learnable newline vectors between rows of image tokens, making the model aware of two-dimensional spatial layout when processing image tokens.
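
As with pretraining, the sketch below is only illustrative and assumes finetune.sh follows the structure of LLaVA's finetuning script with the new options above added; the data paths, per-device batch size (chosen to give a global batch size of 128 on 16 GPUs), projector checkpoint path, and output directory are placeholders:

deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path lmsys/vicuna-7b-v1.5 \
    --version v1 \
    --data_path /path/to/sft_mixture.json \
    --image_folder /path/to/images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter ./checkpoints/open-llava-next-vicuna-7b-pretrain/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --unfreeze_mm_vision_tower True \
    --mm_vision_tower_lr 2e-6 \
    --image_aspect_ratio anyres \
    --mm_patch_merge_type spatial_unpad \
    --bf16 True \
    --learning_rate 2e-5 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 1 \
    --model_max_length 4096 \
    --output_dir ./checkpoints/open-llava-next-vicuna-7b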

Evaluation

See Evaluation.md.

Citation

If you find this project useful in your research, please consider citing:

@misc{chen2024open,
  title={Open-LLaVA-NeXT: An open-source implementation of LLaVA-NeXT series for facilitating the large multi-modal model community.},
  author={Chen, Lin and Xing, Long},
  howpublished = {\url{https://github.com/xiaoachen98/Open-LLaVA-NeXT}},
  year={2024},
  doi={10.5281/zenodo.13935471}
}

❤️ Acknowledgments

  • LLaVA: the codebase we built upon. Thanks for their brilliant contributions to the community! We just can't wait to use LLaVA-NeXT.
  • ShareGPT4V: Thanks for their code for finetuning the vision tower.
  • VLMEvalKit: the amazing open-source suite for evaluating various LMMs!