DistilCLIP is a from-scratch implementation of a CLIP-like model that uses a Vision Transformer (ViT) as the image encoder and a pretrained DistilBERT as the text encoder. The model was trained on the Naruto BLIP Captions dataset for 25 epochs to learn multimodal representations of anime images and their corresponding captions.
DistilCLIP is inspired by OpenAI's CLIP but replaces the original encoders with a Vision Transformer (ViT) for images and DistilBERT for text, allowing for efficient and scalable multimodal learning.
- Image Encoder: Vision Transformer (ViT)
- Text Encoder: Pretrained DistilBERT
- Multimodal Projection: Linear layers that project both image and text features into a shared latent space, following a CLIP-like architecture.
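The dual-encoder layout described above can be summarized with the following minimal PyTorch sketch. It is illustrative only, not the exact code used in this repository: the module names, argument names, and dimensions are assumptions, and the image encoder is passed in as a generic nn.Module standing in for the from-scratch ViT.

```python
# A minimal sketch of the dual-encoder layout, assuming hypothetical module
# and argument names; not the exact code used in this repository.
import torch.nn as nn
import torch.nn.functional as F
from transformers import DistilBertModel

class ProjectionHead(nn.Module):
    """Linear layer projecting encoder features into the shared latent space."""
    def __init__(self, in_dim: int, embed_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, x):
        return self.proj(x)

class DistilCLIP(nn.Module):
    def __init__(self, image_encoder: nn.Module, image_dim: int = 768,
                 text_dim: int = 768, embed_dim: int = 256):
        super().__init__()
        self.image_encoder = image_encoder                    # from-scratch ViT
        self.text_encoder = DistilBertModel.from_pretrained(  # pretrained DistilBERT
            "distilbert-base-uncased")
        self.image_proj = ProjectionHead(image_dim, embed_dim)
        self.text_proj = ProjectionHead(text_dim, embed_dim)

    def forward(self, images, input_ids, attention_mask):
        img_feats = self.image_encoder(images)                # (B, image_dim)
        txt_out = self.text_encoder(input_ids=input_ids,
                                    attention_mask=attention_mask)
        txt_feats = txt_out.last_hidden_state[:, 0]           # first-token embedding
        img_emb = F.normalize(self.image_proj(img_feats), dim=-1)
        txt_emb = F.normalize(self.text_proj(txt_feats), dim=-1)
        return img_emb, txt_emb
```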
"jxie/flickr8k" on HuggingFace
The Flickr 8K Dataset consists of general image and caption pairs.
- Number of Images: 1,224
- Caption Format: Descriptive text describing the scene, characters, or context of the image.
- Data Augmentation: Minimal augmentations were applied to images, including resizing and normalization.
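As a rough illustration of the resize-and-normalize preprocessing listed above, the snippet below loads the jxie/flickr8k dataset with the Hugging Face datasets library and applies the transforms. The column names ("image", "caption_0"), the split name, and the ImageNet normalization statistics are assumptions and may differ from the repository's actual pipeline.

```python
# A rough sketch of dataset loading and the resize/normalize preprocessing.
# Column names, split name, and normalization statistics are assumptions.
from datasets import load_dataset
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                    # resize to the ViT input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics (assumed)
                         std=[0.229, 0.224, 0.225]),
])

dataset = load_dataset("jxie/flickr8k", split="train")

def prepare(example):
    """Return one (image tensor, caption string) pair."""
    image = preprocess(example["image"].convert("RGB"))
    caption = example["caption_0"]
    return image, caption
```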
The model was trained for 25 epochs on the Naruto BLIP Captions Dataset.
Key details:
- Optimizer: AdamW
- Learning Rate: 5e-5 (with a cosine annealing schedule)
- Batch Size: 64
- Loss Function: Contrastive loss, which aligns matching image-text pairs in the shared latent space
- Training Hardware: Tesla T4 GPU
- Epochs: 25 (the model is still underfitted; more epochs are needed)
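The contrastive objective above is the CLIP-style symmetric cross-entropy over image-text similarities. The sketch below shows that formulation under a few assumptions: L2-normalized embeddings and a fixed temperature of 0.07 (the repository's implementation may differ, for example by learning the temperature).

```python
# A minimal sketch of a CLIP-style symmetric contrastive loss, assuming
# L2-normalized embeddings and a fixed temperature; the exact objective
# used in this repository may differ.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """img_emb, txt_emb: (B, D) L2-normalized embeddings for matching pairs."""
    logits = img_emb @ txt_emb.t() / temperature                  # (B, B) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)                   # image -> caption
    loss_t2i = F.cross_entropy(logits.t(), targets)               # caption -> image
    return (loss_i2t + loss_t2i) / 2
```

With the hyperparameters listed above, the optimizer and schedule would correspond to torch.optim.AdamW with a learning rate of 5e-5 combined with torch.optim.lr_scheduler.CosineAnnealingLR.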
To set up the DistilCLIP project, follow these steps:
- Clone the repository:
  git clone https://github.com/sidmanale643/Distil-CLIP.git
  cd Distil-CLIP
- Install the required dependencies:
  pip install -r requirements.txt
Contributions to DistilCLIP are welcome! Please feel free to submit a Pull Request.
- OpenAI for the original CLIP model
- Hugging Face for the DistilBERT implementation
- Lambda Labs for the Naruto BLIP Captions dataset
For any questions or feedback, please open an issue in this repository or contact [email protected].