- Embedding Dimensions: 768
- Vocabulary Size: 50,257
- Sequence Length: 1,024
- Attention Heads: 8
- Decoder Blocks: 12
- Dropout: 0.1
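These settings can be collected into a single configuration object. The sketch below is one possible way to do that; field names such as `n_embd` and `block_size` are illustrative assumptions, not identifiers taken from this repository.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    n_embd: int = 768        # embedding dimensions
    vocab_size: int = 50257  # vocabulary size (GPT-2 BPE tokenizer)
    block_size: int = 1024   # maximum sequence length
    n_head: int = 8          # attention heads
    n_layer: int = 12        # decoder blocks
    dropout: float = 0.1     # dropout probability
```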
The GPT-2 model is a decoder-only transformer architecture designed for natural language processing tasks. Key components include:
- Positional Encoding: Helps the model understand the order of words in a sequence.
- Multi-Head Attention: Allows the model to focus on different parts of the input simultaneously.
- Feed-Forward Networks: Apply non-linear transformations to each token's representation.
- Layer Normalization: Stabilizes and accelerates the training process.
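The sketch below shows how these components might fit together in a single decoder block. It is a minimal PyTorch illustration under the hyperparameters listed above; class and parameter names are assumptions and do not reflect this project's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head attention with a causal mask so each token attends only to earlier tokens."""
    def __init__(self, n_embd=768, n_head=8, block_size=1024, dropout=0.1):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # joint projection to queries, keys, values
        self.proj = nn.Linear(n_embd, n_embd)      # output projection
        self.dropout = nn.Dropout(dropout)
        # lower-triangular mask enforcing causality
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape into (B, n_head, T, head_dim) so each head attends independently
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        att = self.dropout(F.softmax(att, dim=-1))
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.dropout(self.proj(y))

class Block(nn.Module):
    """One GPT-2 decoder block: pre-layer-norm attention followed by a feed-forward MLP."""
    def __init__(self, n_embd=768, n_head=8, block_size=1024, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size, dropout)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(                  # position-wise feed-forward network
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # residual connection around attention
        x = x + self.mlp(self.ln2(x))   # residual connection around the MLP
        return x
```

A block can be exercised in isolation, e.g. `Block()(torch.randn(2, 16, 768))` returns a tensor of the same shape. Positional information is added separately, at the embedding stage, before the stack of decoder blocks.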
This project implements the GPT-2 model from scratch to build a deep understanding of its inner workings. The implementation closely follows the original architecture while offering options for customization.
- Andrej Karpathy Lectures - https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&pp=iAQB
- Sebastian Raschka - https://github.com/rasbt/LLMs-from-scratch
- Original GPT-2 Paper - https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- Umar Jamil YouTube - https://www.youtube.com/watch?v=ISNdQcPhsts&t=4760s&pp=ygUKdW1hciBqYW1pbA%3D%3D