(WIP) toy implementations of transformer models, for my own self-enlightenment.
inspired by minGPT, i wanted to implement some basic transformer models "from scratch" by following the original papers. this is a purely academic challenge for myself, a practicum of sorts for my transformer reading, and the code is not intended for real-world use. while the original papers are my primary reference, i am checking my code against other implementations to make sure i am not totally off-base, and modifying as necessary. that said, i do not guarantee the accuracy of my implementation, and any implementation errors are my own.
currently implemented:
- multi-head attention
- sinusoidal positional encoding (see the sketch after the to-do list below)
- basic transformer encoder layer, from Attention is All You Need
- basic transformer decoder layer, from Attention is All You Need
- sentencepiece-based tokenizer
- transformer seq2seq model, from Attention is All You Need
- example seq2seq training (notebook)
- example seq2seq inference with beam search (notebook)
- GPT-style decoder and example, from Improving Language Understanding by Generative Pre-Training
- BERT-style encoder and example, from BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- ALBERT-style encoder and example, from ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
still to do:
- newer positional embeddings besides learned and sinusoidal
- proper training, eval scripts
- tensorboard integration
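for reference, here is a minimal sketch of the sinusoidal positional encoding from Attention is All You Need. it is an illustrative version, not necessarily identical to the implementation in this repo, and it assumes an even `d_model`:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Build a (max_len, d_model) table of sinusoidal position encodings.

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    # inverse frequencies 1 / 10000^(2i / d_model), one per sin/cos pair
    inv_freq = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * inv_freq)  # even dimensions
    pe[:, 1::2] = torch.cos(position * inv_freq)  # odd dimensions
    return pe

# the table is added to the (scaled) token embeddings before the first layer, e.g.:
# x = token_embeddings + sinusoidal_positional_encoding(x.size(1), x.size(2))
```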
this is being developed with the following environment:
- python 3.7.11
- pytorch 1.7.1
- cuda 11.0

training is done on an RTX 2080 Ti. see `requirements.txt` for other required packages.
the `SentencePieceTokenizer` is a pickleable (tested with `dill`) class that wraps sentencepiece. it has `fit()`, `fit_on_files()`, `transform()`, and `inverse_transform()` methods for fitting on a list of sentences or on one or more sentence-split text files, transforming a list of strings into a padded numpy array of token ids plus an array of lengths, and transforming numpy arrays of token ids back into text or readable tokens.
```python
>>> tokenizer = SentencePieceTokenizer()
>>> tokenizer.fit_on_files(["data/aeneid.txt", "data/iliad.txt", "data/odyssey.txt"], vocab_size=8000, character_coverage=0.9999)
8000
>>> ids, lens = tokenizer.transform(lines[:8], max_len=100, as_array=True)
>>> print(type(ids).__name__, ids.shape)
ndarray (8, 100)
>>> tokenizer.inverse_transform(ids)
['BOOK I',
'THE LANDING NEAR CARTHAGE',
'Arms and the man I sing, the first who came,',
'Compelled by fate, an exile out of Troy,',
'To Italy and the Lavinian coast,',
'Much buffeted on land and on the deep',
'By violence of the gods, through that long rage,',
'That lasting hate, of Juno’s. And he suffered']
>>> tokenizer.export_model("data/_test.model")
True
>>> tokenizer2 = SentencePieceTokenizer()
>>> tokenizer2.load_model("data/_test.model")
>>> tokenizer2.tokenize_as_string(["hello, world!", "this is a test"])
[['▁hell', 'o', ',', '▁world', '!'], ['▁this', '▁is', '▁a', '▁test']]
>>> import pickle
>>> pickle.dump(tokenizer, open("data/test.tokenizer", "wb"))
>>> tokenizer3 = pickle.load(open("data/test.tokenizer", "rb"))
>>> tokenizer3.tokenize_as_string(["hello, world!", "this is a test"])
[['▁hell', 'o', ',', '▁world', '!'], ['▁this', '▁is', '▁a', '▁test']]
```
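the padded array and lengths returned by `transform()` can be converted into the tensors and key-padding masks that the attention layers expect. a minimal sketch follows; the helper name `to_batch` is illustrative and not part of the tokenizer API, and it assumes that everything at or beyond each sequence's length is padding:

```python
import numpy as np
import torch

def to_batch(ids: np.ndarray, lens: np.ndarray, device: str = "cpu"):
    """Turn tokenizer output (padded id array + lengths) into torch tensors.

    Returns (ids, lengths, pad_mask); pad_mask is True at padded positions,
    matching the convention of torch.nn.Transformer's *_key_padding_mask args.
    """
    ids_t = torch.as_tensor(ids, dtype=torch.long, device=device)    # (batch, max_len)
    lens_t = torch.as_tensor(lens, dtype=torch.long, device=device)  # (batch,)
    positions = torch.arange(ids_t.size(1), device=ids_t.device).unsqueeze(0)  # (1, max_len)
    pad_mask = positions >= lens_t.unsqueeze(1)                                 # (batch, max_len)
    return ids_t, lens_t, pad_mask

# continuing the example above:
# batch_ids, batch_lens, batch_pad_mask = to_batch(ids, lens)
```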
references:
- Attention is All You Need
- Improving Language Understanding by Generative Pre-Training
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
- On Layer Normalization in the Transformer Architecture
- pytorch Transformer documentation
- The Annotated Transformer
- Transformers from scratch
- The Illustrated Transformer
- Transformers Explained Visually (Part 3): Multi-head Attention, deep dive
- How to code The Transformer in Pytorch
- github: wzlxjtu/PositionalEncoding2D