(WIP) toy implementations of transformer models, for my own self-enlightenment.
inspired by minGPT, i wanted to implement some basic transformer models "from scratch" by following the original papers. this is a purely academic challenge for myself, a practicum of sorts for my transformer reading, and the code is not intended for real-world use. while the original papers are my primary reference, i am checking my code against other implementations to make sure i am not totally off-base, and modifying as necessary. that said, i do not guarantee the accuracy of my implementation, and any implementation errors are my own.
currently implemented:
- multi-head attention
- sinusoidal positional encoding (see the sketch after the to-do list below)
- basic transformer encoder layer, from Attention is All You Need
- basic transformer decoder layer, from Attention is All You Need
- sentencepiece-based tokenizer
- transformer seq2seq model, from Attention is All You Need
- example seq2seq training (notebook)
- example seq2seq inference with beam search (notebook)
- GPT-style decoder and example, from Improving Language Understanding by Generative Pre-Training
- BERT-style encoder and example, from BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- ALBERT-style encoder and example, from ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
still to do:
- newer positional embeddings besides learned and sinusoidal
- proper training, eval scripts
- tensorboard integration
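for reference, here is a minimal sketch of the sinusoidal positional encoding from Attention is All You Need. it is an illustrative version, not necessarily identical to the implementation in this repo, and it assumes an even `d_model`:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Build a (max_len, d_model) table of sinusoidal position encodings.

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    # inverse frequencies 1 / 10000^(2i / d_model), one per sin/cos pair
    inv_freq = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * inv_freq)  # even dimensions
    pe[:, 1::2] = torch.cos(position * inv_freq)  # odd dimensions
    return pe

# the table is added to the (scaled) token embeddings before the first layer, e.g.:
# x = token_embeddings + sinusoidal_positional_encoding(x.size(1), x.size(2))
```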
this is being developed with the following environment:
- python 3.7.11
- pytorch 1.7.1
- cuda 11.0

training is done on an RTX 2080 Ti. see `requirements.txt` for other required packages.
the `SentencePieceTokenizer` is a pickleable (tested with `dill`) class that wraps sentencepiece. it has `fit()`, `fit_on_files()`, `transform()`, and `inverse_transform()` methods for fitting on a list of sentences or on one or more sentence-split text files, transforming a list of strings into a padded numpy array of token ids plus an array of lengths, and transforming numpy arrays of token ids back into text or readable tokens.
```python
>>> tokenizer = SentencePieceTokenizer()
>>> tokenizer.fit_on_files(["data/aeneid.txt", "data/iliad.txt", "data/odyssey.txt"], vocab_size=8000, character_coverage=0.9999)
8000
>>> ids, lens = tokenizer.transform(lines[:8], max_len=100, as_array=True)
>>> print(type(ids).__name__, ids.shape)
ndarray (8, 100)
>>> tokenizer.inverse_transform(ids)
['BOOK I',
'THE LANDING NEAR CARTHAGE',
'Arms and the man I sing, the first who came,',
'Compelled by fate, an exile out of Troy,',
'To Italy and the Lavinian coast,',
'Much buffeted on land and on the deep',
'By violence of the gods, through that long rage,',
'That lasting hate, of Juno’s. And he suffered']
>>> tokenizer.export_model("data/_test.model")
True
>>> tokenizer2 = SentencePieceTokenizer()
>>> tokenizer2.load_model("data/_test.model")
>>> tokenizer2.tokenize_as_string(["hello, world!", "this is a test"])
[['▁hell', 'o', ',', '▁world', '!'], ['▁this', '▁is', '▁a', '▁test']]
>>> import pickle
>>> pickle.dump(tokenizer, open("data/test.tokenizer", "wb"))
>>> tokenizer3 = pickle.load(open("data/test.tokenizer", "rb"))
>>> tokenizer3.tokenize_as_string(["hello, world!", "this is a test"])
[['▁hell', 'o', ',', '▁world', '!'], ['▁this', '▁is', '▁a', '▁test']]
```
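the padded array and lengths returned by `transform()` can be converted into the tensors and key-padding masks that the attention layers expect. a minimal sketch follows; the helper name `to_batch` is illustrative and not part of the tokenizer API, and it assumes that everything at or beyond each sequence's length is padding:

```python
import numpy as np
import torch

def to_batch(ids: np.ndarray, lens: np.ndarray, device: str = "cpu"):
    """Turn tokenizer output (padded id array + lengths) into torch tensors.

    Returns (ids, lengths, pad_mask); pad_mask is True at padded positions,
    matching the convention of torch.nn.Transformer's *_key_padding_mask args.
    """
    ids_t = torch.as_tensor(ids, dtype=torch.long, device=device)    # (batch, max_len)
    lens_t = torch.as_tensor(lens, dtype=torch.long, device=device)  # (batch,)
    positions = torch.arange(ids_t.size(1), device=ids_t.device).unsqueeze(0)  # (1, max_len)
    pad_mask = positions >= lens_t.unsqueeze(1)                                 # (batch, max_len)
    return ids_t, lens_t, pad_mask

# continuing the example above:
# batch_ids, batch_lens, batch_pad_mask = to_batch(ids, lens)
```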
references:
- Attention is All You Need
- Improving Language Understanding by Generative Pre-Training
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
- On Layer Normalization in the Transformer Architecture
- pytorch Transformer documentation
- The Annotated Transformer
- Transformers from scratch
- The Illustrated Transformer
- Transformers Explained Visually (Part 3): Multi-head Attention, deep dive
- How to code The Transformer in Pytorch
- github: wzlxjtu/PositionalEncoding2D