
toy transformers

(WIP) toy implementations of transformer models, for my own self-enlightenment.

about

inspired by minGPT, i wanted to implement some basic transformer models "from scratch" by following the original papers. this is a purely academic challenge for myself, a practicum of sorts for my transformer reading, and the code is not intended for any real-world application. while the original papers are my primary reference, i am checking my code against other implementations to make sure i am not totally off-base, and modifying as necessary. that said, i do not guarantee the accuracy of my implementations, and any errors are my own.

current status & plans

environment

this is being developed with the following environment:

  • python 3.7.11
  • pytorch 1.7.1 for cuda 11.0

training is done on an RTX 2080 Ti

see requirements.txt for other required packages
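
a quick sanity check of the setup (a sketch; the outputs shown are what i'd expect on this machine, and exact version strings may differ by install, e.g. pip cuda builds may report 1.7.1+cu110):

>>> import sys, torch
>>> sys.version.split()[0]

'3.7.11'

>>> torch.__version__, torch.version.cuda, torch.cuda.is_available()

('1.7.1', '11.0', True)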

how to use

tokenizer

the SentencePieceTokenizer class is a pickleable (tested with dill) wrapper around sentencepiece. it provides:

  • fit() and fit_on_files() for fitting on a list of sentences or on one or more sentence-split text file(s)
  • transform() for converting a list of input strings into a padded numpy array of token ids plus an array of lengths
  • inverse_transform() for converting numpy arrays of indexed tokens back into text or readable tokens

>>> tokenizer = SentencePieceTokenizer()
>>> tokenizer.fit_on_files(["data/aeneid.txt", "data/iliad.txt", "data/odyssey.txt"], vocab_size=8000, character_coverage=0.9999)

8000
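
fitting on an in-memory list of sentences should work the same way (a sketch; i'm assuming fit() accepts the same keyword arguments as fit_on_files()). this also defines the lines list used in the examples below:

>>> lines = [line.strip() for line in open("data/aeneid.txt", encoding="utf-8")]
>>> tokenizer.fit(lines, vocab_size=8000, character_coverage=0.9999)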

>>> ids, lens = tokenizer.transform(lines[:8], max_len=100, as_array=True)
>>> print(type(ids).__name__, ids.shape)

ndarray (8, 100)
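
the returned lengths can be used to build a key padding mask for attention (a sketch in pytorch, assuming lens is the numpy array of lengths returned above; True marks padded positions, following the key_padding_mask convention of torch.nn.Transformer):

>>> import torch
>>> positions = torch.arange(ids.shape[1])
>>> pad_mask = positions[None, :] >= torch.from_numpy(lens)[:, None]
>>> pad_mask.shape

torch.Size([8, 100])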

>>> tokenizer.inverse_transform(ids)

['BOOK I',
 'THE LANDING NEAR CARTHAGE',
 'Arms and the man I sing, the first who came,',
 'Compelled by fate, an exile out of Troy,',
 'To Italy and the Lavinian coast,',
 'Much buffeted on land and on the deep',
 'By violence of the gods, through that long rage,',
 'That lasting hate, of Juno’s. And he suffered']
 
>>> tokenizer.export_model("data/_test.model")

True

>>> tokenizer2 = SentencePieceTokenizer()
>>> tokenizer2.load_model("data/_test.model")
>>> tokenizer2.tokenize_as_string(["hello, world!", "this is a test"])

[['▁hell', 'o', ',', '▁world', '!'], ['▁this', '▁is', '▁a', '▁test']]

>>> import dill as pickle
>>> pickle.dump(tokenizer, open("data/test.tokenizer", "wb"))
>>> tokenizer3 = pickle.load(open("data/test.tokenizer", "rb"))
>>> tokenizer3.tokenize_as_string(["hello, world!", "this is a test"])

[['▁hell', 'o', ',', '▁world', '!'], ['▁this', '▁is', '▁a', '▁test']]
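
a quick check that unpickling preserved behavior:

>>> tokenizer3.tokenize_as_string(["hello, world!"]) == tokenizer.tokenize_as_string(["hello, world!"])

True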

references

papers

Attention is All You Need
Improving Language Understanding by Generative Pre-Training
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
On Layer Normalization in the Transformer Architecture

reference implementations and articles

pytorch Transformer documentation
The Annotated Transformer
Transformers from Scratch
The Illustrated Transformer
Transformers Explained Visually (Part 3): Multi-head Attention, deep dive
How to code The Transformer in Pytorch
github: wzlxjtu/PositionalEncoding2D
