- Pretrain T5 v1.1: Pre-Train T5 on C4 Dataset Code. Thanks to HuggingFace, this code is simpler, more readable, and truer to Google's implementation than other available options. Whereas the T5 1.1 paper reports a loss of 1.942 at 65,536 steps, a single RTX 4090 reaches a comparable test-set loss (2.08) in roughly 18.5 hours of training with this code (see image below). Pretraining on your own dataset is as simple as swapping out the existing Dataset with your own.
- Seq2Seq (ChatBot): Fine Tune Flan-T5 on Alpaca Code
- Seq2Seq: Fine Tune Flan-T5 on Data Using HuggingFace Dataset Framework Code
Sample outputs from the base model before fine-tuning:
input sentence: Given a set of numbers, find the maximum value.
{10, 3, 25, 6, 16}
response: 25
input sentence: Convert from celsius to fahrenheit.
Temperature in Celsius: 15
response: Fahrenheit
input sentence: Arrange the given numbers in ascending order.
2, 4, 0, 8, 3
response: 0, 3, 4, 8
input sentence: What is the capital of France?
response: paris
input sentence: Name two types of desert biomes.
response: sahara
The same prompts after fine-tuning:
input sentence: Given a set of numbers, find the maximum value.
{10, 3, 25, 6, 16}
response: 25
input sentence: Convert from celsius to fahrenheit.
Temperature in Celsius: 15
response: 77
input sentence: Arrange the given numbers in ascending order.
2, 4, 0, 8, 3
response: 0, 2, 3, 4, 8
input sentence: What is the capital of France?
response: Paris
input sentence: Name two types of desert biomes.
response: Desert biomes can be divided into two main types: arid and semi-arid. Arid deserts are characterized by high levels of deforestation, sparse vegetation, and limited water availability. Semi-desert deserts, on the other hand, are relatively dry deserts with little to no vegetation.
This repository also contains start-to-finish data processing and modeling code, in PyTorch and often HuggingFace (Transformers), for the following models:
- Paper: Hierarchical Attention Networks PyTorch Implementation: Code
- Paper: BERT PyTorch Implementation: Code
- BERT-CNN Ensemble. PyTorch Implementation: Code
- Paper: Character-level CNN PyTorch Implementation: Code
- Paper: DistilBERT PyTorch Implementation: Code
- DistilGPT-2. PyTorch Implementation: Code
- Paper: Convolutional Neural Networks for Sentence Classification PyTorch Implementation: Code
- Paper: T5-Classification PyTorch Implementation: Code
- Paper: T5-Summarization PyTorch Implementation: Code
- Building a Corpus: Search Text Files. Code: Code
- Paper: Heinsen Routing TorchText Implementation: Code
- Entity Embeddings and Lazy Loading. Code: Code
- Semantic Similarity. Code: Code
- SQuAD 2.0 BERT Embeddings Emissions in PyTorch. Code: Code
- SST-5 BERT Embeddings Emissions in PyTorch. Code: Code
Credits: The Hedwig group has been instrumental in helping me learn many of these models.
Nicer-looking R Markdown outputs can be found here: http://seekinginference.com/NLP/