link to presentation: https://github.com/BenjaminBossan/presentations
- mature: first commit July 2017
- deeply integrates scikit-learn and PyTorch (but not TensorFlow etc.)
- many examples and notebooks in repository
- comprehensive docs: https://skorch.readthedocs.io
from torch import nn
from skorch import NeuralNetClassifier
from skorch.callbacks import EpochScoring
class MyModule(nn.Module):
    ...

net = NeuralNetClassifier(
    MyModule,
    max_epochs=10,
    lr=0.1,
    callbacks=[EpochScoring(scoring="roc_auc", lower_is_better=False)],
)
net.fit(X_train, y_train)
net.predict(X_test)
net.predict_proba(X_test)
We’re going to look at:
- transformers & tokenizers
- parameter efficient fine-tuning
- accelerate
- large language models
- 🤗 transformers is the most well-known Hugging Face package
- used predominantly for transformer-based pretrained models
- BERT, GPT, Falcon, Llama 2, etc.
- 🤗 tokenizers provide a wide range of techniques and pretrained tokenizers (BPE, word piece, …)
from transformers import AutoModelForSequenceClassification
class BertModule(nn.Module):
    def __init__(self, name, num_labels):
        super().__init__()
        self.num_labels = num_labels
        self.bert = AutoModelForSequenceClassification.from_pretrained(
            name, num_labels=self.num_labels
        )

    def forward(self, **kwargs):
        pred = self.bert(**kwargs)
        return pred.logits
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from skorch import NeuralNetClassifier
from skorch.hf import HuggingfacePretrainedTokenizer
model_name = "distilbert-base-uncased"
pipeline = Pipeline([
    ("tokenizer", HuggingfacePretrainedTokenizer(model_name)),
    ("net", NeuralNetClassifier(
        BertModule,
        module__name=model_name,
        module__num_labels=len(set(y_train)),
        criterion=nn.CrossEntropyLoss,
    )),
])
pipeline.fit(X_train, y_train)
# prints
  epoch    train_loss    valid_acc    valid_loss       dur
-------  ------------  -----------  ------------  --------
      1        1.1628       0.8338        0.5839  179.8571
      2        0.3709       0.8751        0.4214  178.7779
      3        0.1523       0.8910        0.3945  178.4507
y_pred = pipeline.predict(X_test)
print(accuracy_score(y_test, y_pred))
import torch
from sklearn.model_selection import GridSearchCV

params = {
    "net__module__name": ["distilbert-base-uncased", "bert-base-cased"],
    "net__optimizer": [torch.optim.SGD, torch.optim.Adam],
    "net__lr": [0.01, 3e-4],
    "net__max_epochs": [10, 20],
}
search = GridSearchCV(pipeline, params)
search.fit(X_train, y_train)
- 🤗 Transformers
- 🤗 Tokenizers
- skorch callbacks
- skorch tokenizers docs
- Grid searching with skorch
- Fine-tuning BERT notebook
- Fine-tuning ViT notebook
- PEFT implements several techniques to fine-tune models in an efficient manner
- Some techniques are specific to language models and rely on modifying the input (not covered)
- Other techniques, such as LoRA, work more generally
class MLP(nn.Module):
    def __init__(self, num_units_hidden=2000):
        super().__init__()
        self.seq = nn.Sequential(
            nn.Linear(20, num_units_hidden),
            nn.ReLU(),
            nn.Linear(num_units_hidden, num_units_hidden),
            nn.ReLU(),
            nn.Linear(num_units_hidden, 2),
            nn.LogSoftmax(dim=-1),
        )

    def forward(self, X):
        return self.seq(X)
import peft
# to show potential candidates for target modules
# print([(n, type(m)) for n, m in MLP().named_modules()])
config = peft.LoraConfig(
    r=8,
    target_modules=["seq.0", "seq.2"],
    modules_to_save=["seq.4"],
)
peft_model = peft.get_peft_model(MLP(), config)
# only 1.4% of parameters are trained, rest is frozen
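# the split can be verified with peft's helper method:
# peft_model.print_trainable_parameters()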
net = NeuralNetClassifier(peft_model, ...)
net.fit(X, y)
peft_model = net.module_
peft_model.save_pretrained(dir_name)
Only saves the extra LoRA parameters
478 adapter_config.json
88 README.md
145731 adapter_model.bin
---
16340459 full_model.bin
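To use the adapter later, it can be loaded back on top of a freshly initialized base model; a minimal sketch using peft's PeftModel and the dir_name from above:
import peft

base_model = MLP()
peft_model = peft.PeftModel.from_pretrained(base_model, dir_name)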
- accelerate contains many utilities for making training and inference more efficient
- Most prominently, it facilitates distributed training (DDP, FSDP, DeepSpeed, etc.)
- Also contains other utilities, like mixed precision (FP16, BF16), gradient accumulation, etc.
from accelerate import Accelerator
from skorch import NeuralNet
from skorch.hf import AccelerateMixin
class AcceleratedNet(AccelerateMixin, NeuralNet):
    """NeuralNet with accelerate support"""

accelerator = Accelerator(mixed_precision="fp16")
net = AcceleratedNet(
    MyModule,
    accelerator=accelerator,
)
net.fit(X, y)
- 🤗 Accelerate
- skorch accelerate docs
- Example notebook showing automatic mixed precision
- Example scripts showing DDP
- Since the GPT-3 release, we know that using Large Language Models (LLM) as zero/few-shot learners is a viable approach
- skorch's ZeroShotClassifier and FewShotClassifier implement zero/few-shot classification
- Use 🤗 transformers LLMs under the hood, while behaving like sklearn classifiers
from skorch.llm import ZeroShotClassifier
X, y = ...
clf = ZeroShotClassifier("bigscience/bloomz-1b1")
clf.fit(X=None, y=["negative", "positive"])
y_pred = clf.predict(X)
y_proba = clf.predict_proba(X)
my_prompt = """Your job is to analyze the sentiment of customer reviews.
The available sentiments are: {labels}
The customer review is:
```
{text}
```
Your response:"""
clf = ZeroShotClassifier("bigscience/bloomz-1b1", prompt=my_prompt)
clf.fit(X=None, y=["negative", "positive"])
predicted_labels = clf.predict(X)
from sklearn.model_selection import GridSearchCV
from skorch.llm import DEFAULT_PROMPT_ZERO_SHOT
params = {
    "model_name": ["bigscience/bloomz-1b1", "gpt2", "tiiuae/falcon-7b-instruct"],
    "prompt": [DEFAULT_PROMPT_ZERO_SHOT, my_prompt],
}
metrics = ["accuracy", "neg_log_loss"]
search = GridSearchCV(clf, param_grid=params, scoring=metrics, refit=False)
search.fit(X, y)
from skorch.llm import FewShotClassifier
X_train, y_train, X_test, y_test = ...
clf = FewShotClassifier("bigscience/bloomz-1b1", max_samples=5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)
- Drop-in replacement for sklearn classifiers
- Forces the model to output one of the provided labels
- Returns probabilities, not just generated tokens (see the example below)
- For decoder-only models, supports caching, which can lead to speed ups (does not work for encoder-decoder models)
- Big choice of models from Hugging Face
- Apart from initial model download, everything runs locally, no data sent to OpenAI or anyone else
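A minimal sketch of the probability output, assuming the usual sklearn classes_ attribute (model name as before):
clf = ZeroShotClassifier("bigscience/bloomz-1b1")
clf.fit(X=None, y=["negative", "positive"])
y_proba = clf.predict_proba(["The staff was very friendly"])
# shape (n_samples, n_labels); columns are ordered like clf.classes_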
When to use:
- When there are few labeled samples/when bootstrapping
- When you want to systematically study the best prompt, best LLM model, etc.
- When you need help with debugging bad LLM outputs
- When the problem domain requires advanced understanding (e.g. PIQA)

When not to use:
- When runtime performance or resource usage are a concern
- When there are a lot of labeled samples, supervised learning might work better
- When the task is simple, bag-of-words or similar approaches can be better even with few samples
- skorch docs on LLM classifiers
- Example notebook
- 🤗 decoder language models
- 🤗 encoder-decoder language models
- Learned how skorch helps to combine sklearn and the Hugging Face ecosystem
- What was shown is only part of what is possible
- Vision models, customized tokenizers, 🤗 Hub, safetensors, …
- Of course, the different techniques and libraries can be combined
- e.g. sklearn Pipeline + GridSearchCV + tokenizers + transformers + accelerate + PEFT (see the sketch below)
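A rough sketch of such a combination, reusing BertModule from earlier; passing a factory as the module follows the same pattern as the PEFT grid search shown before, and the LoRA target_modules names are an assumption for DistilBERT (check named_modules()):
import peft
import torch
from accelerate import Accelerator
from sklearn.pipeline import Pipeline
from skorch import NeuralNetClassifier
from skorch.hf import AccelerateMixin, HuggingfacePretrainedTokenizer

class AcceleratedClassifier(AccelerateMixin, NeuralNetClassifier):
    """NeuralNetClassifier with accelerate support"""

def create_peft_bert(name, num_labels):
    # add LoRA adapters to the pretrained classification model
    config = peft.LoraConfig(r=8, target_modules=["q_lin", "v_lin"])
    return peft.get_peft_model(BertModule(name, num_labels), config)

model_name = "distilbert-base-uncased"
pipeline = Pipeline([
    ("tokenizer", HuggingfacePretrainedTokenizer(model_name)),
    ("net", AcceleratedClassifier(
        create_peft_bert,
        module__name=model_name,
        module__num_labels=len(set(y_train)),
        criterion=torch.nn.CrossEntropyLoss,
        accelerator=Accelerator(mixed_precision="fp16"),
    )),
])
pipeline.fit(X_train, y_train)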
- Hugging Face: https://huggingface.co/
- skorch: https://github.com/skorch-dev/skorch
- presentation: https://github.com/BenjaminBossan/presentations
from sklearn.base import BaseEstimator, TransformerMixin
from transformers import ViTFeatureExtractor, ViTForImageClassification
class FeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, model_name, device="cpu"):
        self.model_name = model_name
        self.device = device

    def fit(self, X, y=None, **fit_params):
        self.extractor_ = ViTFeatureExtractor.from_pretrained(
            self.model_name, device=self.device,
        )
        return self

    def transform(self, X):
        return self.extractor_(X, return_tensors="pt")["pixel_values"]
class VitModule(nn.Module):
    # same idea as before
    ...
vit_model = "google/vit-base-patch32-224-in21k"
pipeline = Pipeline([
    ("feature_extractor", FeatureExtractor(
        vit_model,
        device=device,
    )),
    ("net", NeuralNetClassifier(
        VitModule,
        module__model_name=vit_model,
        module__num_classes=len(set(y_train)),
        criterion=nn.CrossEntropyLoss,
        device=device,
    )),
])
pipeline.fit(X_train, y_train)
- working with text often requires tokenizing it first
- 🤗 tokenizers provide a wide range of techniques and pretrained tokenizers (BPE, word piece, …)
- not only tokenization, but also truncation, padding, etc.
- works seamlessly with 🤗 transformers but also independently
Load a pretrained tokenizer wrapped inside a scikit-learn transformer.
from skorch.hf import HuggingfacePretrainedTokenizer
hf_tokenizer = HuggingfacePretrainedTokenizer("bert-base-uncased")
data = ["hello there", "this is a text"]
hf_tokenizer.fit(data) # only loads the model
hf_tokenizer.transform(data)
# returns
{
    "input_ids": tensor([[ 101, 7592, 2045, 102, 0, ...]]),
    "attention_mask": tensor([[1, 1, 1, 1, 0, ...]]),
}
Use the hyperparameters of the pretrained tokenizer to fit a new tokenizer on your own data
hf_tokenizer = HuggingfacePretrainedTokenizer(
    "bert-base-uncased", vocab_size=12345, train=True
)
data = ...
hf_tokenizer.fit(data) # fits new tokenizer on data
hf_tokenizer.transform(data)
Build your own tokenizer
from skorch.hf import HuggingfaceTokenizer
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.normalizers import Lowercase, StripAccents
from tokenizers.pre_tokenizers import Whitespace
tokenizer = HuggingfaceTokenizer(
    model__unk_token="[UNK]",
    tokenizer=Tokenizer,
    tokenizer__model=WordLevel,
    trainer="auto",
    trainer__vocab_size=1000,
    trainer__special_tokens=[
        "[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"
    ],
    normalizer=Lowercase,
    pre_tokenizer=Whitespace,
)
tokenizer.fit(data)
pipeline = Pipeline([
    ("tokenize", tokenizer),
    ("net", NeuralNetClassifier(BertModule, ...)),
])

params = {
    "tokenize__tokenizer": [Tokenizer],
    "tokenize__tokenizer__model": [WordLevel],
    "tokenize__model__unk_token": ["[UNK]"],
    "tokenize__trainer__special_tokens": [["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]],
    "tokenize__trainer__vocab_size": [500, 1000],
    "tokenize__normalizer": [Lowercase, StripAccents],
}
search = GridSearchCV(pipeline, params, refit=False)
search.fit(X, y)
from sklearn.model_selection import RandomizedSearchCV
def create_peft_model(target_modules, r=8, **kwargs):
    config = peft.LoraConfig(
        r=r, target_modules=target_modules, modules_to_save=["seq.4"]
    )
    model = MLP(**kwargs)
    return peft.get_peft_model(model, config)
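# assumed setup (not shown on this slide): the factory is passed as the module,
# so its arguments can be searched over via the module__ prefix
net = NeuralNetClassifier(create_peft_model, module__target_modules=["seq.0"])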
params = {
    "module__r": [4, 8, 16],
    "module__target_modules": [["seq.0"], ["seq.2"], ["seq.0", "seq.2"]],
    "module__num_units_hidden": [1000, 2000],
}
search = RandomizedSearchCV(net, params, n_iter=20, random_state=0)
search.fit(X, y)
# in train.py
from torch.distributed import TCPStore
from skorch.history import DistributedHistory
accelerator = Accelerator()
is_master = accelerator.is_main_process
world_size = accelerator.num_processes
rank = accelerator.local_process_index
store = TCPStore("127.0.0.1", port=8080, world_size=world_size, is_master=is_master)
dist_history = DistributedHistory(store=store, rank=rank, world_size=world_size)
model = AcceleratedNet(
    MyModule,
    accelerator=accelerator,
    history=dist_history,
    ...,
)
model.fit(X, y)
In the terminal, run: accelerate launch <args> train.py
- Hugging Face Hub is a platform to share models, datasets, demos etc.
- You can use it to store and share checkpoints of your models in the cloud for free
from huggingface_hub import HfApi
from skorch.callbacks import TrainEndCheckpoint
from skorch.hf import HfHubStorage

hf_api = HfApi()
hub_pickle_storer = HfHubStorage(
    hf_api,
    path_in_repo=<MODEL_NAME>,
    repo_id=<REPO_NAME>,
    token=<TOKEN>,
)
checkpoint = TrainEndCheckpoint(f_pickle=hub_pickle_storer)
net = NeuralNet(..., callbacks=[checkpoint])
Instead of saving the whole net, it’s also possible to save only a specific part, like the model weights.
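For example, to upload only the model weights, a storer can be passed to f_params instead of f_pickle (a sketch; the file name is arbitrary and the placeholders are as above):
hub_params_storer = HfHubStorage(
    hf_api,
    path_in_repo="weights.pt",  # hypothetical file name on the Hub
    repo_id=<REPO_NAME>,
    token=<TOKEN>,
)
checkpoint = TrainEndCheckpoint(
    f_params=hub_params_storer,
    f_optimizer=None,
    f_criterion=None,
    f_history=None,
)
net = NeuralNet(..., callbacks=[checkpoint])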
- safetensors is an increasingly popular format to save model weights
- Has some important advantages over pickle; most notably, it is safe to load safetensors files even if the source is not trusted
net = NeuralNet(...)
net.fit(X, y)
net.save_params(f_params="model.safetensors", use_safetensors=True)
new_net = NeuralNet(...) # use same arguments
new_net.initialize() # This is important!
new_net.load_params(f_params="model.safetensors", use_safetensors=True)
Small caveat: the optimizer cannot be stored with safetensors; if it's needed, use pickle for the optimizer and safetensors for the rest.
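A sketch of that split, reusing the net from above:
net.save_params(f_params="model.safetensors", use_safetensors=True)
net.save_params(f_optimizer="optimizer.pt")  # torch.save, i.e. pickle-based

new_net = NeuralNet(...)  # use same arguments
new_net.initialize()
new_net.load_params(f_params="model.safetensors", use_safetensors=True)
new_net.load_params(f_optimizer="optimizer.pt")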