Extend your scikit-learn workflow with 🤗 Hugging Face and skorch

Introduction

Extend your scikit-learn workflow with 🤗 Hugging Face and skorch

link to presentation: https://github.com/BenjaminBossan/presentations

About scikit-learn

./assets/scikit-learn.png

About skorch: overview

About skorch: ecosystem

./assets/skorch_torch_sklearn_eco_2.svg

About skorch: code

from torch import nn
from skorch import NeuralNetClassifier
from skorch.callbacks import EpochScoring

class MyModule(nn.Module):
    ...

net = NeuralNetClassifier(
    MyModule,
    max_epochs=10,
    lr=0.1,
    callbacks=[EpochScoring(scoring="roc_auc", lower_is_better=False)],
)
net.fit(X_train, y_train)
net.predict(X_test)
net.predict_proba(X_test)

About 🤗 Hugging Face

./assets/hf.png

About 🤗 Hugging Face

We’re going to look at:

  1. transformers & tokenizers
  2. parameter efficient fine-tuning
  3. accelerate
  4. large language models

Transformers & tokenizers

Intro

  • 🤗 transformers is the most well-known Hugging Face package
  • used predominantly for transformer-based pretrained models
    • BERT, GPT, Falcon, Llama 2, etc.
  • 🤗 tokenizers provides a wide range of techniques and pretrained tokenizers (BPE, WordPiece, …)

Fine-tuning a BERT model – PyTorch module

from transformers import AutoModelForSequenceClassification

class BertModule(nn.Module):
    def __init__(self, name, num_labels):
        super().__init__()
        self.num_labels = num_labels
        self.bert = AutoModelForSequenceClassification.from_pretrained(
            name, num_labels=self.num_labels
        )

    def forward(self, **kwargs):
        pred = self.bert(**kwargs)
        return pred.logits

Fine-tuning a BERT model – skorch code

from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from skorch import NeuralNetClassifier
from skorch.hf import HuggingfacePretrainedTokenizer

model_name = "distilbert-base-uncased"

pipeline = Pipeline([
    ("tokenizer", HuggingfacePretrainedTokenizer(model_name)),
    ("net", NeuralNetClassifier(
        BertModule,
        module__name=model_name,
        module__num_labels=len(set(y_train)),
        criterion=nn.CrossEntropyLoss,
    )),
])

Fine-tuning a BERT model – training and inference

pipeline.fit(X_train, y_train)

# prints
  epoch    train_loss    valid_acc    valid_loss       dur
-------  ------------  -----------  ------------  --------
      1        1.1628       0.8338        0.5839  179.8571
      2        0.3709       0.8751        0.4214  178.7779
      3        0.1523       0.8910        0.3945  178.4507

y_pred = pipeline.predict(X_test)
print(accuracy_score(y_test, y_pred))

Fine-tuning a BERT model – grid search

import torch
from sklearn.model_selection import GridSearchCV

params = {
    "net__module__name": ["distilbert-base-uncased", "bert-base-cased"],
    "net__optimizer": [torch.optim.SGD, torch.optim.Adam],
    "net__lr": [0.01, 3e-4],
    "net__max_epochs": [10, 20],
}
search = GridSearchCV(pipeline, params)
search.fit(X_train, y_train)

Further reading

PEFT: Parameter efficient fine-tuning

Intro

  • PEFT implements several techniques to fine-tune models in an efficient manner
  • Some techniques are specific to language models and rely on modifying the input (not covered)
  • Other techniques, such as LoRA, work more generally

Training a PEFT model – setup

class MLP(nn.Module):
    def __init__(self, num_units_hidden=2000):
        super().__init__()
        self.seq = nn.Sequential(
            nn.Linear(20, num_units_hidden),
            nn.ReLU(),
            nn.Linear(num_units_hidden, num_units_hidden),
            nn.ReLU(),
            nn.Linear(num_units_hidden, 2),
            nn.LogSoftmax(dim=-1),
        )

    def forward(self, X):
        return self.seq(X)

Training a PEFT model

import peft

# to show potential candidates for target modules
# print([(n, type(m)) for n, m in MLP().named_modules()])
config = peft.LoraConfig(
    r=8,
    target_modules=["seq.0", "seq.2"],
    modules_to_save=["seq.4"],
)
peft_model = peft.get_peft_model(MLP(), config)
# only 1.4% of parameters are trained, rest is frozen

net = NeuralNetClassifier(peft_model, ...)
net.fit(X, y)

Saving the PEFT model

peft_model = net.module_
peft_model.save_pretrained(dir_name)

Only saves the extra LoRA parameters

     478 adapter_config.json
      88 README.md
  145731 adapter_model.bin
     ---
16340459 full_model.bin
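
To use the adapter later, recreate the base module and load the saved LoRA weights on top of it. A minimal sketch, assuming the MLP class and dir_name from above:

import peft

base_model = MLP()  # same architecture as during training
# attach the LoRA weights that were saved with save_pretrained
peft_model = peft.PeftModel.from_pretrained(base_model, dir_name)
peft_model.eval()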

Further reading

Accelerate

Intro

Automatic mixed precision

from accelerate import Accelerator
from skorch import NeuralNet
from skorch.hf import AccelerateMixin

class AcceleratedNet(AccelerateMixin, NeuralNet):
    """NeuralNet with accelerate support"""

accelerator = Accelerator(mixed_precision="fp16")
net = AcceleratedNet(
    MyModule,
    accelerator=accelerator,
)
net.fit(X, y)

Further reading

Large language models as zero/few-shot classifiers

Intro

  • Since the release of GPT-3, we know that using large language models (LLMs) as zero/few-shot learners is a viable approach
  • skorch’s ZeroShotClassifier and FewShotClassifier implement zero/few-shot classification
  • They use 🤗 transformers LLMs under the hood while behaving like sklearn classifiers

ZeroShotClassifier – fit and predict

from skorch.llm import ZeroShotClassifier

X, y = ...
clf = ZeroShotClassifier("bigscience/bloomz-1b1")
clf.fit(X=None, y=["negative", "positive"])
y_pred = clf.predict(X)
y_proba = clf.predict_proba(X)

ZeroShotClassifier – custom prompt

my_prompt = """Your job is to analyze the sentiment of customer reviews.

The available sentiments are: {labels}

The customer review is:

```
{text}
```

Your response:"""

clf = ZeroShotClassifier("bigscience/bloomz-1b1", prompt=my_prompt)
clf.fit(X=None, y=["negative", "positive"])
predicted_labels = clf.predict(X)

ZeroShotClassifier – grid search

from sklearn.model_selection import GridSearchCV
from skorch.llm import DEFAULT_PROMPT_ZERO_SHOT

params = {
    "model_name": ["bigscience/bloomz-1b1", "gpt2", "tiiuae/falcon-7b-instruct"],
    "prompt": [DEFAULT_PROMPT_ZERO_SHOT, my_prompt],
}
metrics = ["accuracy", "neg_log_loss"]
search = GridSearchCV(clf, param_grid=params, scoring=metrics, refit=False)
search.fit(X, y)

FewShotClassifier

from skorch.llm import FewShotClassifier

X_train, y_train, X_test, y_test = ...
clf = FewShotClassifier("bigscience/bloomz-1b1", max_samples=5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

Advantages of using ZeroShotClassifier and FewShotClassifier

  • Drop-in replacement for sklearn classifiers
  • Forces the model to output one of the provided labels
  • Returns probabilities, not just generated tokens (see the sketch after this list)
  • For decoder-only models, supports caching, which can lead to speed-ups (does not work for encoder-decoder models)
  • Wide choice of models from Hugging Face
  • Apart from initial model download, everything runs locally, no data sent to OpenAI or anyone else
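
Because the classifiers follow the sklearn API, their outputs can be inspected like those of any other classifier. A minimal sketch, assuming the fitted clf and X from the previous slides and that the column order of predict_proba follows clf.classes_:

import numpy as np

y_proba = clf.predict_proba(X)   # shape: (n_samples, n_classes)
print(clf.classes_)              # column order, e.g. ['negative' 'positive']
print(np.round(y_proba[:3], 3))  # probabilities for the first three samples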

When to use

  • When there are few labeled samples/when bootstrapping
  • When you want to systematically study the best prompt, best LLM model, etc.
  • When you need help with debugging bad LLM outputs
  • When the problem domain requires advanced understanding (e.g. PIQA)

When not to use

  • When runtime performance or resource usage are a concern
  • When there are a lot of labeled samples, supervised learning might work better
  • When the task is simple, bag-of-words or similar approaches can be better even with few samples

Further reading

Wrap-up

Conclusion

  • Learned how skorch helps to combine sklearn and the Hugging Face ecosystem
  • What was shown is only part of what is possible
    • Vision models, customized tokenizers, 🤗 Hub, safetensors, …
  • Of course, the different techniques and libraries can be combined
    • e.g. sklearn Pipeline + GridSearchCV + tokenizers + transformers + accelerate + PEFT (see the sketch below)
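
For illustration, a hedged sketch that combines several of the pieces from the earlier slides into one grid search. It assumes BertModule, model_name, X_train, and y_train from above; the AcceleratedClassifier class and the LoRA target module names (DistilBERT attention projections) are assumptions, not part of the presentation:

import peft
from accelerate import Accelerator
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from torch import nn
from skorch import NeuralNetClassifier
from skorch.hf import AccelerateMixin, HuggingfacePretrainedTokenizer

class AcceleratedClassifier(AccelerateMixin, NeuralNetClassifier):
    """NeuralNetClassifier with accelerate support"""

def create_lora_bert(name, num_labels):
    # wrap the BERT module from the earlier slide in a LoRA adapter;
    # the target module names depend on the architecture
    config = peft.LoraConfig(r=8, target_modules=["q_lin", "v_lin"])
    return peft.get_peft_model(BertModule(name, num_labels), config)

pipeline = Pipeline([
    ("tokenizer", HuggingfacePretrainedTokenizer(model_name)),
    ("net", AcceleratedClassifier(
        create_lora_bert,
        module__name=model_name,
        module__num_labels=len(set(y_train)),
        criterion=nn.CrossEntropyLoss,
        accelerator=Accelerator(mixed_precision="fp16"),
    )),
])
search = GridSearchCV(pipeline, {"net__lr": [1e-4, 3e-4]})
search.fit(X_train, y_train)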

Links:

Backup slides

Vision transformer model

Fine-tuning a vision transformer model – feature extraction

from sklearn.base import BaseEstimator, TransformerMixin
from transformers import ViTFeatureExtractor, ViTForImageClassification

class FeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, model_name, device="cpu"):
        self.model_name = model_name
        self.device = device

    def fit(self, X, y=None, **fit_params):
        self.extractor_ = ViTFeatureExtractor.from_pretrained(
            self.model_name, device=self.device,
        )
        return self

    def transform(self, X):
        return self.extractor_(X, return_tensors="pt")["pixel_values"]

class VitModule(nn.Module):
    ...  # same idea as BertModule before

Fine-tuning a vision transformer model – skorch code

vit_model = "google/vit-base-patch32-224-in21k"

pipeline = Pipeline([
    ("feature_extractor", FeatureExtractor(
        vit_model,
        device=device,
    )),
    ("net", NeuralNetClassifier(
        VitModule,
        module__model_name=vit_model,
        module__num_classes=len(set(y_train)),
        criterion=nn.CrossEntropyLoss,
        device=device,
    )),
])
pipeline.fit(X_train, y_train)

Tokenizers

Intro

  • working with text often requires tokenizing it first
  • 🤗 tokenizers provides a wide range of techniques and pretrained tokenizers (BPE, WordPiece, …)
  • not only tokenization, but also truncation, padding, etc.
  • works seamlessly with 🤗 transformers, but also independently

HuggingfacePretrainedTokenizer

Load a pretrained tokenizer wrapped inside a scikit-learn transformer.

from skorch.hf import HuggingfacePretrainedTokenizer

hf_tokenizer = HuggingfacePretrainedTokenizer("bert-base-uncased")
data = ["hello there", "this is a text"]
hf_tokenizer.fit(data)  # only loads the model
hf_tokenizer.transform(data)
# returns
{
    "input_ids": tensor([[ 101, 7592, 2045,  102,    0, ...]]),
    "attention_mask": tensor([[1, 1, 1, 1, 0, ...]]),
}

HuggingfacePretrainedTokenizer – training

Use the hyperparameters of the pretrained tokenizer to fit a new tokenizer on your own data.

hf_tokenizer = HuggingfacePretrainedTokenizer(
    "bert-base-uncased", vocab_size=12345, train=True
)
data = ...
hf_tokenizer.fit(data)  # fits new tokenizer on data
hf_tokenizer.transform(data)

HuggingfaceTokenizer

Build your own tokenizer

from skorch.hf import HuggingfaceTokenizer
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.normalizers import Lowercase, StripAccents
from tokenizers.pre_tokenizers import Whitespace

tokenizer = HuggingfaceTokenizer(
    model__unk_token="[UNK]",
    tokenizer=Tokenizer,
    tokenizer__model=WordLevel,
    trainer='auto',
    trainer__vocab_size=1000,
    trainer__special_tokens=[
        "[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"
    ],
    normalizer=Lowercase,
    pre_tokenizer=Whitespace,
)
tokenizer.fit(data)

HuggingfaceTokenizer – grid search

pipeline = Pipeline([
    ("tokenize", tokenizer),
    ("net", NeuralNetClassifier(BertModule, ...)),
])

params = {
    "tokenize__tokenizer": [Tokenizer],
    "tokenize__tokenizer__model": [WordLevel],
    "tokenize__model__unk_token": ['[UNK]'],
    "tokenize__trainer__special_tokens": [['[UNK]', '[CLS]', '[SEP]', '[PAD]', '[MASK]']],
    'tokenize__trainer__vocab_size': [500, 1000],
    'tokenize__normalizer': [Lowercase, StripAccents],
}
search = GridSearchCV(pipeline, params, refit=False)
search.fit(X, y)

PEFT

Hyper-parameter search with PEFT

from sklearn.model_selection import RandomizedSearchCV

def create_peft_model(target_modules, r=8, **kwargs):
    config = peft.LoraConfig(
        r=r, target_modules=target_modules, modules_to_save=["seq.4"]
    )
    model = MLP(**kwargs)
    return peft.get_peft_model(model, config)

# the net wraps the factory function so the search can set module__ parameters
net = NeuralNetClassifier(create_peft_model)

params = {
    "module__r": [4, 8, 16],
    "module__target_modules": [["seq.0"], ["seq.2"], ["seq.0", "seq.2"]],
    "module__num_units_hidden": [1000, 2000],
}
search = RandomizedSearchCV(net, params, n_iter=20, random_state=0)
search.fit(X, y)

Accelerate

Distributed Data Parallel (DDP)

# in train.py
from torch.distributed import TCPStore
from skorch.history import DistributedHistory

accelerator = Accelerator()
is_master = accelerator.is_main_process
world_size = accelerator.num_processes
rank = accelerator.local_process_index
store = TCPStore("127.0.0.1", port=8080, world_size=world_size, is_master=is_master)
dist_history = DistributedHistory(store=store, rank=rank, world_size=world_size)
model = AcceleratedNet(
    MyModule,
    accelerator=accelerator,
    history=dist_history,
    ...,
)
model.fit(X, y)

In the terminal, run: accelerate launch <args> train.py

Hugging Face Hub

Intro

  • Hugging Face Hub is a platform to share models, datasets, demos etc.
  • You can use it to store and share checkpoints of your models in the cloud for free

Example

from huggingface_hub import HfApi
from skorch.callbacks import TrainEndCheckpoint
from skorch.hf import HfHubStorage

hf_api = HfApi()
hub_pickle_storer = HfHubStorage(
    hf_api,
    path_in_repo=<MODEL_NAME>,
    repo_id=<REPO_NAME>,
    token=<TOKEN>,
)
checkpoint = TrainEndCheckpoint(f_pickle=hub_pickle_storer)
net = NeuralNet(..., callbacks=[checkpoint])

Instead of saving the whole net, it’s also possible to save only a specific part, like the model weights.
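
For example, a minimal sketch that stores only the weights, assuming the hf_api object from above (the file name in the repo is hypothetical):

hub_params_storer = HfHubStorage(
    hf_api,
    path_in_repo="weights.pt",  # hypothetical file name in the repo
    repo_id=<REPO_NAME>,
    token=<TOKEN>,
)
# disable the other checkpoint targets so only the weights are uploaded
checkpoint = TrainEndCheckpoint(
    f_params=hub_params_storer,
    f_optimizer=None,
    f_criterion=None,
    f_history=None,
    f_pickle=None,
)
net = NeuralNet(..., callbacks=[checkpoint])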

Further reading

Safetensors

Intro

  • safetensors is an increasingly popular format to save model weights
  • Has some important advantages over pickle – most notably, it is safe to load safetensors files even if the source is not trusted

Example

net = NeuralNet(...)
net.fit(X, y)
net.save_params(f_params="model.safetensors", use_safetensors=True)

new_net = NeuralNet(...)  # use same arguments
new_net.initialize()  # This is important!
new_net.load_params(f_params="model.safetensors", use_safetensors=True)

Small caveat: The optimizer cannot be stored with safetensors; if it’s needed, use pickle for the optimizer and safetensors for the rest.
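
A minimal sketch of that split, assuming the net from above:

# model weights via safetensors, optimizer state via the default (pickle-based) torch.save
net.save_params(f_params="model.safetensors", use_safetensors=True)
net.save_params(f_optimizer="optimizer.pt")

new_net = NeuralNet(...)  # use same arguments
new_net.initialize()
new_net.load_params(f_params="model.safetensors", use_safetensors=True)
new_net.load_params(f_optimizer="optimizer.pt")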

Further reading