Extend your scikit-learn workflow with 🤗 Hugging Face and skorch

Introduction

Extend your scikit-learn workflow with 🤗 Hugging Face and skorch

link to presentation: https://github.com/BenjaminBossan/presentations

About scikit-learn

./assets/scikit-learn.png

About skorch: overview

About skorch: ecosystem

./assets/skorch_torch_sklearn_eco_2.svg

About skorch: code

from torch import nn
from skorch import NeuralNetClassifier
from skorch.callbacks import EpochScoring

class MyModule(nn.Module):
    ...

net = NeuralNetClassifier(
    MyModule,
    max_epochs=10,
    lr=0.1,
    callbacks=[EpochScoring(scoring="roc_auc", lower_is_better=False)],
)
net.fit(X_train, y_train)
net.predict(X_test)
net.predict_proba(X_test)

About 🤗 Hugging Face

./assets/hf.png

About 🤗 Hugging Face

We’re going to look at:

  1. transformers & tokenizers
  2. parameter efficient fine-tuning
  3. accelerate
  4. large language models

Transformers & tokenizers

Intro

  • 🤗 transformers is the most well-known Hugging Face package
  • used predominantly for transformer-based pretrained models
    • BERT, GPT, Falcon, Llama 2, etc.
  • 🤗 tokenizers provides a wide range of techniques and pretrained tokenizers (BPE, WordPiece, …)

Fine-tuning a BERT model – PyTorch module

from transformers import AutoModelForSequenceClassification

class BertModule(nn.Module):
    def __init__(self, name, num_labels):
        super().__init__()
        self.num_labels = num_labels
        self.bert = AutoModelForSequenceClassification.from_pretrained(
            name, num_labels=self.num_labels
        )

    def forward(self, **kwargs):
        pred = self.bert(**kwargs)
        return pred.logits

Fine-tuning a BERT model – skorch code

from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from skorch import NeuralNetClassifier
from skorch.hf import HuggingfacePretrainedTokenizer

model_name = "distilbert-base-uncased"

pipeline = Pipeline([
    ("tokenizer", HuggingfacePretrainedTokenizer(model_name)),
    ("net", NeuralNetClassifier(
        BertModule,
        module__name=model_name,
        module__num_labels=len(set(y_train)),
        criterion=nn.CrossEntropyLoss,
    )),
])

Fine-tuning a BERT model – training and inference

pipeline.fit(X_train, y_train)

# prints
  epoch    train_loss    valid_acc    valid_loss       dur
-------  ------------  -----------  ------------  --------
      1        1.1628       0.8338        0.5839  179.8571
      2        0.3709       0.8751        0.4214  178.7779
      3        0.1523       0.8910        0.3945  178.4507

y_pred = pipeline.predict(X_test)
print(accuracy_score(y_test, y_pred))

Fine-tuning a BERT model – grid search

import torch
from sklearn.model_selection import GridSearchCV

params = {
    "net__module__name": ["distilbert-base-uncased", "bert-base-cased"],
    "net__optimizer": [torch.optim.SGD, torch.optim.Adam],
    "net__lr": [0.01, 3e-4],
    "net__max_epochs": [10, 20],
}
search = GridSearchCV(pipeline, params)
search.fit(X_train, y_train)

Further reading

PEFT: Parameter efficient fine-tuning

Intro

  • PEFT implements several techniques to fine-tune models in an efficient manner
  • Some techniques are specific to language models and rely on modifying the input (not covered)
  • Other techniques, such as LoRA, work more generally

Training a PEFT model – setup

class MLP(nn.Module):
    def __init__(self, num_units_hidden=2000):
        super().__init__()
        self.seq = nn.Sequential(
            nn.Linear(20, num_units_hidden),
            nn.ReLU(),
            nn.Linear(num_units_hidden, num_units_hidden),
            nn.ReLU(),
            nn.Linear(num_units_hidden, 2),
            nn.LogSoftmax(dim=-1),
        )

    def forward(self, X):
        return self.seq(X)

Training a PEFT model

import peft

# to show potential candidates for target modules
# print([(n, type(m)) for n, m in MLP().named_modules()])
config = peft.LoraConfig(
    r=8,
    target_modules=["seq.0", "seq.2"],
    modules_to_save=["seq.4"],
)
peft_model = peft.get_peft_model(MLP(), config)
# only 1.4% of parameters are trained, rest is frozen

net = NeuralNetClassifier(peft_model, ...)
net.fit(X, y)

Saving the PEFT model

peft_model = net.module_
peft_model.save_pretrained(dir_name)

Only saves the extra LoRA parameters

     478 adapter_config.json
      88 README.md
  145731 adapter_model.bin
     ---
16340459 full_model.bin
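
To use the adapter later, recreate the base module and load the saved LoRA weights on top of it. A minimal sketch, assuming the MLP class and dir_name from above:

import peft

base_model = MLP()  # same architecture as during training
# attach the LoRA weights that were saved with save_pretrained
peft_model = peft.PeftModel.from_pretrained(base_model, dir_name)
peft_model.eval()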

Further reading

Accelerate

Intro

Automatic mixed precision

from accelerate import Accelerator
from skorch import NeuralNet
from skorch.hf import AccelerateMixin

class AcceleratedNet(AccelerateMixin, NeuralNet):
    """NeuralNet with accelerate support"""

accelerator = Accelerator(mixed_precision="fp16")
net = AcceleratedNet(
    MyModule,
    accelerator=accelerator,
)
net.fit(X, y)

Further reading

Large language models as zero/few-shot classifiers

Intro

  • Since the release of GPT-3, we know that using large language models (LLMs) as zero/few-shot learners is a viable approach
  • skorch’s ZeroShotClassifier and FewShotClassifier implement zero/few-shot classification
  • They use 🤗 transformers LLMs under the hood while behaving like sklearn classifiers

ZeroShotClassifier – fit and predict

from skorch.llm import ZeroShotClassifier

X, y = ...
clf = ZeroShotClassifier("bigscience/bloomz-1b1")
clf.fit(X=None, y=["negative", "positive"])
y_pred = clf.predict(X)
y_proba = clf.predict_proba(X)

ZeroShotClassifier – custom prompt

my_prompt = """Your job is to analyze the sentiment of customer reviews.

The available sentiments are: {labels}

The customer review is:

```
{text}
```

Your response:"""

clf = ZeroShotClassifier("bigscience/bloomz-1b1", prompt=my_prompt)
clf.fit(X=None, y=["negative", "positive"])
predicted_labels = clf.predict(X)

ZeroShotClassifier – grid search

from sklearn.model_selection import GridSearchCV
from skorch.llm import DEFAULT_PROMPT_ZERO_SHOT

params = {
    "model_name": ["bigscience/bloomz-1b1", "gpt2", "tiiuae/falcon-7b-instruct"],
    "prompt": [DEFAULT_PROMPT_ZERO_SHOT, my_prompt],
}
metrics = ["accuracy", "neg_log_loss"]
search = GridSearchCV(clf, param_grid=params, scoring=metrics, refit=False)
search.fit(X, y)

FewShotClassifier

from skorch.llm import FewShotClassifier

X_train, y_train, X_test, y_test = ...
clf = FewShotClassifier("bigscience/bloomz-1b1", max_samples=5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

Advantages of using ZeroShotClassifier and FewShotClassifier

  • Drop-in replacement for sklearn classifiers
  • Forces the model to output one of the provided labels
  • Returns probabilities, not just generated tokens (see the sketch after this list)
  • For decoder-only models, supports caching, which can lead to speed-ups (does not work for encoder-decoder models)
  • Wide choice of models from Hugging Face
  • Apart from initial model download, everything runs locally, no data sent to OpenAI or anyone else
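
Because the classifiers follow the sklearn API, their outputs can be inspected like those of any other classifier. A minimal sketch, assuming the fitted clf and X from the previous slides and that the column order of predict_proba follows clf.classes_:

import numpy as np

y_proba = clf.predict_proba(X)   # shape: (n_samples, n_classes)
print(clf.classes_)              # column order, e.g. ['negative' 'positive']
print(np.round(y_proba[:3], 3))  # probabilities for the first three samples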

When to use

  • When there are few labeled samples/when bootstrapping
  • When you want to systematically study the best prompt, best LLM model, etc.
  • When you need help with debugging bad LLM outputs
  • When the problem domain requires advanced understanding (e.g. PIQA)

When not to use

  • When runtime performance or resource usage are a concern
  • When there are a lot of labeled samples, supervised learning might work better
  • When the task is simple, bag-of-words or similar approaches can be better even with few samples

Further reading

Wrap-up

Conclusion

  • Learned how skorch helps to combine sklearn and the Hugging Face ecosystem
  • What was shown is only part of what is possible
    • Vision models, customized tokenizers, 🤗 Hub, safetensors, …
  • Of course, the different techniques and libraries can be combined
    • e.g. sklearn Pipeline + GridSearchCV + tokenizers + transformers + accelerate + PEFT (see the sketch below)
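
For illustration, a hedged sketch that combines several of the pieces from the earlier slides into one grid search. It assumes BertModule, model_name, X_train, and y_train from above; the AcceleratedClassifier class and the LoRA target module names (DistilBERT attention projections) are assumptions, not part of the presentation:

import peft
from accelerate import Accelerator
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from torch import nn
from skorch import NeuralNetClassifier
from skorch.hf import AccelerateMixin, HuggingfacePretrainedTokenizer

class AcceleratedClassifier(AccelerateMixin, NeuralNetClassifier):
    """NeuralNetClassifier with accelerate support"""

def create_lora_bert(name, num_labels):
    # wrap the BERT module from the earlier slide in a LoRA adapter;
    # the target module names depend on the architecture
    config = peft.LoraConfig(r=8, target_modules=["q_lin", "v_lin"])
    return peft.get_peft_model(BertModule(name, num_labels), config)

pipeline = Pipeline([
    ("tokenizer", HuggingfacePretrainedTokenizer(model_name)),
    ("net", AcceleratedClassifier(
        create_lora_bert,
        module__name=model_name,
        module__num_labels=len(set(y_train)),
        criterion=nn.CrossEntropyLoss,
        accelerator=Accelerator(mixed_precision="fp16"),
    )),
])
search = GridSearchCV(pipeline, {"net__lr": [1e-4, 3e-4]})
search.fit(X_train, y_train)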

Links:

Backup slides

Vision transformer model

Fine-tuning a vision transformer model – feature extraction

from sklearn.base import BaseEstimator, TransformerMixin
from transformers import ViTFeatureExtractor, ViTForImageClassification

class FeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, model_name, device="cpu"):
        self.model_name = model_name
        self.device = device

    def fit(self, X, y=None, **fit_params):
        self.extractor_ = ViTFeatureExtractor.from_pretrained(
            self.model_name, device=self.device,
        )
        return self

    def transform(self, X):
        return self.extractor_(X, return_tensors="pt")["pixel_values"]

class VitModule(nn.Module):
    ...  # same idea as BertModule before

Fine-tuning a vision transformer model – skorch code

vit_model = "google/vit-base-patch32-224-in21k"

pipeline = Pipeline([
    ("feature_extractor", FeatureExtractor(
        vit_model,
        device=device,
    )),
    ("net", NeuralNetClassifier(
        VitModule,
        module__model_name=vit_model,
        module__num_classes=len(set(y_train)),
        criterion=nn.CrossEntropyLoss,
        device=device,
    )),
])
pipeline.fit(X_train, y_train)

Tokenizers

Intro

  • working with text often requires tokenizing it first
  • 🤗 tokenizers provides a wide range of techniques and pretrained tokenizers (BPE, WordPiece, …)
  • not only tokenization, but also truncation, padding, etc.
  • works seamlessly with 🤗 transformers, but also independently

HuggingfacePretrainedTokenizer

Load a pretrained tokenizer wrapped inside a scikit-learn transformer.

from skorch.hf import HuggingfacePretrainedTokenizer

hf_tokenizer = HuggingfacePretrainedTokenizer("bert-base-uncased")
data = ["hello there", "this is a text"]
hf_tokenizer.fit(data)  # only loads the model
hf_tokenizer.transform(data)
# returns
{
    "input_ids": tensor([[ 101, 7592, 2045,  102,    0, ...]]),
    "attention_mask": tensor([[1, 1, 1, 1, 0, ...]]),
}

HuggingfacePretrainedTokenizer – training

Use the hyperparameters of the pretrained tokenizer to fit a new tokenizer on your own data.

hf_tokenizer = HuggingfacePretrainedTokenizer(
    "bert-base-uncased", vocab_size=12345, train=True
)
data = ...
hf_tokenizer.fit(data)  # fits new tokenizer on data
hf_tokenizer.transform(data)

HuggingfaceTokenizer

Build your own tokenizer

from skorch.hf import HuggingfaceTokenizer
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.normalizers import Lowercase, StripAccents
from tokenizers.pre_tokenizers import Whitespace

tokenizer = HuggingfaceTokenizer(
    model__unk_token="[UNK]",
    tokenizer=Tokenizer,
    tokenizer__model=WordLevel,
    trainer='auto',
    trainer__vocab_size=1000,
    trainer__special_tokens=[
        "[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"
    ],
    normalizer=Lowercase,
    pre_tokenizer=Whitespace,
)
tokenizer.fit(data)

HuggingfaceTokenizer – grid search

pipeline = Pipeline([
    ("tokenize", tokenizer),
    ("net", NeuralNetClassifier(BertModule, ...)),
])

params = {
    "tokenize__tokenizer": [Tokenizer],
    "tokenize__tokenizer__model": [WordLevel],
    "tokenize__model__unk_token": ['[UNK]'],
    "tokenize__trainer__special_tokens": [['[UNK]', '[CLS]', '[SEP]', '[PAD]', '[MASK]']],
    'tokenize__trainer__vocab_size': [500, 1000],
    'tokenize__normalizer': [Lowercase, StripAccents],
}
search = GridSearchCV(pipeline, params, refit=False)
search.fit(X, y)

PEFT

Hyper-parameter search with PEFT

from sklearn.model_selection import RandomizedSearchCV

def create_peft_model(target_modules, r=8, **kwargs):
    config = peft.LoraConfig(
        r=r, target_modules=target_modules, modules_to_save=["seq.4"]
    )
    model = MLP(**kwargs)
    return peft.get_peft_model(model, config)

# the net wraps the factory function so the search can set module__ parameters
net = NeuralNetClassifier(create_peft_model)

params = {
    "module__r": [4, 8, 16],
    "module__target_modules": [["seq.0"], ["seq.2"], ["seq.0", "seq.2"]],
    "module__num_units_hidden": [1000, 2000],
}
search = RandomizedSearchCV(net, params, n_iter=20, random_state=0)
search.fit(X, y)

Accelerate

Distributed Data Parallel (DDP)

# in train.py
from torch.distributed import TCPStore
from skorch.history import DistributedHistory

accelerator = Accelerator()
is_master = accelerator.is_main_process
world_size = accelerator.num_processes
rank = accelerator.local_process_index
store = TCPStore("127.0.0.1", port=8080, world_size=world_size, is_master=is_master)
dist_history = DistributedHistory(store=store, rank=rank, world_size=world_size)
model = AcceleratedNet(
    MyModule,
    accelerator=accelerator,
    history=dist_history,
    ...,
)
model.fit(X, y)

In the terminal, run: accelerate launch <args> train.py

Hugging Face Hub

Intro

  • Hugging Face Hub is a platform to share models, datasets, demos etc.
  • You can use it to store and share checkpoints of your models in the cloud for free

Example

from huggingface_hub import HfApi
from skorch.callbacks import TrainEndCheckpoint
from skorch.hf import HfHubStorage

hf_api = HfApi()
hub_pickle_storer = HfHubStorage(
    hf_api,
    path_in_repo=<MODEL_NAME>,
    repo_id=<REPO_NAME>,
    token=<TOKEN>,
)
checkpoint = TrainEndCheckpoint(f_pickle=hub_pickle_storer)
net = NeuralNet(..., callbacks=[checkpoint])

Instead of saving the whole net, it’s also possible to save only a specific part, like the model weights.
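
For example, a minimal sketch that stores only the weights, assuming the hf_api object from above (the file name in the repo is hypothetical):

hub_params_storer = HfHubStorage(
    hf_api,
    path_in_repo="weights.pt",  # hypothetical file name in the repo
    repo_id=<REPO_NAME>,
    token=<TOKEN>,
)
# disable the other checkpoint targets so only the weights are uploaded
checkpoint = TrainEndCheckpoint(
    f_params=hub_params_storer,
    f_optimizer=None,
    f_criterion=None,
    f_history=None,
    f_pickle=None,
)
net = NeuralNet(..., callbacks=[checkpoint])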

Further reading

Safetensors

Intro

  • safetensors is an increasingly popular format to save model weights
  • Has some important advantages over pickle – most notably, it is safe to load safetensors files even if the source is not trusted

Example

net = NeuralNet(...)
net.fit(X, y)
net.save_params(f_params="model.safetensors", use_safetensors=True)

new_net = NeuralNet(...)  # use same arguments
new_net.initialize()  # This is important!
new_net.load_params(f_params="model.safetensors", use_safetensors=True)

Small caveat: The optimizer cannot be stored with safetensors; if it’s needed, use pickle for the optimizer and safetensors for the rest.
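
A minimal sketch of that split, assuming the net from above:

# model weights via safetensors, optimizer state via the default (pickle-based) torch.save
net.save_params(f_params="model.safetensors", use_safetensors=True)
net.save_params(f_optimizer="optimizer.pt")

new_net = NeuralNet(...)  # use same arguments
new_net.initialize()
new_net.load_params(f_params="model.safetensors", use_safetensors=True)
new_net.load_params(f_optimizer="optimizer.pt")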

Further reading