Embeddings, Transformers and Transfer Learning
spaCy supports a number of transfer and multi-task learning workflows that can often help improve your pipeline’s efficiency or accuracy. Transfer learning refers to techniques such as word vector tables and language model pretraining. These techniques can be used to import knowledge from raw text into your pipeline, so that your models are able to generalize better from your annotated examples.
You can convert word vectors from popular tools like
FastText and Gensim,
or you can load in any pretrained transformer model if you install
spacy-transformers
. You can
also do your own language model pretraining via the
spacy pretrain
command. You can even share your
transformer or other contextual embedding model across multiple components,
which can make long pipelines several times more efficient. To use transfer
learning, you’ll need at least a few annotated examples for what you’re trying
to predict. Otherwise, you could try using a “one-shot learning” approach using
vectors and similarity.
Transformers are large and powerful neural networks that give you better accuracy, but are harder to deploy in production, as they require a GPU to run effectively. Word vectors are a slightly older technique that can give your models a smaller improvement in accuracy, and can also provide some additional capabilities.
The key difference between word-vectors and contextual language models such as transformers is that word vectors model lexical types, rather than tokens. If you have a list of terms with no context around them, a transformer model like BERT can’t really help you. BERT is designed to understand language in context, which isn’t what you have. A word vectors table will be a much better fit for your task. However, if you do have words in context – whole sentences or paragraphs of running text – word vectors will only provide a very rough approximation of what the text is about.
Word vectors are also very computationally efficient, as they map a word to a vector with a single indexing operation. Word vectors are therefore useful as a way to improve the accuracy of neural network models, especially models that are small or have received little or no pretraining. In spaCy, word vector tables are only used as static features. spaCy does not backpropagate gradients to the pretrained word vectors table. The static vectors table is usually used in combination with a smaller table of learned task-specific embeddings.
Word vectors are not compatible with most transformer models, but if you’re training another type of NLP network, it’s almost always worth adding word vectors to your model. As well as improving your final accuracy, word vectors often make experiments more consistent, as the accuracy you reach will be less sensitive to how the network is randomly initialized. High variance due to random chance can slow down your progress significantly, as you need to run many experiments to filter the signal from the noise.
Word vector features need to be enabled prior to training, and the same word vectors table will need to be available at runtime as well. You cannot add word vector features once the model has already been trained, and you usually cannot replace one word vectors table with another without causing a significant loss of performance.
Shared embedding layers
spaCy lets you share a single transformer or other token-to-vector (“tok2vec”) embedding layer between multiple components. You can even update the shared layer, performing multi-task learning. Reusing the tok2vec layer between components can make your pipeline run a lot faster and result in much smaller models. However, it can make the pipeline less modular and make it more difficult to swap components or retrain parts of the pipeline. Multi-task learning can affect your accuracy (either positively or negatively), and may require some retuning of your hyper-parameters.
Shared | Independent |
---|---|
smaller: models only need to include a single copy of the embeddings | larger: models need to include the embeddings for each component |
faster: embed the documents once for your whole pipeline | slower: rerun the embedding for each component |
less composable: all components require the same embedding component in the pipeline | modular: components can be moved and swapped freely |
You can share a single transformer or other tok2vec model between multiple
components by adding a Transformer
or
Tok2Vec
component near the start of your pipeline. Components
later in the pipeline can “connect” to it by including a listener layer like
Tok2VecListener within their model.
At the beginning of training, the Tok2Vec
component will grab
a reference to the relevant listener layers in the rest of your pipeline. When
it processes a batch of documents, it will pass forward its predictions to the
listeners, allowing the listeners to reuse the predictions when they are
eventually called. A similar mechanism is used to pass gradients from the
listeners back to the model. The Transformer
component and
TransformerListener layer do the same
thing for transformer models, but the Transformer
component will also save the
transformer outputs to the
Doc._.trf_data
extension attribute,
giving you access to them after the pipeline has finished running.
Example: Shared vs. independent config
The config system lets you express model configuration
for both shared and independent embedding layers. The shared setup uses a single
Tok2Vec
component with the
Tok2Vec architecture. All other components, like
the entity recognizer, use a
Tok2VecListener layer as their model’s
tok2vec
argument, which connects to the tok2vec
component model.
Shared
In the independent setup, the entity recognizer component defines its own
Tok2Vec instance. Other components will do the
same. This makes them fully independent and doesn’t require an upstream
Tok2Vec
component to be present in the pipeline.
Independent
Using transformer models
Transformers are a family of neural network architectures that compute dense,
context-sensitive representations for the tokens in your documents. Downstream
models in your pipeline can then use these representations as input features to
improve their predictions. You can connect multiple components to a single
transformer model, with any or all of those components giving feedback to the
transformer to fine-tune it to your tasks. spaCy’s transformer support
interoperates with PyTorch and the
HuggingFace transformers
library,
giving you access to thousands of pretrained models for your pipelines. There
are many great guides to
transformer models, but for practical purposes, you can simply think of them as
drop-in replacements that let you achieve higher accuracy in exchange for
higher training and runtime costs.
Setup and installation
Once you have CUDA installed, we recommend installing PyTorch following the PyTorch installation guidelines for your package manager and CUDA version. If you skip this step, pip will install PyTorch as a dependency below, but it may not find the best version for your setup.
Example: Install PyTorch 1.11.0 for CUDA 11.3 with pip
Next, install spaCy with the extras for your CUDA version and transformers. The
CUDA extra (e.g., cuda102
, cuda113
) installs the correct version of
cupy
, which is
just like numpy
, but for GPU. You may also need to set the CUDA_PATH
environment variable if your CUDA runtime is installed in a non-standard
location. Putting it all together, if you had installed CUDA 11.3 in
/opt/nvidia/cuda
, you would run:
Installation with CUDA
For transformers
v4.0.0+ and models
that require SentencePiece
(e.g.,
ALBERT, CamemBERT, XLNet, Marian, and T5), install the additional dependencies
with:
Install sentencepiece
Runtime usage
Transformer models can be used as drop-in replacements for other types of
neural networks, so your spaCy pipeline can include them in a way that’s
completely invisible to the user. Users will download, load and use the model in
the standard way, like any other spaCy pipeline. Instead of using the
transformers as subnetworks directly, you can also use them via the
Transformer
pipeline component.
The Transformer
component sets the
Doc._.trf_data
extension attribute,
which lets you access the transformers outputs at runtime. The trained
transformer-based pipelines provided by spaCy end on _trf
, e.g.
en_core_web_trf
.
Example
You can also customize how the Transformer
component sets
annotations onto the Doc
by specifying a custom
set_extra_annotations
function. This callback will be called with the raw
input and output data for the whole batch, along with the batch of Doc
objects, allowing you to implement whatever you need. The annotation setter is
called with a batch of Doc
objects and a
FullTransformerBatch
containing the
transformers data for the batch.
Training usage
The recommended workflow for training is to use spaCy’s
config system, usually via the
spacy train
command. The training config defines all
component settings and hyperparameters in one place and lets you describe a tree
of objects by referring to creation functions, including functions you register
yourself. For details on how to get started with training your own model, check
out the training quickstart.
The [components]
section in the config.cfg
describes the pipeline components and the settings used to construct them,
including their model implementation. Here’s a config snippet for the
Transformer
component, along with matching Python code. In
this case, the [components.transformer]
block describes the transformer
component:
config.cfg
The [components.transformer.model]
block describes the model
argument passed
to the transformer component. It’s a Thinc
Model
object that will be passed into the
component. Here, it references the function
spacy-transformers.TransformerModel.v3
registered in the architectures
registry. If a key
in a block starts with @
, it’s resolved to a function and all other
settings are passed to the function as arguments. In this case, name
,
tokenizer_config
and get_spans
.
get_spans
is a function that takes a batch of Doc
objects and returns lists
of potentially overlapping Span
objects to process by the transformer. Several
built-in functions are available – for example,
to process the whole document or individual sentences. When the config is
resolved, the function is created and passed into the model as an argument.
The name
value is the name of any HuggingFace model,
which will be downloaded automatically the first time it’s used. You can also
use a local file path. For full details, see the
TransformerModel
docs.
A wide variety of PyTorch models are supported, but some might not work. If a
model doesn’t seem to work feel free to open an
issue. Additionally note that
Transformers loaded in spaCy can only be used for tensors, and pretrained
task-specific heads or text generation features cannot be used as part of the
transformer
pipeline component.
Customizing the settings
To change any of the settings, you can edit the config.cfg
and re-run the
training. To change any of the functions, like the span getter, you can replace
the name of the referenced function – e.g.
@span_getters = "spacy-transformers.sent_spans.v1"
to process sentences. You
can also register your own functions using the
span_getters
registry. For instance, the following
custom function returns Span
objects following sentence
boundaries, unless a sentence succeeds a certain amount of tokens, in which case
subsentences of at most max_length
tokens are returned.
code.py
To resolve the config during training, spaCy needs to know about your custom
function. You can make it available via the --code
argument that can point to
a Python file. For more details on training with custom code, see the
training documentation.
Customizing the model implementations
The Transformer
component expects a Thinc
Model
object to be passed in as its model
argument. You’re not limited to the implementation provided by
spacy-transformers
– the only requirement is that your registered function
must return an object of type Model[List[Doc],FullTransformerBatch]: that
is, a Thinc model that takes a list of Doc
objects, and returns a
FullTransformerBatch
object with the
transformer data.
The same idea applies to task models that power the downstream components.
Most of spaCy’s built-in model creation functions support a tok2vec
argument,
which should be a Thinc layer of type Model[List[Doc], List[Floats2d]]. This
is where we’ll plug in our transformer model, using the
TransformerListener layer, which
sneakily delegates to the Transformer
pipeline component.
config.cfg (excerpt)
The TransformerListener layer expects
a pooling layer as the
argument pooling
, which needs to be of type Model[Ragged,Floats2d]. This
layer determines how the vector for each spaCy token will be computed from the
zero or more source rows the token is aligned against. Here we use the
reduce_mean
layer, which
averages the wordpiece rows. We could instead use
reduce_max
, or a custom
function you write yourself.
You can have multiple components all listening to the same transformer model,
and all passing gradients back to it. By default, all of the gradients will be
equally weighted. You can control this with the grad_factor
setting, which
lets you reweight the gradients from the different listeners. For instance,
setting grad_factor = 0
would disable gradients from one of the listeners,
while grad_factor = 2.0
would multiply them by 2. This is similar to having a
custom learning rate for each component. Instead of a constant, you can also
provide a schedule, allowing you to freeze the shared parameters at the start of
training.
Static vectors
If your pipeline includes a word vectors table, you’ll be able to use the
.similarity()
method on the Doc
, Span
,
Token
and Lexeme
objects. You’ll also be able
to access the vectors using the .vector
attribute, or you can look up one or
more vectors directly using the Vocab
object. Pipelines with
word vectors can also use the vectors as features for the statistical
models, which can improve the accuracy of your components.
Word vectors in spaCy are “static” in the sense that they are not learned
parameters of the statistical models, and spaCy itself does not feature any
algorithms for learning word vector tables. You can train a word vectors table
using tools such as floret,
Gensim, FastText or
GloVe, or download existing
pretrained vectors. The init vectors
command lets you
convert vectors for use with spaCy and will give you a directory you can load or
refer to in your training configs.
Using word vectors in your models
Many neural network models are able to use word vector tables as additional
features, which sometimes results in significant improvements in accuracy.
spaCy’s built-in embedding layer,
MultiHashEmbed, can be configured to use
word vector tables using the include_static_vectors
flag.
Creating a custom embedding layer
The MultiHashEmbed layer is spaCy’s recommended strategy for constructing initial word representations for your neural network models, but you can also implement your own. You can register any function to a string name, and then reference that function within your config (see the training docs for more details). To try this out, you can save the following little example to a new Python file:
If you pass the path to your file to the spacy train
command
using the --code
argument, your file will be imported, which means the
decorator registering the function will be run. Your function is now on equal
footing with any of spaCy’s built-ins, so you can drop it in instead of any
other model with the same input and output signature. For instance, you could
use it in the tagger model as follows:
Now that you have a custom function wired into the network, you can start implementing the logic you’re interested in. For example, let’s say you want to try a relatively simple embedding strategy that makes use of static word vectors, but combines them via summation with a smaller table of learned embeddings.
Creating a custom vectors implementation v3.7
You can specify a custom registered vectors class under [nlp.vectors]
in order
to use static vectors in formats other than the ones supported by
Vectors
. Extend the abstract BaseVectors
class to implement your custom vectors.
As an example, the following BPEmbVectors
class implements support for
BPEmb subword embeddings:
To use this in your pipeline, specify this registered function under
[nlp.vectors]
in your config:
Or specify it when creating a blank pipeline:
Remember to include this code with --code
when using
spacy train
and spacy package
.
Pretraining
The spacy pretrain
command lets you initialize your
models with information from raw text. Without pretraining, the models for
your components will usually be initialized randomly. The idea behind
pretraining is simple: random probably isn’t optimal, so if we have some text to
learn from, we can probably find a way to get the model off to a better start.
Pretraining uses the same config.cfg
file as the
regular training, which helps keep the settings and hyperparameters consistent.
The additional [pretraining]
section has several configuration subsections
that are familiar from the training block: the [pretraining.batcher]
,
[pretraining.optimizer]
and [pretraining.corpus]
all work the same way and
expect the same types of objects, although for pretraining your corpus does not
need to have any annotations, so you will often use a different reader, such as
the JsonlCorpus
.
You can add a [pretraining]
block to your config by setting the
--pretraining
flag on init config
or
init fill-config
:
You can then run spacy pretrain
with the updated config
and pass in optional config overrides, like the path to the raw text file:
The following defaults are used for the [pretraining]
block and merged into
your existing config when you run init config
or
init fill-config
with --pretraining
. If needed,
you can configure the settings and hyperparameters or
change the objective.
explosion/spaCy/master/spacy/default_config_pretraining.cfg
How pretraining works
The impact of spacy pretrain
varies, but it will usually
be worth trying if you’re not using a transformer model and you have
relatively little training data (for instance, fewer than 5,000 sentences).
A good rule of thumb is that pretraining will generally give you a similar
accuracy improvement to using word vectors in your model. If word vectors have
given you a 10% error reduction, pretraining with spaCy might give you another
10%, for a 20% error reduction in total.
The spacy pretrain
command will take a specific
subnetwork within one of your components, and add additional layers to build a
network for a temporary task that forces the model to learn something about
sentence structure and word cooccurrence statistics.
Pretraining produces a binary weights file that can be loaded back in at the
start of training, using the configuration option initialize.init_tok2vec
. The
weights file specifies an initial set of weights. Training then proceeds as
normal.
You can only pretrain one subnetwork from your pipeline at a time, and the
subnetwork must be typed Model[List[Doc], List[Floats2d]] (i.e. it has to be
a “tok2vec” layer). The most common workflow is to use the
Tok2Vec
component to create a shared token-to-vector layer for
several components of your pipeline, and apply pretraining to its whole model.
Configuring the pretraining
The spacy pretrain
command is configured using the
[pretraining]
section of your config file. The
component
and layer
settings tell spaCy how to find the subnetwork to
pretrain. The layer
setting should be either the empty string (to use the
whole model), or a
node reference. Most of
spaCy’s built-in model architectures have a reference named "tok2vec"
that
will refer to the right layer.
config.cfg
Connecting pretraining to training
To benefit from pretraining, your training step needs to know to initialize its
tok2vec
component with the weights learned from the pretraining step. You do
this by setting initialize.init_tok2vec
to the filename of the .bin
file
that you want to use from pretraining.
A pretraining step that runs for 5 epochs with an output path of pretrain/
, as
an example, produces pretrain/model0.bin
through pretrain/model4.bin
plus a
copy of the last iteration as pretrain/model-last.bin
. Additionally, you can
configure n_save_epoch
to tell pretraining in which epoch interval it should
save the current training progress. To use the final output to initialize your
tok2vec
layer, you could fill in this value in your config file:
config.cfg
Pretraining objectives
Two pretraining objectives are available, both of which are variants of the
cloze task Devlin et al. (2018) introduced
for BERT. The objective can be defined and configured via the
[pretraining.objective]
config block.
-
PretrainCharacters
: The"characters"
objective asks the model to predict some number of leading and trailing UTF-8 bytes for the words. For instance, settingn_characters = 2
, the model will try to predict the first two and last two characters of the word. -
PretrainVectors
: The"vectors"
objective asks the model to predict the word’s vector, from a static embeddings table. This requires a word vectors model to be trained and loaded. The vectors objective can optimize either a cosine or an L2 loss. We’ve generally found cosine loss to perform better.
These pretraining objectives use a trick that we term language modelling with approximate outputs (LMAO). The motivation for the trick is that predicting an exact word ID introduces a lot of incidental complexity. You need a large output layer, and even then, the vocabulary is too large, which motivates tokenization schemes that do not align to actual word boundaries. At the end of training, the output layer will be thrown away regardless: we just want a task that forces the network to model something about word cooccurrence statistics. Predicting leading and trailing characters does that more than adequately, as the exact word sequence could be recovered with high accuracy if the initial and trailing characters are predicted accurately. With the vectors objective, the pretraining uses the embedding space learned by an algorithm such as GloVe or Word2vec, allowing the model to focus on the contextual modelling we actual care about.