This is an easy-to-use Python module that helps you extract BERT embeddings for large text datasets efficiently. It is intended for Bengali and English texts.
It is especially optimized for usability in limited computational setups (e.g. free Colab/Kaggle GPUs). Extracting embeddings for the IMDB dataset (a list of 25,000 texts) took less than ~28 minutes on Colab's GPU. (No rigorous benchmarking was performed, so take these numbers with a grain of salt.)
- numpy
- torch
- tqdm
- transformers
```
$ pip install git+https://github.com/khalidsaifullaah/BERTify
```
```python
from bertify import BERTify

# Example 1: Bengali Embedding Extraction
bn_bertify = BERTify(
    lang="bn",                       # language of your text
    last_four_layers_embedding=True  # to get richer embeddings
)

# By default, `batch_size` is set to 64. Set `batch_size` higher to make things
# even faster, but values above 96 may throw `CUDA out of memory` on Colab's
# GPU, so try at your own risk.
# bn_bertify.batch_size = 96

# A list of texts that we want the embeddings for; can be one or many. (You can
# turn your whole dataset into a list of texts and pass it into the method for
# faster embedding extraction.)
texts = ["বিখ্যাত হওয়ার প্রথম পদক্ষেপ", "জীবনে সবচেয়ে মূল্যবান জিনিস হচ্ছে", "বেশিরভাগ মানুষের পছন্দের জিনিস হচ্ছে"]
bn_embeddings = bn_bertify.embedding(texts)  # returns a numpy matrix
# shape of the returned matrix in this example: 3x4096
# (3 -> num. of texts, 4096 -> embedding dim.)

# Example 2: English Embedding Extraction
en_bertify = BERTify(
    lang="en",
    last_four_layers_embedding=True
)
# en_bertify.batch_size = 96

texts = ["how are you doing?", "I don't know about this.", "This is the most important thing."]
en_embeddings = en_bertify.embedding(texts)
# shape of the returned matrix in this example: 3x3072
# (3 -> num. of texts, 3072 -> embedding dim.)
```
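As a quick sanity check, here is a minimal sketch (not part of the library's API) that inspects the shapes from the examples above and shows one simple downstream use of the returned matrix, cosine similarity between two texts:

```python
import numpy as np

# Shapes reported in the examples above (3 texts each).
print(bn_embeddings.shape)  # (3, 4096) -> sahajBERT, last four layers concatenated
print(en_embeddings.shape)  # (3, 3072) -> bert-base-uncased, last four layers concatenated

# Example downstream use: cosine similarity between the first two English texts.
a, b = en_embeddings[0], en_embeddings[1]
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {cosine:.3f}")
```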
- Try passing all your text data through the `.embedding()` function at once by turning it into a list of texts.
- For faster inference, make sure you're using your Colab/Kaggle GPU while making the `.embedding()` call.
- Try increasing the `batch_size` to make it even faster. By default we're using `64` (to be on the safe side), which doesn't throw any `CUDA out of memory` errors, but I believe we can go even further. Thanks to Alex: from his empirical findings, it seems it can be pushed up to `96`. So, before making the `.embedding()` call, you can do `bertify.batch_size = 96` to set a larger `batch_size` (see the sketch after this list).
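For example, a minimal sketch of the batch-size tip above. It assumes `texts` is your full dataset and `en_bertify` is the instance from the Quick Start; the out-of-memory fallback is purely illustrative, not part of BERTify:

```python
import torch

# Push batch_size up for speed; 96 is reported to be about the limit on Colab's GPU.
en_bertify.batch_size = 96
try:
    embeddings = en_bertify.embedding(texts)
except RuntimeError as e:  # CUDA OOM surfaces as a RuntimeError in torch
    if "out of memory" in str(e).lower():
        torch.cuda.empty_cache()
        en_bertify.batch_size = 64  # fall back to the safe default
        embeddings = en_bertify.embedding(texts)
    else:
        raise
```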
A module for extracting embeddings from a BERT model for Bengali or English text datasets. For `'en'` (English data) it uses `bert-base-uncased` model embeddings; for `'bn'` (Bengali data) it uses `sahajBERT` model embeddings.

Parameters:

- `lang (str, optional)`: language of your data. Currently supports only `'en'` and `'bn'`. Defaults to `'en'`.
- `last_four_layers_embedding (bool, optional)`: the `BERT` paper reports the best results from concatenating the output of the last four layers, so if this argument is set to `True`, your embedding vector would be (for a `bert-base` model, for example) `4*768 = 3072` dimensional; otherwise it'd be `768` dimensional. Defaults to `False`.
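For instance, a small sketch based on the defaults described above (the no-argument call simply relies on those defaults):

```python
from bertify import BERTify

# Defaults: English, single-layer embeddings -> 768-dim vectors.
default_bertify = BERTify()

# Concatenating the last four layers -> 4*768 = 3072-dim vectors for bert-base.
rich_bertify = BERTify(lang="en", last_four_layers_embedding=True)
```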
The embedding function takes a list of texts, feeds them through the model, and returns the embeddings.

Parameters:

- `texts (List[str])`: a list of texts that you want to extract embeddings for (e.g. `["This movie was a total waste of time.", "Whoa! Loved this movie, totally loved all the characters"]`).

Returns:

- `np.ndarray`: a numpy matrix of shape `num_of_texts x embedding_dimension`.