ELECTRA is a method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens from "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. AraELECTRA achieves state-of-the-art results on Arabic QA datasets.
For a detailed description, please refer to the AraELECTRA paper: AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding.
This repository contains code to pre-train ELECTRA. It also supports fine-tuning ELECTRA on downstream tasks, including classification tasks (e.g., GLUE), QA tasks (e.g., SQuAD), and sequence tagging tasks (e.g., text chunking).
We are releasing two pre-trained models:

Model | Layers | Attention Heads | Hidden Size | Params | HuggingFace Model Name |
---|---|---|---|---|---|
AraELECTRA-base-discriminator | 12 | 12 | 768 | 136M | araelectra-base-discriminator |
AraELECTRA-base-generator | 12 | 4 | 256 | 60M | araelectra-base-generator |
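
Both checkpoints can be loaded with the Hugging Face `transformers` library. The sketch below loads the discriminator; the full hub id (with the `aubmindlab/` organization prefix) is an assumption, so substitute whichever repository actually hosts the released weights.

```python
# Minimal sketch: load the released discriminator with Hugging Face transformers.
# The hub id "aubmindlab/araelectra-base-discriminator" is an assumption; use the
# repository that actually hosts the released weights.
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

model_name = "aubmindlab/araelectra-base-discriminator"
tokenizer = ElectraTokenizerFast.from_pretrained(model_name)
discriminator = ElectraForPreTraining.from_pretrained(model_name)

inputs = tokenizer("النص العربي هنا", return_tensors="pt")
with torch.no_grad():
    # One logit per token: higher values mean the token is predicted "fake".
    logits = discriminator(**inputs).logits
```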
Results on the Arabic question-answering benchmarks TyDiQA and ARCD:

Model | TyDiQA (EM - F1) | ARCD (EM - F1) |
---|---|---|
AraBERTv0.1 | 68.51 - 82.86 | 31.62 - 67.45 |
AraBERTv1 | 61.11 - 79.36 | 31.7 - 67.8 |
AraBERTv0.2-base | 73.07 - 85.41 | 32.76 - 66.53 |
AraBERTv2-base | 61.67 - 81.66 | 31.34 - 67.23 |
AraBERTv0.2-large | 73.72 - 86.03 | 36.89 - 71.32 |
AraBERTv2-large | 64.49 - 82.51 | 34.19 - 68.12 |
ArabicBERT-base | 67.42 - 81.24 | 30.48 - 62.24 |
ArabicBERT-large | 70.03 - 84.12 | 33.33 - 67.27 |
Arabic-ALBERT-base | 67.10 - 80.98 | 30.91 - 61.33 |
Arabic-ALBERT-large | 68.07 - 81.59 | 34.19 - 65.41 |
Arabic-ALBERT-xlarge | 71.12 - 84.59 | 37.75 - 68.03 |
AraELECTRA | 74.91 - 86.68 | 37.03 - 71.22 |
Pre-training and fine-tuning require:

- Python 3
- TensorFlow 1.15 (although we hope to support TensorFlow 2.0 at a future date)
- NumPy
- scikit-learn and SciPy (for computing some evaluation metrics).
Use `build_pretraining_dataset.py` or `build_arabert_pretraining_data.py` to create a pre-training dataset from a dump of raw text. It has the following arguments:

- `--corpus-dir`: A directory containing raw text files to turn into ELECTRA examples. A text file can contain multiple documents with empty lines separating them.
- `--vocab-file`: File defining the wordpiece vocabulary.
- `--output-dir`: Where to write out ELECTRA examples.
- `--max-seq-length`: The number of tokens per example (128 by default).
- `--num-processes`: If >1, parallelize across multiple processes (1 by default).
- `--blanks-separate-docs`: Whether blank lines indicate document boundaries (True by default).
- `--do-lower-case/--no-lower-case`: Whether to lower-case the input text (True by default).
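
For example, an illustrative invocation using the flags above (paths and values are placeholders) would be: `python3 build_pretraining_dataset.py --corpus-dir $DATA_DIR/raw_text --vocab-file $DATA_DIR/vocab.txt --output-dir $DATA_DIR/pretrain_tfrecords --max-seq-length 512 --num-processes 8`. Writing the output to `<data-dir>/pretrain_tfrecords` matches what `run_pretraining.py` expects by default.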
Use `run_pretraining.py` to pre-train an ELECTRA model. It has the following arguments:

- `--data-dir`: a directory where pre-training data, model weights, etc. are stored. By default, the training loads examples from `<data-dir>/pretrain_tfrecords` and a vocabulary from `<data-dir>/vocab.txt`.
- `--model-name`: a name for the model being trained. Model weights will be saved in `<data-dir>/models/<model-name>` by default.
- `--hparams` (optional): a JSON dict or path to a JSON file containing model hyperparameters, data paths, etc. See `configure_pretraining.py` for the supported hyperparameters.
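
For example, an illustrative pre-training run for a base-sized model (model name and hyperparameter values are placeholders; check `configure_pretraining.py` for the exact option names) would be: `python3 run_pretraining.py --data-dir $DATA_DIR --model-name araelectra_base --hparams '{"model_size": "base", "train_batch_size": 256}'`.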
If training is halted, re-running `run_pretraining.py` with the same arguments will continue the training where it left off.
You can continue pre-training from the released ELECTRA checkpoints by:

- Setting the `--model-name` to point to a downloaded model (e.g., `--model-name electra_small` if you downloaded weights to `$DATA_DIR/electra_small`).
- Setting `num_train_steps` by (for example) adding `"num_train_steps": 4010000` to the `--hparams`. This will continue training the small model for 10000 more steps (it has already been trained for 4e6 steps).
- Increasing the learning rate to account for the linear learning-rate decay. For example, to start with a learning rate of 2e-4 you should set the `learning_rate` hparam to 2e-4 * (4e6 + 10000) / 10000.
- For ELECTRA-Small, you also need to specify `"generator_hidden_size": 1.0` in the `--hparams` because we did not use a small generator for that model.
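
Putting these together, a continued pre-training run for ELECTRA-Small might look like (the learning rate is 2e-4 * (4e6 + 10000) / 10000 ≈ 0.0802 from the formula above; paths are placeholders): `python3 run_pretraining.py --data-dir $DATA_DIR --model-name electra_small --hparams '{"num_train_steps": 4010000, "learning_rate": 0.0802, "generator_hidden_size": 1.0}'`.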
To evaluate the model on a downstream task, see the fine-tuning instructions below. To evaluate the generator/discriminator on the openwebtext data, run `python3 run_pretraining.py --data-dir $DATA_DIR --model-name electra_small_owt --hparams '{"do_train": false, "do_eval": true}'`. This will print out eval metrics such as the accuracy of the generator and discriminator, and also write the metrics to `data-dir/model-name/results`.
Use `run_finetuning.py` to fine-tune and evaluate an ELECTRA model on a downstream NLP task. It expects three arguments:

- `--data-dir`: a directory where data, model weights, etc. are stored. By default, the script loads finetuning data from `<data-dir>/finetuning_data/<task-name>` and a vocabulary from `<data-dir>/vocab.txt`.
- `--model-name`: the name of the pre-trained model; the pre-trained weights should exist in `data-dir/models/model-name`.
- `--hparams`: a JSON dict containing model hyperparameters, data paths, etc. (e.g., `--hparams '{"task_names": ["rte"], "model_size": "base", "learning_rate": 1e-4, ...}'`). See `configure_pretraining.py` for the supported hyperparameters. Instead of a dict, this can also be a path to a `.json` file containing the hyperparameters (see the example after this list). You must specify the `"task_names"` and `"model_size"` (see examples below).
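
For example, you could put `{"task_names": ["squad"], "model_size": "base", "learning_rate": 3e-5}` in a file such as `$DATA_DIR/finetune_hparams.json` (file name and values are illustrative) and run `python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_base --hparams $DATA_DIR/finetune_hparams.json`.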
Eval metrics will be saved in `data-dir/model-name/results` and model weights will be saved in `data-dir/model-name/finetuning_models` by default. Evaluation is done on the dev set by default. To customize the training, add `--hparams '{"hparam1": value1, "hparam2": value2, ...}'` to the run command. Some particularly useful options:

- `"debug": true`: fine-tunes a tiny ELECTRA model for a few steps.
- `"task_names": ["task_name"]`: specifies the tasks to train on. A list because the codebase nominally supports multi-task learning (although be warned this has not been thoroughly tested).
- `"model_size"`: one of "small", "base", or "large"; determines the size of the model. You must set this to the same size as the pre-trained model.
- `"do_train"` and `"do_eval"`: train and/or evaluate a model (both are set to true by default). For using `"do_eval": true` with `"do_train": false`, you need to specify the `init_checkpoint`, e.g., `python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_base --hparams '{"model_size": "base", "task_names": ["mnli"], "do_train": false, "do_eval": true, "init_checkpoint": "<data-dir>/models/electra_base/finetuning_models/mnli_model_1"}'`
- `"num_trials": n`: if >1, does multiple fine-tuning/evaluation runs with different random seeds.
- `"learning_rate": lr`, `"train_batch_size": n`, etc. can be used to change training hyperparameters.
- `"model_hparam_overrides": {"hidden_size": n, "num_hidden_layers": m}`, etc. can be used to change the hyperparameters for the underlying transformer (the `"model_size"` flag sets the default values).
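
For example, to run three fine-tuning trials with a custom learning rate and batch size (values here are illustrative), combine the options above in one command: `python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_base --hparams '{"model_size": "base", "task_names": ["squad"], "num_trials": 3, "learning_rate": 5e-5, "train_batch_size": 16}'`.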
Get a pre-trained ELECTRA model either by training your own (see the pre-training instructions above) or by downloading the released ELECTRA weights and unzipping them under `$DATA_DIR/models` (e.g., you should have a directory `$DATA_DIR/models/electra_large` if you are using the large model).
The code supports SQuAD 1.1 and 2.0, as well as datasets in the 2019 MRQA shared task.

- ARCD: download the train/dev datasets from https://github.com/husseinmozannar/SOQAL and move them under `$DATA_DIR/finetuning_data/squadv1/(train|dev).json`.
Then run (for example): `python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_base --hparams '{"model_size": "base", "task_names": ["squad"]}'`
This repository uses the official evaluation code released by the SQuAD authors. Alternatively, you can use the `transformers` library as shown in the notebooks `ARCD_pytorch.ipynb` or `Tydiqa_ar_pytorch.ipynb` in the examples folder.
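
As a rough sketch of the `transformers` route, a fine-tuned checkpoint can be served with the question-answering pipeline; the checkpoint path below is a placeholder for whatever model the fine-tuning notebooks produce.

```python
# Sketch: run extractive QA with a fine-tuned AraELECTRA checkpoint via transformers.
# "path/to/finetuned-araelectra-qa" is a placeholder; point it at the model
# produced by the ARCD/TyDiQA fine-tuning notebooks.
from transformers import pipeline

qa = pipeline("question-answering", model="path/to/finetuned-araelectra-qa")
result = qa(question="ما هي عاصمة لبنان؟", context="بيروت هي عاصمة لبنان وأكبر مدنها.")
print(result["answer"], result["score"])
```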
Download the CoNLL-2000 text chunking dataset from here and put it under `$DATA_DIR/finetuning_data/chunk/(train|dev).txt`. Then run `python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_base --hparams '{"model_size": "base", "task_names": ["chunk"]}'`
The easiest way to run on a new task is to implement a new `finetune.task.Task`, add it to `finetune/task_builder.py`, and then use `run_finetuning.py` as normal. For classification/QA/sequence tagging, you can inherit from `finetune.classification.classification_tasks.ClassificationTask`, `finetune.qa.qa_tasks.QATask`, or `finetune.tagging.tagging_tasks.TaggingTask`, as in the sketch below.
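
As an illustration, a minimal new classification task might look like the following sketch. The constructor signature and the registration step are assumptions modeled on how existing tasks (e.g., MNLI) are defined in `classification_tasks.py` and registered in `task_builder.py`; check those files for the exact interface.

```python
# Hypothetical sketch of a new classification task; the base-class signature and
# the registration step are assumptions -- mirror an existing task such as MNLI.
from finetune.classification import classification_tasks


class ArabicSentiment(classification_tasks.ClassificationTask):
  """Binary sentiment classification; data is expected under
  <data-dir>/finetuning_data/arabic_sentiment/."""

  def __init__(self, config, tokenizer):
    super(ArabicSentiment, self).__init__(
        config, "arabic_sentiment", tokenizer, ["negative", "positive"])

  # Also override the data-reading methods the same way the existing
  # classification tasks do, so the task knows how to parse your files.


# Then register the task in finetune/task_builder.py so that
# --hparams '{"task_names": ["arabic_sentiment"], ...}' can find it.
```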
For preprocessing data, we use the same tokenizer as BERT.
@inproceedings{antoun-etal-2021-araelectra,
title = "{A}ra{ELECTRA}: Pre-Training Text Discriminators for {A}rabic Language Understanding",
author = "Antoun, Wissam and
Baly, Fady and
Hajj, Hazem",
booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
month = apr,
year = "2021",
address = "Kyiv, Ukraine (Virtual)",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.wanlp-1.20",
pages = "191--195",
}