Google and Stanford University released a new pre-trained model called ELECTRA, which has a much more compact model size and competitive performance compared to BERT and its variants. To further accelerate research on Chinese pre-trained models, the Joint Laboratory of HIT and iFLYTEK Research (HFL) has released the Chinese ELECTRA models based on the official code of ELECTRA. ELECTRA-small reaches similar or even higher scores on several NLP tasks with only 1/10 of the parameters of BERT and its variants.
This project is based on the official code of ELECTRA: https://github.com/google-research/electra
Chinese LERT | Chinese/English PERT | Chinese MacBERT | Chinese ELECTRA | Chinese XLNet | Chinese BERT | TextBrewer | TextPruner
More resources by HFL: https://github.com/ymcui/HFL-Anthology
Mar 28, 2023 We open-sourced the Chinese LLaMA & Alpaca LLMs, which can be quickly deployed on a PC. Check: https://github.com/ymcui/Chinese-LLaMA-Alpaca
Oct 29, 2022 We released a new pre-trained model called LERT, check https://github.com/ymcui/LERT/
Mar 30, 2022 We released a new pre-trained model called PERT, check https://github.com/ymcui/PERT
Dec 17, 2021 We released a model pruning toolkit called TextPruner, check https://github.com/airaria/TextPruner
Dec 13, 2020 We released the Chinese legal ELECTRA series, check Download and Results for Crime Prediction.
Oct 22, 2020 ELECTRA-180g models released, which were trained with high-quality CommonCrawl data, check Download.
Sep 15, 2020 Our paper "Revisiting Pre-Trained Models for Chinese Natural Language Processing" was accepted to Findings of EMNLP as a long paper.
Past News
August 27, 2020 We are happy to announce that our models are on top of the GLUE benchmark, check the [leaderboard](https://gluebenchmark.com/leaderboard).
May 29, 2020 We have released Chinese ELECTRA-large/small-ex models, check Download. We are sorry that only Google Drive links are available at present.
April 7, 2020 PyTorch models are available through 🤗Transformers, check Quick Load
March 31, 2020 The models in this repository now can be easily accessed through PaddleHub, check Quick Load
March 25, 2020 We have released Chinese ELECTRA-small/base models, check Download.
Section | Description |
---|---|
Introduction | Introduction to ELECTRA |
Download | Download links for Chinese ELECTRA models |
Quick Load | Learn how to quickly load our models through 🤗Transformers or PaddleHub |
Baselines | Baseline results on MRC, Text Classification, etc. |
Usage | Detailed instructions on how to use ELECTRA |
FAQ | Frequently Asked Questions |
Citation | Citation |
ELECTRA provides a new pre-training framework with two components: a Generator and a Discriminator.
- Generator: a small MLM that predicts the original token for each [MASK] position. The generator replaces some of the tokens in the input text.
- Discriminator: detects whether each input token has been replaced. ELECTRA uses a pre-training task called Replaced Token Detection (RTD) instead of the Masked Language Model (MLM) used by BERT and its variants. Note that no Next Sentence Prediction (NSP) is applied in ELECTRA.
After the pre-training stage, we only use the Discriminator for fine-tuning on downstream tasks.
For more technical details, please check the paper: ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
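To make the RTD objective concrete, here is a minimal sketch (not part of the original release) that loads one of the discriminator checkpoints listed in the Quick Load section below and lets it score each token as original vs. replaced. It assumes 🤗Transformers and PyTorch are installed; the sample sentence is only illustrative.

```python
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

# Discriminator checkpoint taken from the Quick Load table below.
model_name = "hfl/chinese-electra-180g-small-discriminator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ElectraForPreTraining.from_pretrained(model_name)

# The discriminator outputs one logit per token; a positive logit means "replaced".
inputs = tokenizer("哈尔滨是黑龙江的省会", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, (logits[0] > 0).long().tolist())))
```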
We provide TensorFlow models at the moment.
- ELECTRA-large, Chinese: 24-layer, 1024-hidden, 16-heads, 324M parameters
- ELECTRA-base, Chinese: 12-layer, 768-hidden, 12-heads, 102M parameters
- ELECTRA-small-ex, Chinese: 24-layer, 256-hidden, 4-heads, 25M parameters
- ELECTRA-small, Chinese: 12-layer, 256-hidden, 4-heads, 12M parameters
Model | Google Drive | Baidu Disk | Size |
---|---|---|---|
ELECTRA-180g-large, Chinese | TensorFlow | TensorFlow(pw:2v5r) | 1G |
ELECTRA-180g-base, Chinese | TensorFlow | TensorFlow(pw:3vg1) | 383M |
ELECTRA-180g-small-ex, Chinese | TensorFlow | TensorFlow(pw:93n8) | 92M |
ELECTRA-180g-small, Chinese | TensorFlow | TensorFlow(pw:k9iu) | 46M |
Model | Google Drive | Baidu Disk | Size |
---|---|---|---|
ELECTRA-large, Chinese | TensorFlow | TensorFlow(pw:1e14) | 1G |
ELECTRA-base, Chinese | TensorFlow | TensorFlow(pw:f32j) | 383M |
ELECTRA-small-ex, Chinese | TensorFlow | TensorFlow(pw:gfb1) | 92M |
ELECTRA-small, Chinese | TensorFlow | TensorFlow(pw:1r4r) | 46M |
Model | Google Drive | Baidu Disk | Size |
---|---|---|---|
legal-ELECTRA-large, Chinese | TensorFlow | TensorFlow(pw:q4gv) | 1G |
legal-ELECTRA-base, Chinese | TensorFlow | TensorFlow(pw:8gcv) | 383M |
legal-ELECTRA-small, Chinese | TensorFlow | TensorFlow(pw:kmrj) | 46M |
If you need these models in PyTorch, you can either:
- Convert the TensorFlow checkpoint into PyTorch using 🤗Transformers. Please use the script convert_electra_original_tf_checkpoint_to_pytorch.py provided by 🤗Transformers. For example,
python transformers/src/transformers/convert_electra_original_tf_checkpoint_to_pytorch.py \
--tf_checkpoint_path ./path-to-large-model/ \
--config_file ./path-to-large-model/discriminator.json \
--pytorch_dump_path ./path-to-output/model.bin \
--discriminator_or_generator discriminator
- Download from https://huggingface.co/hfl
Steps: select one of the models on the page above → click "List all files in model" at the end of the model page → download the bin/json files from the pop-up window.
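As a quick sanity check, a converted checkpoint can be loaded back in PyTorch. The sketch below is an assumption-laden example, not part of the official instructions: it reuses the placeholder paths from the conversion command above and assumes discriminator.json is a 🤗Transformers-style ELECTRA config and that the conversion script wrote a plain state dict.

```python
import torch
from transformers import ElectraConfig, ElectraForPreTraining

# Placeholder paths from the conversion example above (adjust to your setup).
config = ElectraConfig.from_json_file("./path-to-large-model/discriminator.json")
model = ElectraForPreTraining(config)

# Assumption: the conversion script saved a plain PyTorch state dict.
state_dict = torch.load("./path-to-output/model.bin", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
print("Loaded discriminator with", sum(p.numel() for p in model.parameters()), "parameters")
```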
Users from mainland China are encouraged to use the Baidu Disk links, while others may use the Google Drive links.
The ZIP package includes the following files (taking ELECTRA-small, Chinese as an example):
chinese_electra_small_L-12_H-256_A-4.zip
|- checkpoint # checkpoint
|- electra_small.data-00000-of-00001 # Model weights
|- electra_small.meta # Meta info
|- electra_small.index # Index info
|- vocab.txt # Vocabulary
We use the same training data as the RoBERTa-wwm-ext series, which includes 5.4B tokens, and the same vocabulary as Chinese BERT, which contains 21,128 tokens. Other details and hyperparameter settings are listed below (the remaining settings keep their defaults):
- ELECTRA-large: 24-layers, 1024-hidden, 16-heads, lr: 2e-4, batch: 96, max_len: 512, 2M steps
- ELECTRA-base: 12-layers, 768-hidden, 12-heads, lr: 2e-4, batch: 256, max_len: 512, 1M steps
- ELECTRA-small-ex: 24-layers, 256-hidden, 4-heads, lr: 5e-4, batch: 384, max_len: 512, 2M steps
- ELECTRA-small: 12-layers, 256-hidden, 4-heads, lr: 5e-4, batch: 1024, max_len: 512, 1M steps
With Hugging Face Transformers 2.8.0 or later, the models in this repository can be easily loaded with the following code.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
The actual models and their MODEL_NAME values are listed below.
Original Model | Component | MODEL_NAME |
---|---|---|
ELECTRA-180g-large, Chinese | discriminator | hfl/chinese-electra-180g-large-discriminator |
ELECTRA-180g-large, Chinese | generator | hfl/chinese-electra-180g-large-generator |
ELECTRA-180g-base, Chinese | discriminator | hfl/chinese-electra-180g-base-discriminator |
ELECTRA-180g-base, Chinese | generator | hfl/chinese-electra-180g-base-generator |
ELECTRA-180g-small-ex, Chinese | discriminator | hfl/chinese-electra-180g-small-ex-discriminator |
ELECTRA-180g-small-ex, Chinese | generator | hfl/chinese-electra-180g-small-ex-generator |
ELECTRA-180g-small, Chinese | discriminator | hfl/chinese-electra-180g-small-discriminator |
ELECTRA-180g-small, Chinese | generator | hfl/chinese-electra-180g-small-generator |
ELECTRA-large, Chinese | discriminator | hfl/chinese-electra-large-discriminator |
ELECTRA-large, Chinese | generator | hfl/chinese-electra-large-generator |
ELECTRA-base, Chinese | discriminator | hfl/chinese-electra-base-discriminator |
ELECTRA-base, Chinese | generator | hfl/chinese-electra-base-generator |
ELECTRA-small-ex, Chinese | discriminator | hfl/chinese-electra-small-ex-discriminator |
ELECTRA-small-ex, Chinese | generator | hfl/chinese-electra-small-ex-generator |
ELECTRA-small, Chinese | discriminator | hfl/chinese-electra-small-discriminator |
ELECTRA-small, Chinese | generator | hfl/chinese-electra-small-generator |
Legal Version:
Original Model | Component | MODEL_NAME |
---|---|---|
legal-ELECTRA-large, Chinese | discriminator | hfl/chinese-legal-electra-large-discriminator |
legal-ELECTRA-large, Chinese | generator | hfl/chinese-legal-electra-large-generator |
legal-ELECTRA-base, Chinese | discriminator | hfl/chinese-legal-electra-base-discriminator |
legal-ELECTRA-base, Chinese | generator | hfl/chinese-legal-electra-base-generator |
legal-ELECTRA-small, Chinese | discriminator | hfl/chinese-legal-electra-small-discriminator |
legal-ELECTRA-small, Chinese | generator | hfl/chinese-legal-electra-small-generator |
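For example, the following minimal sketch (assuming 🤗Transformers and PyTorch are installed) loads the 180g small discriminator from the table above and extracts contextual representations; the sample sentence is only illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Any MODEL_NAME from the tables above works here.
model_name = "hfl/chinese-electra-180g-small-discriminator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("使用语言模型来预测下一个词的概率。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```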
With PaddleHub, we can download and install the model with one line of code.
import paddlehub as hub
module = hub.Module(name=MODULE_NAME)
The actual models and their MODULE_NAME values are listed below.
Original Model | MODULE_NAME |
---|---|
ELECTRA-base | chinese-electra-base |
ELECTRA-small | chinese-electra-small |
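For instance, using the small model from the table above (a minimal sketch, assuming PaddleHub is installed):

```python
import paddlehub as hub

# MODULE_NAME taken from the table above.
module = hub.Module(name="chinese-electra-small")
```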
We compare our Chinese ELECTRA models with BERT-base, BERT-wwm, BERT-wwm-ext, RoBERTa-wwm-ext, and RBT3 on six tasks:
- CMRC 2018 (Cui et al., 2019): Span-Extraction Machine Reading Comprehension (Simplified Chinese)
- DRCD (Shao et al., 2018): Span-Extraction Machine Reading Comprehension (Traditional Chinese)
- XNLI (Conneau et al., 2018): Natural Language Inference
- ChnSentiCorp: Sentiment Analysis
- LCQMC (Liu et al., 2018): Sentence Pair Matching
- BQ Corpus (Chen et al., 2018): Sentence Pair Matching
For the ELECTRA-small/base models, we use learning rates of 3e-4 and 1e-4 respectively, following the original paper.
Note that we did NOT tune the hyperparameters for each task, so you may well obtain better scores than ours.
To ensure the stability of the results, we run each experiment 10 times and report the maximum and average scores (the latter in brackets).
The CMRC 2018 dataset is released by the Joint Laboratory of HIT and iFLYTEK Research. The model should answer questions based on the given passage, in the same format as SQuAD. Evaluation metrics: EM / F1
Model | Development | Test | Challenge | #Params |
---|---|---|---|---|
BERT-base | 65.5 (64.4) / 84.5 (84.0) | 70.0 (68.7) / 87.0 (86.3) | 18.6 (17.0) / 43.3 (41.3) | 102M |
BERT-wwm | 66.3 (65.0) / 85.6 (84.7) | 70.5 (69.1) / 87.4 (86.7) | 21.0 (19.3) / 47.0 (43.9) | 102M |
BERT-wwm-ext | 67.1 (65.6) / 85.7 (85.0) | 71.4 (70.0) / 87.7 (87.0) | 24.0 (20.0) / 47.3 (44.6) | 102M |
RoBERTa-wwm-ext | 67.4 (66.5) / 87.2 (86.5) | 72.6 (71.4) / 89.4 (88.8) | 26.2 (24.6) / 51.0 (49.1) | 102M |
RBT3 | 57.0 / 79.0 | 62.2 / 81.8 | 14.7 / 36.2 | 38M |
ELECTRA-small | 63.4 (62.9) / 80.8 (80.2) | 67.8 (67.4) / 83.4 (83.0) | 16.3 (15.4) / 37.2 (35.8) | 12M |
ELECTRA-180g-small | 63.8 / 82.7 | 68.5 / 85.2 | 15.1 / 35.8 | 12M |
ELECTRA-small-ex | 66.4 / 82.2 | 71.3 / 85.3 | 18.1 / 38.3 | 25M |
ELECTRA-180g-small-ex | 68.1 / 85.1 | 71.8 / 87.2 | 20.6 / 41.7 | 25M |
ELECTRA-base | 68.4 (68.0) / 84.8 (84.6) | 73.1 (72.7) / 87.1 (86.9) | 22.6 (21.7) / 45.0 (43.8) | 102M |
ELECTRA-180g-base | 69.3 / 87.0 | 73.1 / 88.6 | 24.0 / 48.6 | 102M |
ELECTRA-large | 69.1 / 85.2 | 73.9 / 87.1 | 23.0 / 44.2 | 324M |
ELECTRA-180g-large | 68.5 / 86.2 | 73.5 / 88.5 | 21.8 / 42.9 | 324M |
DRCD is also a span-extraction machine reading comprehension dataset, released by Delta Research Center. The text is written in Traditional Chinese. Evaluation metrics: EM / F1
Model | Development | Test | #Params |
---|---|---|---|
BERT-base | 83.1 (82.7) / 89.9 (89.6) | 82.2 (81.6) / 89.2 (88.8) | 102M |
BERT-wwm | 84.3 (83.4) / 90.5 (90.2) | 82.8 (81.8) / 89.7 (89.0) | 102M |
BERT-wwm-ext | 85.0 (84.5) / 91.2 (90.9) | 83.6 (83.0) / 90.4 (89.9) | 102M |
RoBERTa-wwm-ext | 86.6 (85.9) / 92.5 (92.2) | 85.6 (85.2) / 92.0 (91.7) | 102M |
RBT3 | 76.3 / 84.9 | 75.0 / 83.9 | 38M |
ELECTRA-small | 79.8 (79.4) / 86.7 (86.4) | 79.0 (78.5) / 85.8 (85.6) | 12M |
ELECTRA-180g-small | 83.5 / 89.2 | 82.9 / 88.7 | 12M |
ELECTRA-small-ex | 84.0 / 89.5 | 83.3 / 89.1 | 25M |
ELECTRA-180g-small-ex | 87.3 / 92.3 | 86.5 / 91.3 | 25M |
ELECTRA-base | 87.5 (87.0) / 92.5 (92.3) | 86.9 (86.6) / 91.8 (91.7) | 102M |
ELECTRA-180g-base | 89.6 / 94.2 | 88.9 / 93.7 | 102M |
ELECTRA-large | 88.8 / 93.3 | 88.8 / 93.6 | 324M |
ELECTRA-180g-large | 90.1 / 94.8 | 90.5 / 94.7 | 324M |
We use XNLI data for testing the NLI task. Evaluation metric: Accuracy
Model | Development | Test | #Params |
---|---|---|---|
BERT-base | 77.8 (77.4) | 77.8 (77.5) | 102M |
BERT-wwm | 79.0 (78.4) | 78.2 (78.0) | 102M |
BERT-wwm-ext | 79.4 (78.6) | 78.7 (78.3) | 102M |
RoBERTa-wwm-ext | 80.0 (79.2) | 78.8 (78.3) | 102M |
RBT3 | 72.2 | 72.3 | 38M |
ELECTRA-small | 73.3 (72.5) | 73.1 (72.6) | 12M |
ELECTRA-180g-small | 74.6 | 74.6 | 12M |
ELECTRA-small-ex | 75.4 | 75.8 | 25M |
ELECTRA-180g-small-ex | 76.5 | 76.6 | 25M |
ELECTRA-base | 77.9 (77.0) | 78.4 (77.8) | 102M |
ELECTRA-180g-base | 79.6 | 79.5 | 102M |
ELECTRA-large | 81.5 | 81.0 | 324M |
ELECTRA-180g-large | 81.2 | 80.4 | 324M |
We use ChnSentiCorp data for testing sentiment analysis. Evaluation metric: Accuracy
Model | Development | Test | #Params |
---|---|---|---|
BERT-base | 94.7 (94.3) | 95.0 (94.7) | 102M |
BERT-wwm | 95.1 (94.5) | 95.4 (95.0) | 102M |
BERT-wwm-ext | 95.4 (94.6) | 95.3 (94.7) | 102M |
RoBERTa-wwm-ext | 95.0 (94.6) | 95.6 (94.8) | 102M |
RBT3 | 92.8 | 92.8 | 38M |
ELECTRA-small | 92.8 (92.5) | 94.3 (93.5) | 12M |
ELECTRA-180g-small | 94.1 | 93.6 | 12M |
ELECTRA-small-ex | 92.6 | 93.6 | 25M |
ELECTRA-180g-small-ex | 92.8 | 93.4 | 25M |
ELECTRA-base | 93.8 (93.0) | 94.5 (93.5) | 102M |
ELECTRA-180g-base | 94.3 | 94.8 | 102M |
ELECTRA-large | 95.2 | 95.3 | 324M |
ELECTRA-180g-large | 94.8 | 95.2 | 324M |
LCQMC is a sentence pair matching dataset, which could be seen as a binary classification task. Evaluation metric: Accuracy
Model | Development | Test | #Params |
---|---|---|---|
BERT | 89.4 (88.4) | 86.9 (86.4) | 102M |
BERT-wwm | 89.4 (89.2) | 87.0 (86.8) | 102M |
BERT-wwm-ext | 89.6 (89.2) | 87.1 (86.6) | 102M |
RoBERTa-wwm-ext | 89.0 (88.7) | 86.4 (86.1) | 102M |
RBT3 | 85.3 | 85.1 | 38M |
ELECTRA-small | 86.7 (86.3) | 85.9 (85.6) | 12M |
ELECTRA-180g-small | 86.6 | 85.8 | 12M |
ELECTRA-small-ex | 87.5 | 86.0 | 25M |
ELECTRA-180g-small-ex | 87.6 | 86.3 | 25M |
ELECTRA-base | 90.2 (89.8) | 87.6 (87.3) | 102M |
ELECTRA-180g-base | 90.2 | 87.1 | 102M |
ELECTRA-large | 90.7 | 87.3 | 324M |
ELECTRA-180g-large | 90.3 | 87.3 | 324M |
BQ Corpus is a sentence pair matching dataset, which could be seen as a binary classification task. Evaluation metric: Accuracy
Model | Development | Test | #Params |
---|---|---|---|
BERT | 86.0 (85.5) | 84.8 (84.6) | 102M |
BERT-wwm | 86.1 (85.6) | 85.2 (84.9) | 102M |
BERT-wwm-ext | 86.4 (85.5) | 85.3 (84.8) | 102M |
RoBERTa-wwm-ext | 86.0 (85.4) | 85.0 (84.6) | 102M |
RBT3 | 84.1 | 83.3 | 38M |
ELECTRA-small | 83.5 (83.0) | 82.0 (81.7) | 12M |
ELECTRA-180g-small | 83.3 | 82.1 | 12M |
ELECTRA-small-ex | 84.0 | 82.6 | 25M |
ELECTRA-180g-small-ex | 84.6 | 83.4 | 25M |
ELECTRA-base | 84.8 (84.7) | 84.5 (84.0) | 102M |
ELECTRA-180g-base | 85.8 | 84.5 | 102M |
ELECTRA-large | 86.7 | 85.1 | 324M |
ELECTRA-180g-large | 86.4 | 85.4 | 324M |
We adopt CAIL 2018 crime prediction to evaluate the performance of legal ELECTRA. Initial learning rates for small/base/large are 5e-4/3e-4/1e-4. Evaluation metric: Accuracy
Model | Development | Test | #Params |
---|---|---|---|
ELECTRA-small | 78.84 | 76.35 | 12M |
legal-ELECTRA-small | 79.60 | 77.03 | 12M |
ELECTRA-base | 80.94 | 78.41 | 102M |
legal-ELECTRA-base | 81.71 | 79.17 | 102M |
ELECTRA-large | 81.53 | 78.97 | 324M |
legal-ELECTRA-large | 82.60 | 79.89 | 324M |
Users may fine-tune ELECTRA on their own tasks. Here we only illustrate the basic usage; users are encouraged to refer to the official guidelines as well.
In this tutorial, we will fine-tune the ELECTRA-small model on the CMRC 2018 task.
- `data-dir`: working directory
- `model-name`: model name; here we use `electra-small`
- `task-name`: task name; here we use `cmrc2018`. Our code supports all six tasks, with `task-name` values `cmrc2018`, `drcd`, `xnli`, `chnsenticorp`, `lcqmc`, and `bqcorpus`.
Download the ELECTRA-small model from the Download section, and unzip the files into `${data-dir}/models/${model-name}`.
The folder should contain five files, including `electra_model.*`, `vocab.txt`, and `checkpoint`.
Download the CMRC 2018 training and development data, and rename them `train.json` and `dev.json`.
Put the two files into the `${data-dir}/finetuning_data/${task-name}` directory.
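Putting the steps above together, the working directory should look roughly like this (a sketch; only the files mentioned above are shown, and the exact weight file names depend on the downloaded model):

```
${data-dir}/
|- models/
|  |- electra-small/                        # the unzipped pre-trained model (${model-name})
|     |- checkpoint
|     |- electra_*.data / .index / .meta    # model weight files
|     |- vocab.txt
|- finetuning_data/
|  |- cmrc2018/                             # ${task-name}
|     |- train.json
|     |- dev.json
```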
python run_finetuning.py \
--data-dir ${data-dir} \
--model-name ${model-name} \
--hparams params_cmrc2018.json
The `data-dir` and `model-name` arguments are described in the previous steps. `hparams` is a JSON dictionary; in this tutorial, `params_cmrc2018.json` contains the hyperparameter settings for fine-tuning.
{
"task_names": ["cmrc2018"],
"max_seq_length": 512,
"vocab_size": 21128,
"model_size": "small",
"do_train": true,
"do_eval": true,
"write_test_outputs": true,
"num_train_epochs": 2,
"learning_rate": 3e-4,
"train_batch_size": 32,
"eval_batch_size": 32,
}
In this JSON file, we only list some of the important hyperparameters. For all hyperparameter entries, please check configure_finetuning.py.
After running the program:
- For machine reading comprehension tasks, the predicted JSON file `cmrc2018_dev_preds.json` will be saved in `${data-dir}/results/${task-name}_qa/`. You can use the evaluation script to get the final scores, e.g., `python cmrc2018_drcd_evaluate.py dev.json cmrc2018_dev_preds.json`
- For text classification tasks, the accuracy will be printed on the screen right away, e.g., `xnli: accuracy: 72.5 - loss: 0.67`
Q: How to set learning rate in finetuning stage?
A: We recommend using the learning rates from the paper as defaults (3e-4 for small, 1e-4 for base) and adjusting them according to your own task.
Note that the initial learning rate may be higher than the ones typically used for BERT or RoBERTa.
Q: Do you have PyTorch models?
A: Yes. You can check Download.
Q: Is it possible to share the training data?
A: I am sorry that it is not possible.
Q: Do you have any future plans?
A: Stay tuned!
If you find our technical report or resources useful, please cite our work in your paper.
@article{cui-etal-2021-pretrain,
  title={Pre-Training with Whole Word Masking for Chinese BERT},
  author={Cui, Yiming and Che, Wanxiang and Liu, Ting and Qin, Bing and Yang, Ziqing},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2021},
  url={https://ieeexplore.ieee.org/document/9599397},
  doi={10.1109/TASLP.2021.3124365}
}
@inproceedings{cui-etal-2020-revisiting,
title = "Revisiting Pre-Trained Models for {C}hinese Natural Language Processing",
author = "Cui, Yiming and
Che, Wanxiang and
Liu, Ting and
Qin, Bing and
Wang, Shijin and
Hu, Guoping",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.findings-emnlp.58",
pages = "657--668",
}
Follow our official WeChat account to keep updated with our latest technologies!
Before you submit an issue:
- You are advised to read the FAQ first before submitting an issue.
- Repetitive and irrelevant issues will be ignored and closed by the stale bot (stale · GitHub Marketplace). Thank you for your understanding and support.
- We cannot accommodate EVERY request, so please bear in mind that there is no guarantee that your request will be met.
- Always be polite when you submit an issue.