Multimodal (text + layout/format + image) pre-training for multilingual Document AI
LayoutXLM is a multimodal pre-trained model for multilingual document understanding that aims to bridge language barriers in visually-rich document understanding. Experimental results show that it significantly outperforms the existing SOTA cross-lingual pre-trained models on the XFUND dataset.
LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding. Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei. arXiv preprint, 2021.
layoutxlm-base | huggingface
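As a quick sanity check that the checkpoint loads, it can be used through the HuggingFace transformers integration (LayoutXLM reuses the LayoutLMv2 model classes together with its own processor). The snippet below is a minimal sketch, not code from this repo: it assumes a transformers version with the LayoutXLM/LayoutLMv2 classes and detectron2 installed for the visual backbone, and it uses a placeholder image (`document.png`) plus made-up words and boxes.

```python
# Minimal sketch (not from this repo): load the pre-trained checkpoint with
# HuggingFace transformers. LayoutXLM reuses the LayoutLMv2 model classes and
# ships its own sentencepiece-based processor.
from PIL import Image
from transformers import LayoutXLMProcessor, LayoutLMv2Model

# apply_ocr=False: we supply our own words and boxes instead of running
# Tesseract inside the processor.
processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base", apply_ocr=False)
model = LayoutLMv2Model.from_pretrained("microsoft/layoutxlm-base")

image = Image.open("document.png").convert("RGB")            # placeholder document image
words = ["Invoice", "Total", "100"]                          # placeholder OCR words
boxes = [[80, 40, 200, 70], [80, 90, 160, 120], [200, 90, 260, 120]]  # boxes normalized to 0-1000

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
outputs = model(**encoding)
print(outputs.last_hidden_state.shape)  # (1, text tokens + visual tokens, hidden size)
```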
Fine-tuning Example on XFUND
Please refer to layoutlmft for installation and the fine-tuning scripts; the commands below are run from the layoutlmft directory.
Semantic entity recognition (SER):

```bash
cd layoutlmft
python -m torch.distributed.launch --nproc_per_node=4 examples/run_xfun_ser.py \
    --model_name_or_path microsoft/layoutxlm-base \
    --output_dir /tmp/test-ner \
    --do_train \
    --do_eval \
    --lang zh \
    --max_steps 1000 \
    --warmup_ratio 0.1 \
    --fp16
```
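Once fine-tuning finishes, the checkpoint written to `--output_dir` can be used for SER prediction on a single document. The snippet below is only a rough sketch under the assumption that the saved checkpoint loads with the transformers `LayoutLMv2ForTokenClassification` class and carries the XFUND label map; the inputs (`form.png`, words, boxes) are placeholders, and the in-repo evaluation is what `--do_eval` above already runs.

```python
# Illustrative sketch only: SER inference with a fine-tuned checkpoint
# (assumes /tmp/test-ner is the output_dir of run_xfun_ser.py above and is
# compatible with the transformers LayoutLMv2 classes).
import torch
from PIL import Image
from transformers import LayoutXLMProcessor, LayoutLMv2ForTokenClassification

processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base", apply_ocr=False)
model = LayoutLMv2ForTokenClassification.from_pretrained("/tmp/test-ner")

image = Image.open("form.png").convert("RGB")         # placeholder scanned form
words = ["姓名", "张三"]                               # placeholder OCR words
boxes = [[100, 50, 180, 80], [200, 50, 280, 80]]      # boxes normalized to 0-1000

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
with torch.no_grad():
    logits = model(**encoding).logits                 # (1, text sequence length, num_labels)
pred_ids = logits.argmax(-1).squeeze(0).tolist()
print([model.config.id2label[i] for i in pred_ids])   # BIO tags over sub-word tokens
```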
Relation extraction (RE):

```bash
cd layoutlmft
python -m torch.distributed.launch --nproc_per_node=4 examples/run_xfun_re.py \
    --model_name_or_path microsoft/layoutxlm-base \
    --output_dir /tmp/test-ner \
    --do_train \
    --do_eval \
    --lang zh \
    --max_steps 2500 \
    --per_device_train_batch_size 2 \
    --warmup_ratio 0.1 \
    --fp16
```
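The tables below report F1 scores per language. For SER this is the usual entity-level F1 over BIO-tagged spans; the toy example below shows how such a score is computed with seqeval. This mirrors how HuggingFace token-classification examples evaluate and is an assumption on my part; the exact metric code in layoutlmft may differ in detail.

```python
# Toy illustration of entity-level F1 in the BIO scheme with seqeval
# (hypothetical tags; XFUND uses HEADER/QUESTION/ANSWER entity types).
from seqeval.metrics import f1_score

y_true = [["B-QUESTION", "I-QUESTION", "B-ANSWER", "O"]]   # gold: one QUESTION span, one ANSWER span
y_pred = [["B-QUESTION", "I-QUESTION", "O", "O"]]          # predicted: the QUESTION span only
print(f1_score(y_true, y_pred))  # precision 1.0, recall 0.5 -> F1 ≈ 0.67
```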
Results on XFUND
Language-specific fine-tuning

| Task | Model | FUNSD | ZH | JA | ES | FR | IT | DE | PT | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Semantic Entity Recognition | xlm-roberta-base | 0.667 | 0.8774 | 0.7761 | 0.6105 | 0.6743 | 0.6687 | 0.6814 | 0.6818 | 0.7047 |
|  | infoxlm-base | 0.6852 | 0.8868 | 0.7865 | 0.6230 | 0.7015 | 0.6751 | 0.7063 | 0.7008 | 0.7207 |
|  | layoutxlm-base | 0.794 | 0.8924 | 0.7921 | 0.7550 | 0.7902 | 0.8082 | 0.8222 | 0.7903 | 0.8056 |
| Relation Extraction | xlm-roberta-base | 0.2659 | 0.5105 | 0.5800 | 0.5295 | 0.4965 | 0.5305 | 0.5041 | 0.3982 | 0.4769 |
|  | infoxlm-base | 0.2920 | 0.5214 | 0.6000 | 0.5516 | 0.4913 | 0.5281 | 0.5262 | 0.4170 | 0.4910 |
|  | layoutxlm-base | 0.5483 | 0.7073 | 0.6963 | 0.6896 | 0.6353 | 0.6415 | 0.6551 | 0.5718 | 0.6432 |
Zero-shot transfer learning

| Task | Model | FUNSD | ZH | JA | ES | FR | IT | DE | PT | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| SER | xlm-roberta-base | 0.667 | 0.4144 | 0.3023 | 0.3055 | 0.371 | 0.2767 | 0.3286 | 0.3936 | 0.3824 |
|  | infoxlm-base | 0.6852 | 0.4408 | 0.3603 | 0.3102 | 0.4021 | 0.2880 | 0.3587 | 0.4502 | 0.4119 |
|  | layoutxlm-base | 0.794 | 0.6019 | 0.4715 | 0.4565 | 0.5757 | 0.4846 | 0.5252 | 0.539 | 0.5561 |
| RE | xlm-roberta-base | 0.2659 | 0.1601 | 0.2611 | 0.2440 | 0.2240 | 0.2374 | 0.2288 | 0.1996 | 0.2276 |
|  | infoxlm-base | 0.2920 | 0.2405 | 0.2851 | 0.2481 | 0.2454 | 0.2193 | 0.2027 | 0.2049 | 0.2423 |
|  | layoutxlm-base | 0.5483 | 0.4494 | 0.4408 | 0.4708 | 0.4416 | 0.4090 | 0.3820 | 0.3685 | 0.4388 |
Multitask fine-tuning

| Task | Model | FUNSD | ZH | JA | ES | FR | IT | DE | PT | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| SER | xlm-roberta-base | 0.6633 | 0.883 | 0.7786 | 0.6223 | 0.7035 | 0.6814 | 0.7146 | 0.6726 | 0.7149 |
|  | infoxlm-base | 0.6538 | 0.8741 | 0.7855 | 0.5979 | 0.7057 | 0.6826 | 0.7055 | 0.6796 | 0.7106 |
|  | layoutxlm-base | 0.7924 | 0.8973 | 0.7964 | 0.7798 | 0.8173 | 0.821 | 0.8322 | 0.8241 | 0.8201 |
| RE | xlm-roberta-base | 0.3638 | 0.6797 | 0.6829 | 0.6828 | 0.6727 | 0.6937 | 0.6887 | 0.6082 | 0.6341 |
|  | infoxlm-base | 0.3699 | 0.6493 | 0.6473 | 0.6828 | 0.6831 | 0.6690 | 0.6384 | 0.5763 | 0.6145 |
|  | layoutxlm-base | 0.6671 | 0.8241 | 0.8142 | 0.8104 | 0.8221 | 0.8310 | 0.7854 | 0.7044 | 0.7823 |
If you find LayoutXLM useful in your research, please cite the following paper:
```bibtex
@article{Xu2020LayoutXLMMP,
  title = {LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding},
  author = {Yiheng Xu and Tengchao Lv and Lei Cui and Guoxin Wang and Yijuan Lu and Dinei Florencio and Cha Zhang and Furu Wei},
  year = {2021},
  eprint = {2104.08836},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}
```
The content of this project itself is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
For help or issues using LayoutXLM, please submit a GitHub issue.
For other communications related to LayoutXLM, please contact Lei Cui ([email protected]) or Furu Wei ([email protected]).