MDN_Web_Docs

2023-09-25

NLLB

2023-09-07

Liv4ever and ELITR-ECA

2021-12-08

CCMatrix

2021-06-28

Updated: ParaCrawl and MultiParaCrawl

2021-06-11

New: MT560 dataset

2021-04-02

GoURMET and MIZAN

2020-11-27

EuroPat and tico-19

2020-10-31

OPUS-100 corpus

2020-06-30

ELRC public

2020-05-22

MultiParaCrawl

2019-10-16

Infopankki v1

2019-10-14

New corpus: memat (Xhosa/English)

2018-10-06

New corpora: ParaCrawl, XhosaNavy

2018-02-15

New version: OpenSubtitles2018

2017-11-06

An overview of the OPUS collection

1,210 corpora

45,945,946,108 total sentence pairs

744 languages available

This table displays 98 corpora, which make up a total of 93.40% of the entire OPUS collection.

| Corpus | Sentences | % of OPUS |
| --- | --- | --- |
| NLLB | 13B | 28.31 |
| CCMatrix | 11B | 23.64 |
| OpenSubtitles | 8.5B | 18.53 |
| MultiCCAligned | 2.2B | 4.87840 |
| ParaCrawl | 1.5B | 3.26229 |
| DGT | 1.1B | 2.37845 |
| XLEnt | 883M | 1.92148 |
| MultiParaCrawl | 789M | 1.71653 |
| LinguaTools-WikiTitles | 487M | 1.06082 |
| CCAligned | 439M | 0.95442 |
| UNPC | 323M | 0.70381 |
| EUbookshop | 279M | 0.60726 |
| EMEA | 243M | 0.52879 |
| GNOME | 225M | 0.48911 |
| KDE4 | 201M | 0.43720 |
| Europarl | 186M | 0.40509 |
| MultiUN | 159M | 0.34712 |
| JRC-Acquis | 147M | 0.32053 |
| TED2020 | 143M | 0.31135 |
| TildeMODEL | 128M | 0.27855 |
| WikiMatrix | 127M | 0.27681 |
| QED | 122M | 0.26597 |
| HPLT | 96M | 0.20879 |
| EuroPat | 89M | 0.19392 |
| bible-uedin | 85M | 0.18574 |
| MultiHPLT | 77M | 0.16845 |
| NeuLab-TedTalks | 74M | 0.16129 |
| Samanantar | 50M | 0.10833 |
| Tanzil | 42M | 0.09196 |
| MultiMaCoCu | 28M | 0.05988 |
| JParaCrawl | 26M | 0.05621 |
| MaCoCu | 26M | 0.05558 |
| wikimedia | 23M | 0.04988 |
| giga-fren | 23M | 0.04901 |
| ELITR-ECA | 20M | 0.04390 |
| Anuvaad | 18M | 0.03945 |
| StanfordNLP-NMT | 16M | 0.03466 |
| ECB | 15M | 0.03339 |
| Wikipedia | 13M | 0.02819 |
| SETIMES | 8.8M | 0.01916 |
| Tatoeba | 8.7M | 0.01894 |
| DOGC | 8.5M | 0.01844 |
| GlobalVoices | 7.3M | 0.01579 |
| News-Commentary | 6.4M | 0.01399 |
| PHP | 6.1M | 0.01335 |
| MBS | 5M | 0.01094 |
| SciELO | 3.8M | 0.008214955 |
| Finlex | 3.1M | 0.006777834 |
| infopankki | 2.9M | 0.006386633 |
| JESC | 2.8M | 0.006088435 |
| fiskmo | 2.1M | 0.004570588 |
| ParIce | 2.1M | 0.004564107 |
| GoURMET | 2.1M | 0.004561434 |
| EUconst | 2.1M | 0.004494610 |
| OpenOffice | 2M | 0.004455240 |
| EOPC | 2M | 0.004383612 |
| TED2013 | 1.9M | 0.004145450 |
| EhuHac | 1.8M | 0.003912167 |
| pmindia | 1.7M | 0.003691477 |
| SUMMA | 1.6M | 0.003427095 |
| IITB | 1.6M | 0.003386009 |
| Books | 1.3M | 0.002721964 |
| ChuBiCo | 1.2M | 0.002616150 |
| CAPES | 1.2M | 0.002519504 |
| Joshua-IPC | 1.1M | 0.002403668 |
| MIZAN | 1M | 0.002223476 |
| SCB_MT_EN_TH | 988K | 0.002150919 |
| MDN_Web_Docs | 874K | 0.001903310 |
| ECDC | 683K | 0.001487596 |
| Elhuyar | 642K | 0.001398049 |
| EiTB-ParCC | 637K | 0.001386808 |
| TEP | 612K | 0.001332187 |
| KDEdoc | 610K | 0.001328570 |
| WMT-News | 447K | 0.000973727 |
| KFTT | 440K | 0.000958270 |
| tico-19 | 319K | 0.000695130 |
| tldr-pages | 258K | 0.000561061 |
| memat | 155K | 0.000336837 |
| hrenWaC | 99K | 0.000215473 |
| TedTalks | 86K | 0.000187934 |
| FFR | 82K | 0.000178764 |
| SPC | 68K | 0.000147478 |
| MontenegrinSubs | 65K | 0.000141562 |
| OfisPublik | 63K | 0.000138036 |
| XhosaNavy | 50K | 0.000108782 |
| Bianet | 48K | 0.000105552 |
| WikiSource | 33K | 0.000072439 |
| ALT | 18K | 0.000039370 |
| Salome | 9.4K | 0.000020515 |
| sardware | 6.2K | 0.000013385 |
| ada83 | 4.1K | 0.000008971 |
| RF | 1.2K | 0.000002516 |
| komi | Not specified | Not specified |
| liv4ever | Not specified | Not specified |
| Mozilla-I10n | Not specified | Not specified |
| Nunavut_Hansard | Not specified | Not specified |
| Ubuntu | Not specified | Not specified |
| WikiTitles | Not specified | Not specified |
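
The percentage column can be roughly reproduced from the sentence counts and the total of 45,945,946,108 sentence pairs given in the overview. The sketch below is only an illustration, not how the OPUS site computes its figures: the helper names `parse_count` and `share_of_opus` are hypothetical, and it assumes the K/M/B suffixes are decimal thousands, millions and billions. Because the counts in the table are rounded, the results will only approximately match the listed percentages.

```python
# Total sentence pairs reported in the overview of the collection.
TOTAL_SENTENCE_PAIRS = 45_945_946_108

# Assumed meaning of the abbreviated suffixes used in the table.
SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text: str) -> float:
    """Convert an abbreviated count such as '143M' or '8.5B' to a number."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


def share_of_opus(count_text: str, total: int = TOTAL_SENTENCE_PAIRS) -> float:
    """Return the approximate percentage of the collection for one corpus."""
    return 100.0 * parse_count(count_text) / total


if __name__ == "__main__":
    # A few rows taken from the table; output is approximate because
    # the sentence counts shown on the page are rounded.
    for name, count in [("NLLB", "13B"), ("TED2020", "143M"), ("tico-19", "319K")]:
        print(f"{name}: {share_of_opus(count):.5f}%")
```

Rows listed as "Not specified" have no count to parse and would need to be skipped before calling these helpers.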

OUR CONTRIBUTORS

NLPL, University of Helsinki, CSC, HPLT, Lets MT