Parallel data
Parallel data for training machine translation
Parallel data or parallel corpora are data sets of translation pairs – sentences and their translations. They are used to train and test machine translation models.
Original | Translation |
---|---|
File | Archivo |
Parallel data sets can include translations for one or more language pairs, and can be directional (it is known which side is the original) or directionless.
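As a rough illustration, a translation pair like the one in the table above could be stored as a small record that carries both language codes. This is a minimal sketch, not a standard format: the `SentencePair` type, the `parse_tsv` and `swap` helpers, the tab-separated layout, and the `en`/`es` language codes are all assumptions made for the example.

```python
from typing import Iterable, List, NamedTuple


class SentencePair(NamedTuple):
    """One translation pair (hypothetical record type for this sketch)."""
    source_lang: str
    target_lang: str
    source: str
    target: str


def parse_tsv(lines: Iterable[str], source_lang: str, target_lang: str) -> List[SentencePair]:
    """Read 'source<TAB>target' lines into sentence pairs, skipping malformed lines."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # blank line or no tab separator
        source, target = line.split("\t", 1)
        pairs.append(SentencePair(source_lang, target_lang, source, target))
    return pairs


def swap(pair: SentencePair) -> SentencePair:
    """Return the same pair read in the opposite direction."""
    return SentencePair(pair.target_lang, pair.source_lang, pair.target, pair.source)


pairs = parse_tsv(["File\tArchivo\n"], source_lang="en", target_lang="es")
reversed_pair = swap(pairs[0])
```

Swapping is harmless for a directionless data set. For a data set where the original side is known, the swapped pair has translated text on the source side, which may or may not be acceptable depending on the use.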
Creation
Parallel data sets can be created manually, automatically, or synthetically from monolingual data.
- Human translation
- Human post-editing
- Crawling
- Alignment
Parallel data can be created by crawling and aligning monolingual text, and by back-translation or back-copying.
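The two synthetic approaches can be sketched as follows, assuming that back-translation means translating target-language monolingual text back into the source language with an existing model, and that back-copying means reusing the target-language text unchanged as the source side. The function names are hypothetical, and `toy_reverse_translate` is only a word-lookup stand-in for a real target-to-source translation model.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source, target)


def back_translate(target_sentences: Iterable[str],
                   reverse_translate: Callable[[str], str]) -> List[Pair]:
    """Pair each target sentence with a machine-generated source sentence."""
    return [(reverse_translate(t), t) for t in target_sentences]


def back_copy(target_sentences: Iterable[str]) -> List[Pair]:
    """Pair each target sentence with an unchanged copy of itself."""
    return [(t, t) for t in target_sentences]


def toy_reverse_translate(sentence: str) -> str:
    """Stand-in for a real Spanish-to-English model: word-by-word lookup."""
    lexicon = {"archivo": "file", "nuevo": "new"}
    return " ".join(lexicon.get(word.lower(), word) for word in sentence.split())


monolingual_es = ["Archivo nuevo"]
synthetic = back_translate(monolingual_es, toy_reverse_translate)
copied = back_copy(monolingual_es)
```

In both cases the target side is genuine text and only the source side is synthetic, so the target side stays as clean as the monolingual data it came from.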
Goals
Parallel data is used to train statistical and neural machine translation engines.
Challenges
Parallel data is available for the most widely written language pairs, but it is scarce or unavailable for many other language pairs.
Parallel data can contain errors, such as misaligned sentences, bad sentence segmentation, bad encodings, and wrong or mixed languages. These errors are a challenge because they degrade the quality of the machine translation output. Many of them can be removed by filtering.
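A few simple filtering heuristics can be sketched as below. This is a minimal example under stated assumptions, not a complete cleaning pipeline: the `keep_pair` and `filter_pairs` names and the thresholds (`max_ratio`, `max_chars`) are illustrative choices. It also does not detect wrong or mixed languages, which would need a separate language identification step.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (source, target)


def keep_pair(source: str, target: str,
              max_ratio: float = 3.0, max_chars: int = 1000) -> bool:
    """Return True if the pair passes some basic sanity checks."""
    if not source or not target:
        return False  # one side is empty
    if "\ufffd" in source or "\ufffd" in target:
        return False  # Unicode replacement character, a sign of bad decoding
    if len(source) > max_chars or len(target) > max_chars:
        return False  # overly long segment, often a segmentation failure
    longer = max(len(source), len(target))
    shorter = min(len(source), len(target))
    if longer / shorter > max_ratio:
        return False  # very different lengths, often a misalignment
    return True


def filter_pairs(pairs: Iterable[Pair]) -> List[Pair]:
    """Strip whitespace, drop exact duplicates, and drop pairs that fail keep_pair."""
    seen = set()
    kept = []
    for source, target in pairs:
        key = (source.strip(), target.strip())
        if key in seen or not keep_pair(*key):
            continue
        seen.add(key)
        kept.append(key)
    return kept


raw = [
    ("File", "Archivo"),
    ("File", "Archivo"),   # exact duplicate
    ("", "Archivo"),       # empty source
    ("Open", "Abrir el archivo seleccionado en una ventana nueva"),  # length mismatch
    ("Save\ufffd", "Guardar"),  # encoding damage
]
clean = filter_pairs(raw)
```

A character-length ratio is a crude signal. It can reject valid pairs for very short segments or for language pairs whose scripts differ greatly in length, so the threshold would need tuning for each language pair.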
Open data sets
Many of the largest data sets are publicly available.
Name | Type |
---|---|
OPUS | Data repository |
CCAligned | Data repository |
CCMatrix | Data set |
Clarin | Data repository |
Europarl | Data set |
FLORES | Data set |
Hansard | Data set |
JESC | Data set |
MaCoCu | Data set |
Mozilla Common Voice | Data set |
OpenSubtitles | Data repository |
ParaCrawl | Data repository |
VoxPopuli | Data set |
WikiMatrix | Data set |
WikiTitles | Data set |
NTREX | Data repository |