Multilingual track: ACL 60-60 initiative
Description
At ACL 2022, an ambitious 60-60 D&I Initiative was announced, targeting text and speech translation of the ACL Anthology and past recorded talks into 60 languages for the ACL’s 60th anniversary. Results of this ongoing effort will be shared with the community at ACL 2023 where IWSLT 2023 will be co-located. This track is a multilingual speech translation shared task evaluated on a subset of this data to involve the IWSLT community and larger community in this effort and spur conversations about related methodology and progress.
Data
This task is about speech translation by and for our field. Specifically, this track targets translation of oral presentations from past ACL events into several languages. Talks cover a variety of technical content by speakers from around the world.
- Evaluation data (development and test sets) consists of oral presentations from past ACL talks from the Anthology, with human post-edited transcripts and translations.
- Training data includes publicly available corpora and pretrained models.
- The source language and a subset of the target languages are shared with other talk translation tracks.
- Allowed training data is a superset of the data for all talk translation tracks: we include the same pretrained models and training corpora, with additional target languages.
- We encourage joint submissions across tracks to enable additional analysis and conference discussion!
Training data
Two training conditions are proposed. First is a constrained setting in which the allowed training data is limited to a medium-sized framework in order to keep the training time and resource requirements manageable. In order to allow participants to leverage existing multilingual models with medium-sized resources, particularly for this task where not all language pairs share similar amounts of public datasets, we propose a “constrained with large language models” condition, where a specific set of pretrained models is allowed to extend capabilities. We also encourage the participation of teams equipped with high computational power and additional resources to maximize performance on the task, and so an “unconstrained” setting without data restrictions is also proposed.
- Constrained with pretrained models: Under this condition, all the constrained resources plus a restricted selection of pretrained models are allowed. The following pretrained models are considered part of the training data and freely usable to build submission systems:
Constrained training data

| Data type | src lang | tgt lang | Training corpus (URL) | Version | Comment |
|---|---|---|---|---|---|
| speech | en | -- | LibriSpeech | v12 | |
| speech | en | -- | How2 | | |
| speech | en | -- | Mozilla Common Voice | v11.0 | |
| speech | en | -- | TED LIUM | V2/V3 | |
| speech | en | -- | Vox Populi | | |
| speech-to-text-parallel | en | all | MuST-C | v1.2/v2.0/v3.0 | (10) ar, zh, nl, fr, de, ja, fa, pt, ru, tr |
| speech-to-text-parallel | en | all | CoVoST | v2 | (10) ar, zh, nl, fr, de, ja, fa, pt, ru, tr |
| speech-to-text-parallel | en | all | Europarl-ST | v1.1 | (4) fr, de, pt, tr |
| text-parallel | en | all | Europarl | v10 | (2) fr, de |
| text-parallel | en | all | Europarl | v7 | (4) nl, fr, de, pt |
| text-parallel | en | all | NewsCommentary | v16 | (8) ar, zh, nl, fr, de, ja, pt, ru |
| text-parallel | en | all | OpenSubtitles | v2018 | (10) ar, zh, nl, fr, de, ja, fa, pt, ru, tr |
| text-parallel | en | de | TED2020 | v1 | (1) de |
| text-parallel | en | ja | JParaCrawl | | (1) ja |
| text-parallel | en | all | Tatoeba | v2022-03-03 | (10) ar, zh, nl, fr, de, ja, fa, pt, ru, tr |
| text-parallel | en | de | ELRC-CORDIS_News | v1 | (1) de |

- Unconstrained: Any resource (additional datasets or pretrained language models included) can be used, with the important exception of evaluation sets and any data from ACL 2022 not provided on this page.
Development data
To mimic realistic test conditions where talk audio would be provided as a single file rather than gold-segmented, we provide the full wav files along with automatically generated segments using SHAS as a baseline segmentation. The full wav files also enable research into alternative segmentation methods. To evaluate translation quality of system output under any input segmentation, we provide gold sentence-segmented transcripts and translations, against which system output can be scored after resegmentation following the steps below.
The development data is released here.
Evaluation data
The blind evaluation data follows the same format as above. References will be released after the eval period.
The evaluation data is released here.
Full Dataset with References
The full ACL 60-60 dataset with references is hosted on the ACL Anthology here.
If you use this data in your work, we ask that you please cite the dataset paper as below:
@inproceedings{salesky-etal-2023-evaluating,
title = "Evaluating Multilingual Speech Translation under Realistic Conditions with Resegmentation and Terminology",
author = "Salesky, Elizabeth and
Darwish, Kareem and
Al-Badrashiny, Mohamed and
Diab, Mona and
Niehues, Jan",
booktitle = "Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)",
month = jul,
year = "2023",
address = "Toronto, Canada (in-person and online)",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.iwslt-1.2",
pages = "62--78",
abstract = "We present the ACL 60/60 evaluation sets for multilingual translation of ACL 2022 technical presentations into 10 target languages. This dataset enables further research into multilingual speech translation under realistic recording conditions with unsegmented audio and domain-specific terminology, applying NLP tools to text and speech in the technical domain, and evaluating and improving model robustness to diverse speaker demographics.",
}
Languages
This task covers ten language pairs with English as the source language and ten 60-60 languages as target languages. With this number of target languages, participants are encouraged to pursue multilingual modeling and submit results to all pairs (as opposed to individual models for each language pair), though models of any type are allowed.
- Source language: English
- Target languages: Arabic, Chinese, Dutch, French, German, Japanese, Farsi, Portuguese, Russian, Turkish
- Public corpora are available for training on these language pairs (e.g. MuST-C)
Submission
Submissions should be compressed into a single .tar.gz file and emailed here.
Translation into all 10 target languages is expected for official ranking, though we also encourage submissions to a subset of language pairs, and strongly encourage all participants to also submit English ASR for analysis.
Submissions should consist of plaintext files for each language pair with one sentence per line, pre-formatted for scoring (detokenized!).
Multiple submissions are allowed! If multiple outputs are submitted, one system must be explicitly marked as primary, or the submission with the latest timestamp will be treated as primary.
File names should use the following structure:
<participant>.<constrained/unconstrained>.<primary/contrastive>.<src>-<tgt>.txt
e.g., jhu.unconstrained.primary.en-de.txt
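The naming scheme above can be checked mechanically before a submission is packaged. The following is a minimal sketch, not an official validator: the function name `is_valid_submission_name` is hypothetical, and the restrictions on the participant name (letters, digits, `_`, `-`) and on language codes (two or three lowercase letters) are assumptions made for illustration.

```shell
# Hypothetical helper: check a file name against the pattern
# <participant>.<constrained/unconstrained>.<primary/contrastive>.<src>-<tgt>.txt
# Assumption: participant is [A-Za-z0-9_-]+ and language codes are 2-3 lowercase letters.
is_valid_submission_name() {
  printf '%s\n' "$1" | grep -Eq \
    '^[A-Za-z0-9_-]+\.(constrained|unconstrained)\.(primary|contrastive)\.[a-z]{2,3}-[a-z]{2,3}\.txt$'
}

# Example: report whether the sample name from this page matches.
if is_valid_submission_name "jhu.unconstrained.primary.en-de.txt"; then
  echo "valid"
else
  echo "invalid"
fi
```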
For analysis, participants should specify in the submission email whether their submission uses multilingual models and whether it is end-to-end or cascaded. Training data and any pretrained models used should also be specified in the submission email; if data or pretrained models beyond the allowed list are used, the system should be marked unconstrained and will be ranked separately.
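Since official ranking expects output for all 10 target languages, it may help to check a submission directory for missing language pairs before creating the archive. This is a sketch under assumptions: the helper name `missing_targets`, the directory layout, and the archive name are hypothetical, and only primary outputs named as in the structure above are checked.

```shell
# Target languages of this track, using the codes from the training data table.
TARGETS="ar zh nl fr de ja fa pt ru tr"

# Hypothetical helper: print each target language for which directory $1 has no
# primary output file for participant $2 under condition $3.
missing_targets() {
  dir="$1"; participant="$2"; condition="$3"
  for tgt in $TARGETS; do
    [ -f "$dir/$participant.$condition.primary.en-$tgt.txt" ] || printf '%s\n' "$tgt"
  done
}

# Example usage (directory, participant, and archive names are placeholders):
#   missing_targets submission jhu unconstrained
#   tar -czf jhu.tar.gz -C submission .
```

An empty result means every target language has a primary output file; the `tar` command then produces the single .tar.gz file to send.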
Evaluation
System output will be evaluated using multiple metrics for analysis: translation output using chrF, BLEU, and recent neural metrics, and ASR output using WER. Translation metrics will be calculated with case and punctuation retained. WER will be computed on lowercased text with punctuation removed. Official metric scores will be calculated after automatic resegmentation of the hypothesis by mwerSegmenter, based on the reference transcripts (ASR) or translations (MT).
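To preview how ASR output might look after the WER normalization described above, the text can be lowercased and stripped of punctuation with standard tools. This is only an approximation: the helper name `normalize_for_wer` is hypothetical, it handles ASCII punctuation only (the POSIX `[:punct:]` class), and the exact normalization applied for official scoring is not specified here and may differ.

```shell
# Hypothetical helper: lowercase text and remove punctuation, reading stdin.
# Assumption: ASCII punctuation only; official normalization may treat
# apostrophes, hyphens, or non-ASCII marks differently.
normalize_for_wer() {
  tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | tr -s ' '
}

printf 'Hello, World! It works.\n' | normalize_for_wer
# prints: hello world it works
```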
Ranking
The official task ranking will be based on the average chrF across the 10 translation language pairs, calculated by SacreBLEU. If a submission does not include a language pair, it will receive 0 for that pair. ASR will be evaluated separately, though it is strongly encouraged to submit ASR output as well. We will provide human evaluation for language pairs where available; if we are able to provide human evaluation for all 10 languages, average human system ranking will be the official task ranking.
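The averaging rule above, including the 0 assigned to missing language pairs, can be sketched as follows. The helper name `average_chrf` and the input format (one `<tgt> <chrF>` pair per line) are assumptions for illustration; this is not the official scoring script.

```shell
# Hypothetical helper: average chrF over the 10 target languages of this track.
# Input file $1 (assumed format): one "<tgt> <chrF>" pair per line.
# A target language absent from the file contributes 0 to the average.
average_chrf() {
  awk -v targets="ar zh nl fr de ja fa pt ru tr" '
    { score[$1] = $2 }
    END {
      n = split(targets, t, " ")
      for (i = 1; i <= n; i++) total += score[t[i]]   # unset entries count as 0
      printf "%.2f\n", total / n
    }' "$1"
}
```

For example, a file containing only `de 50.0` and `fr 30.0` gives an average of 8.00 rather than 40.00, which shows how strongly missing pairs affect the ranking score.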
Metrics
To compute official metrics, first download and install mwerSegmenter following the instructions in mwerSegmenter/README.
Then, install SacreBLEU.
wget https://www-i6.informatik.rwth-aachen.de/web/Software/mwerSegmenter.tar.gz
tar -zxvf mwerSegmenter.tar.gz
# set up following mwerSegmenter/README
pip install sacrebleu
Then, given raw text translation output, run mwerSegmenter to segment it to match the reference, and evaluate with SacreBLEU:
# example: en-de
tgt=de
src=IWSLT.ACLdev2023/text/IWSLT.ACL.ACLdev2023.en-xx.en.xml
ref=IWSLT.ACLdev2023/text/IWSLT.ACL.ACLdev2023.en-xx.${tgt}.xml
out=outs/IWSLT.ACLdev2023.en-${tgt}.hyp
sys=baseline
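# extract the reference segments from the XML as plain text, one per line
grep "<seg id" ${ref} | sed -e "s/<[^>]*>//g" > ${ref%.xml}.txt
# resegment the system output to match the reference segmentation (requires python2)
mwerSegmenter/segmentBasedOnMWER.sh ${src} ${ref} ${out} ${sys} ${tgt} ${out}.sgm no_normalize 1
# drop the SGML tag lines, keeping only the resegmented text
sed -e '/^<\/\?seg\|^<\/\?doc\|^<\/\?tstset/d' ${out}.sgm > ${out}.final
# switch to a python3 environment and score with chrF
conda activate py3
sacrebleu ${ref%.xml}.txt -i ${out}.final -m chrf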
Note: unfortunately mwerSegmenter requires python2, and sacrebleu requires python3. You may need to switch environments between steps as shown.
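Before scoring, it can be useful to confirm that the resegmented hypothesis and the plain-text reference have the same number of lines. The check below is a sketch with a hypothetical helper name, and it assumes the resegmented output contains exactly one line per reference segment.

```shell
# Hypothetical helper: succeed only if the two files have the same number of lines.
same_line_count() {
  [ "$(wc -l < "$1")" -eq "$(wc -l < "$2")" ]
}

# Example with the variables from the commands above:
#   same_line_count "${ref%.xml}.txt" "${out}.final" || echo "segment count mismatch" >&2
```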
Notes on Metric Tokenizers
We use chrF as the primary metric which enables use of the same metric for all target languages.
For some languages, in particular those that do not mark word boundaries with whitespace (Chinese, Japanese, Korean), language-specific tokenization is recommended when calculating BLEU.
Similarly, mwerSegmenter uses whitespace and segment boundaries for resegmentation, which for these languages may require character-level tokenization or language-specific tokenization.
We will use the language-specific tokenizers recommended in sacrebleu (zh, ja-mecab, ko-mecab) for Chinese, Japanese, and Korean; note, though, that BLEU will be an unofficial metric.
For all other languages we will use default metric tokenization (13a in sacrebleu, XLM-R tokenization for COMET).
For chrF, language-specific tokenization should not change the score.
It is important that you submit detokenized ASR and MT outputs so that metric tokenization can be applied appropriately.
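The tokenizer choices described above can be expressed as a small lookup for participants who want to compute unofficial BLEU scores themselves. The helper name `bleu_tokenizer_for` and the file paths in the usage comment are hypothetical; `--tokenize` is the sacrebleu command-line option for selecting a tokenizer, and the MeCab-based tokenizers need additional dependencies that sacrebleu provides as optional installs.

```shell
# Hypothetical helper: pick the sacrebleu BLEU tokenizer for a target language code,
# following the choices described in this section.
bleu_tokenizer_for() {
  case "$1" in
    zh) echo "zh" ;;
    ja) echo "ja-mecab" ;;
    ko) echo "ko-mecab" ;;
    *)  echo "13a" ;;
  esac
}

# Example (paths are placeholders; BLEU is an unofficial metric for this task):
#   sacrebleu ref.ja.txt -i hyp.ja.txt -m bleu --tokenize "$(bleu_tokenizer_for ja)"
```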
Organizers
- Elizabeth Salesky (JHU)
- Jan Niehues (KIT)
- Mona Diab (Meta)
Contact
Chairs: [email protected]
Discussion: [email protected]