The Second Ukrainian NLP workshop (UNLP 2023) organizes the first Shared Task in Grammatical Error Correction for Ukrainian.
Remove some fluency edits in gec-only
Fix some error types in .m2 files
Fix sentence alignment in some rare cases
Add valid.tgt.{txt,tok}
Numerous fixes in whitespace before/after punctuation -- contributed by @danmysak
Removed the second annotator's annotations from
as they were causing confusion. These annotations can still be found in the original UA-GEC data or in train.m2 if you need them.
In this shared task, your goal is to correct a text in the Ukrainian language to make it grammatical or both grammatical and fluent.
There are two tracks in this shared task: GEC-only and GEC+Fluency. It is not mandatory to participate in both subtasks, i.e., participating in either GEC-only or GEC+Fluency is acceptable.
Grammatical error correction (GEC) is the task of automatically detecting and correcting grammatical errors in written text. GEC is typically limited to making a minimal set of grammar, spelling, and punctuation edits so that the text becomes error-free.
For example:
Input: Я йти до школи.
Output: Я ходжу до школи.
English translation
Input: I goes to school.
Output: I go to school.
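To make "a minimal set of edits" concrete, the sketch below derives token-level edits between the input and output sentences from the example above. It relies on Python's standard `difflib` and plain whitespace splitting, which are assumptions made for illustration only; the official data is tokenized with Stanza and aligned with Errant, and `token_edits` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def token_edits(source: str, target: str):
    """Return (start, end, replacement) token-level edits that turn source into target.

    start/end are token offsets into the source sentence (end is exclusive).
    Whitespace tokenization is an illustrative simplification.
    """
    src, tgt = source.split(), target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replace / insert / delete operations
            edits.append((i1, i2, " ".join(tgt[j1:j2])))
    return edits


print(token_edits("Я йти до школи.", "Я ходжу до школи."))
# [(1, 2, 'ходжу')]
```

A single edit replacing one token is the kind of minimal correction the GEC-only track expects.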
Fluency correction is an extension of GEC that allows for broader sentence rewrites to make a text more fluent—i.e., sounding natural to a native speaker. Fluency corrections are more subjective and may be harder to predict.
The GEC+Fluency task covers corrections for grammar, spelling, punctuation, and fluency.
For example:
Input: Існуючі ціни дуже високі.
Output: Теперішні ціни дуже високі.
English translation
Input: Existing prices are very high.
Output: Current prices are very high.
We suggest using UA-GEC for training. You can find the original dataset and its description here.
We provide a preprocessed version of UA-GEC for your convenience. The two main dirs for the two tracks are:
Each of these folders contains a similar set of files:
-- original (uncorrected) texts
train.src.tok -- original (uncorrected) texts, tokenized with Stanza
train.tgt.txt -- corrected text, untokenized and tokenized versions
train.m2 -- train and validation data annotated in the M2 format
The M2 format is the same as the one used in the CoNLL-2013 and BEA-2019 shared tasks. You don't have to work with it directly, although you can.
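If you do want to read the M2 files directly, a minimal parsing sketch could look like the following. It assumes the usual M2 layout: an `S` line with the tokenized sentence, followed by `A` lines of the form `start end|||type|||correction|||REQUIRED|||-NONE-|||annotator_id`, with sentences separated by blank lines. The `Edit` tuple, the `parse_m2` function, and the `Grammar` error label in the sample are hypothetical names chosen for illustration, not part of the provided scripts.

```python
from typing import NamedTuple


class Edit(NamedTuple):
    start: int        # token offset where the edit begins
    end: int          # token offset where the edit ends (exclusive)
    error_type: str
    correction: str
    annotator: int


def parse_m2(text: str):
    """Parse M2-formatted text into a list of (tokens, edits) pairs, one per sentence."""
    sentences = []
    for block in text.strip().split("\n\n"):
        if not block.strip():
            continue  # skip empty input
        tokens, edits = [], []
        for line in block.splitlines():
            if line.startswith("S "):
                tokens = line[2:].split()
            elif line.startswith("A "):
                fields = line[2:].split("|||")
                start, end = map(int, fields[0].split())
                edits.append(Edit(start, end, fields[1], fields[2], int(fields[-1])))
        sentences.append((tokens, edits))
    return sentences


# Illustrative sample; the error type label is made up for this sketch.
SAMPLE = """S Я йти до школи .
A 1 2|||Grammar|||ходжу|||REQUIRED|||-NONE-|||0
"""

for tokens, edits in parse_m2(SAMPLE):
    print(tokens, edits)
```

Keeping the annotator id on each edit makes it easy to group references per annotator later.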
We encourage you to use any external data of your choice.
You can also prepare your own pre-processed version of UA-GEC if you want to.
In addition, feel free to submit corrections to the provided data via a pull request if you happen to find an error.
The validation data provided with the shared task can be used for model selection.
The final model will be evaluated on a hidden test set. We will release the uncorrected test files, including test.src.tok, to the registered participants on February 13, 2023. Please fill in this form to register for participation.
We annotated the test set with multiple annotators in order to somewhat compensate for the subjectivity of the task. If a correction is in agreement with at least one annotator, it will be counted as a valid one.
We provide a script that you can use for evaluation on the validation data.
- Install the requirements:
pip install -r requirements.txt
- Run your model on the validation source file (use the tokenized version if your model expects tokenized data). Your model should produce a corrected output. Suppose you saved your output to a file called valid.tgt.txt.
- Run the evaluation script on your output file:
scripts/ valid.tgt.txt --m2 ./data/gec-only/valid.m2
If your output is already tokenized, add the --no-tokenize switch to the command above.
Under the hood, the script tokenizes your output with Stanza (unless --no-tokenize is provided) and calls Errant to do all the heavy lifting.
The script should give you an output like this:
=========== Span-Based Correction ============
TP      FP      FN      Prec    Rec     F0.5
107     18166   2044    0.0059  0.0497  0.0071
=========== Span-Based Detection =============
TP      FP      FN      Prec    Rec     F0.5
873     17393   1813    0.0478  0.325   0.0576
Correction F0.5 is the primary metric used to compare models. In order to get a true positive (TP), your edit must match at least one of the annotators' edits exactly -- both span and the suggested text.
To get a TP in span-based detection, it is enough to correctly identify erroneous tokens. The actual correction doesn't matter here.
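The relationship between the counts and the reported scores can be sketched as follows. This is an illustration of the definitions above, not the evaluation script itself: `f_beta`, `is_correction_tp`, and `is_detection_tp` are hypothetical helper names, edits are assumed to be `(start, end, text)` tuples, and the handling of empty denominators is an assumption. Errant's actual treatment of multiple references may differ in detail.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5):
    """Return (precision, recall, F_beta) computed from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumption: no predictions -> 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0     # assumption: nothing to find -> 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    return precision, recall, (1 + b2) * precision * recall / (b2 * precision + recall)


def is_correction_tp(edit, references):
    """Correction TP: span and suggested text both match an edit of some annotator."""
    return any(edit in ref for ref in references)


def is_detection_tp(edit, references):
    """Detection TP: only the span has to match an edit of some annotator."""
    return any(edit[:2] == ref_edit[:2] for ref in references for ref_edit in ref)


# Counts taken from the sample output above.
print([round(x, 4) for x in f_beta(107, 18166, 2044)])  # [0.0059, 0.0497, 0.0071]
print([round(x, 4) for x in f_beta(873, 17393, 1813)])  # [0.0478, 0.325, 0.0576]

# One list of edits per annotator.
references = [[(1, 2, "ходжу")], [(1, 2, "іду")]]
print(is_correction_tp((1, 2, "іду"), references))    # True
print(is_correction_tp((1, 2, "йшов"), references))   # False
print(is_detection_tp((1, 2, "йшов"), references))    # True
```

With beta set to 0.5, precision is weighted more heavily than recall, so a system that proposes many wrong edits is penalized more than one that misses some errors.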
Teams that intend to participate should register by filling in this form.
We will release the test set (uncorrected texts only), along with the instructions on how to submit your model's output, to the registered participants on February 13, 2023.
The code of the participating systems should be openly published.
Participants in the shared task are invited to submit a paper to the UNLP 2023 workshop. Please see the UNLP website for details. Accepted papers will appear in the ACL anthology and will be presented at a session of UNLP 2023 specially dedicated to the Shared Task.
Submitting a paper is not mandatory for participating in the Shared Task.
December 20, 2022 — Shared task announcement
February 12, 2023 — Registration deadline
February 13, 2023 — Release of test data to registered participants
February 20, 2023 — Shared Task paper submission
March 10, 2023 — Submission of system responses
March 13, 2023 — Notification of acceptance
March 14, 2023 — Results of the Shared Task announced to participants
March 27, 2023 — Camera-ready Shared Task papers due
May 2 or 6, 2023 — Workshop dates
Telegram group:
Email: [email protected]
- UNLP 2023 Call for Papers
- Syvokon, O., & Nahorna, O. (2021). UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language. arXiv.