This repository uses the CoNLL-U evaluation script available at https://github.com/ufal/conll2017 (the script used in the CoNLL 2017 Shared Task, explained at http://universaldependencies.org/conll17/evaluation.html) to compare the accuracy of UDPipe models and spaCy models which are trained on the same treebanks.
- It does this evaluation for French (UD_French-Sequoia treebank), Dutch (UD_Dutch treebank), Spanish (UD_Spanish-Ancora treebank), Portuguese (UD_Portuguese treebank), Italian (UD_Italian treebank) and English (UD_English treebank)
- The UDPipe and spaCy models used for the evaluation were trained on the same treebanks from www.universaldependencies.org, except for English: the UDPipe model was trained on the UD_English treebank, while the spaCy model was trained on the OntoNotes treebank. All models were trained on the training set of the respective treebanks, while the evaluation shown below is executed on the held-out test set.
- For the spaCy models, we took the models currently available for download as in `python -m spacy download es` (spaCy version 2.0.7); these were built on version 2.0 of the UD treebanks. For the UDPipe models, we used the official models provided by the UDPipe authors, trained on 2017-08-01 on version 2.0-test of the treebanks (models available at https://github.com/jwijffels/udpipe.models.ud.2.0). The exception is English, for which we built a model on 2018-01-11 on the newest version 2.1 of the UD_English treebank (code for this is openly available at https://github.com/bnosac/udpipe.models.ud).
- Code was run on 2018-02-11 on an Ubuntu Linux machine with LANG nl_BE.UTF-8. If you want to reproduce this, just run the R script udpipe-spacy-comparison.R (a minimal sketch of the main steps is shown below).
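
As a rough illustration of what such a run involves, the sketch below generates UDPipe predictions with the udpipe R package and scores them with the CoNLL17 evaluation script. It is a minimal sketch, not the contents of udpipe-spacy-comparison.R: the model file name and the file names `test_sentences.txt`, `gold.conllu` and `predictions_udpipe.conllu` are assumptions for illustration.

```r
## Minimal sketch, assuming the udpipe R package; file and model names are
## illustrative and do not necessarily match udpipe-spacy-comparison.R.
library(udpipe)

## Load an official UD 2.0 model (file assumed to be downloaded from
## https://github.com/jwijffels/udpipe.models.ud.2.0)
model <- udpipe_load_model(file = "french-sequoia-ud-2.0-170801.udpipe")

## Annotate the raw text of the held-out test set: tokenisation, tagging,
## lemmatisation and dependency parsing are all done by the model itself
txt  <- readLines("test_sentences.txt", encoding = "UTF-8")
anno <- udpipe_annotate(model, x = txt)

## udpipe_annotate() returns the annotation in CoNLL-U format in $conllu
cat(anno$conllu, file = "predictions_udpipe.conllu", sep = "\n")

## Score the predictions against the gold test set with the CoNLL17 script
system("python evaluation_script/conll17_ud_eval.py -v gold.conllu predictions_udpipe.conllu")
```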
Below, the output of the CoNLL17 evaluation script is reported for the UDPipe and spaCy models. The most commonly used results are those in the AligndAcc column, which indicate accuracies given gold tokenisation: in other words, if we know the tokenisation, how good would the parts-of-speech tagging, morphological feature tagging and dependency parsing be.
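
As a toy illustration of how the columns relate to each other (the numbers here are made up, not taken from the tables below): precision and recall compare the correct predictions against the number of system words and gold words respectively, while AligndAcc only considers the system words that could be aligned to a gold word, i.e. it assumes the tokenisation is known.

```r
## Made-up counts, purely to illustrate the metric definitions for e.g. UPOS
gold_words    <- 100  # words in the gold standard test file
system_words  <- 98   # words produced by the model's own tokeniser
aligned_words <- 95   # system words that could be aligned to a gold word
correct_upos  <- 90   # aligned words whose UPOS tag matches the gold tag

precision   <- correct_upos / system_words   # 0.92
recall      <- correct_upos / gold_words     # 0.90
f1          <- 2 * precision * recall / (precision + recall)  # 0.91
aligned_acc <- correct_upos / aligned_words  # 0.95: accuracy given gold tokenisation
```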
The following graphs compare the UDPipe and spaCy models using the evaluation script of the CoNLL 2017 Shared Task. They show the word-aligned accuracies of the different NLP tasks as well as the F1 measure.
You can look at the numbers below, but the AligndAcc metrics seem to give the following conclusions:
- French (UD_French-Sequoia treebank): UDPipe gives better accuracies than spaCy on all accuracy metrics
- Dutch (UD_Dutch treebank): UDPipe is better than spaCy on UPOS tagging and lemmatisation, XPOS is similar between the two, and spaCy is better on dependency parsing
- Spanish (UD_Spanish-Ancora treebank): UDPipe gives better accuracies than spaCy on parts-of-speech tagging and lemmatisation, while they are pretty much on par regarding dependencies
- Portuguese (UD_Portuguese treebank): UDPipe gives better accuracies than spaCy on all accuracy metrics
- Italian (UD_Italian treebank): UDPipe gives better accuracies than spaCy on all accuracy metrics
- English (UD_English treebank): UDPipe gives better accuracies than spaCy on universal and Penn Treebank based parts-of-speech tagging, but the spaCy model was built on the OntoNotes treebank while the UDPipe model was trained on the training set of the UD_English treebank, so this might just as well be caused by general treebank differences
Evaluation data from https://github.com/UniversalDependencies/UD_French-Sequoia release 2.0-test
Notes: This treebank does not contain XPOS tags, so the XPOS measures are irrelevant.
> system("python evaluation_script/conll17_ud_eval.py -v gold.conllu predictions_udpipe.conllu")
Metrics | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 99.84 | 99.85 | 99.84 |
Sentences | 92.00 | 95.83 | 93.88 |
Words | 98.90 | 99.36 | 99.13 |
UPOS | 95.81 | 96.26 | 96.03 | 96.88
XPOS | 98.90 | 99.36 | 99.13 | 100.00
Feats | 94.95 | 95.39 | 95.17 | 96.00
AllTags | 93.88 | 94.32 | 94.10 | 94.92
Lemmas | 96.62 | 97.07 | 96.85 | 97.70
UAS | 83.77 | 84.16 | 83.96 | 84.70
LAS | 81.19 | 81.57 | 81.38 | 82.09
CLAS | 77.25 | 76.82 | 77.03 | 76.98
> system("python evaluation_script/conll17_ud_eval.py -v gold.conllu predictions_spacy.conllu")
Metrics | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 97.53 | 98.80 | 98.16 |
Sentences | 78.49 | 88.82 | 83.33 |
Words | 94.41 | 92.69 | 93.54 |
UPOS | 90.84 | 89.18 | 90.00 | 96.22
XPOS | 94.41 | 92.69 | 93.54 | 100.00
Feats | 89.94 | 88.30 | 89.11 | 95.27
AllTags | 88.87 | 87.25 | 88.06 | 94.14
Lemmas | 80.21 | 78.75 | 79.47 | 84.96
UAS | 77.38 | 75.97 | 76.67 | 81.96
LAS | 74.12 | 72.77 | 73.43 | 78.51
CLAS | 71.35 | 71.76 | 71.56 | 73.65
Evaluation data from https://github.com/UniversalDependencies/UD_Dutch release 2.0-test
Notes: None
> system("python evaluation_script/conll17_ud_eval.py -v gold.conllu predictions_udpipe.conllu")
Metrics | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 99.85 | 99.83 | 99.84 |
Sentences | 95.30 | 97.66 | 96.47 |
Words | 99.85 | 99.83 | 99.84 |
UPOS | 91.74 | 91.71 | 91.72 | 91.87
XPOS | 88.71 | 88.68 | 88.69 | 88.84
Feats | 89.82 | 89.80 | 89.81 | 89.95
AllTags | 87.59 | 87.57 | 87.58 | 87.72
Lemmas | 90.01 | 89.99 | 90.00 | 90.14
UAS | 76.79 | 76.77 | 76.78 | 76.91
LAS | 70.95 | 70.93 | 70.94 | 71.05
CLAS | 63.92 | 63.11 | 63.51 | 63.22
> system("python evaluation_script/conll17_ud_eval.py -v gold.conllu predictions_spacy.conllu")
Metrics | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 97.30 | 99.02 | 98.15 |
Sentences | 84.91 | 91.97 | 88.30 |
Words | 97.30 | 99.02 | 98.15 |
UPOS | 76.81 | 78.16 | 77.48 | 78.94
XPOS | 86.71 | 88.24 | 87.47 | 89.11
Feats | 87.82 | 89.36 | 88.58 | 90.25
AllTags | 73.29 | 74.58 | 73.93 | 75.32
Lemmas | 69.16 | 70.38 | 69.77 | 71.08
UAS | 76.51 | 77.85 | 77.17 | 78.62
LAS | 70.35 | 71.59 | 70.97 | 72.30
CLAS | 63.02 | 64.52 | 63.76 | 65.57
Evaluation data from https://github.com/UniversalDependencies/UD_Spanish-Ancora release 2.0-test
Notes: None
> system("python evaluation_script/conll17_ud_eval.py -v gold.conllu predictions_udpipe.conllu")
Metrics | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 99.96 | 99.96 | 99.96 |
Sentences | 98.62 | 99.30 | 98.96 |
Words | 99.95 | 99.94 | 99.94 |
UPOS | 98.10 | 98.10 | 98.10 | 98.16
XPOS | 98.10 | 98.10 | 98.10 | 98.16
Feats | 97.49 | 97.48 | 97.49 | 97.54
AllTags | 96.84 | 96.83 | 96.83 | 96.89
Lemmas | 98.09 | 98.08 | 98.08 | 98.14
UAS | 87.72 | 87.71 | 87.72 | 87.76
LAS | 84.60 | 84.59 | 84.59 | 84.64
CLAS | 78.93 | 78.75 | 78.84 | 78.84
> system("python evaluation_script/conll17_ud_eval.py -v gold.conllu predictions_spacy.conllu")
Metrics | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 99.23 | 99.69 | 99.46 |
Sentences | 98.44 | 99.24 | 98.84 |
Words | 98.88 | 98.99 | 98.93 |
UPOS | 94.23 | 94.34 | 94.28 | 95.30
XPOS | 96.96 | 97.08 | 97.02 | 98.06
Feats | 96.53 | 96.64 | 96.58 | 97.62
AllTags | 93.02 | 93.12 | 93.07 | 94.07
Lemmas | 80.20 | 80.29 | 80.24 | 81.11
UAS | 86.67 | 86.77 | 86.72 | 87.66
LAS | 83.96 | 84.06 | 84.01 | 84.92
CLAS | 78.85 | 78.56 | 78.70 | 80.06
Evaluation data from https://github.com/UniversalDependencies/UD_Portuguese release 2.0-test
Notes: spaCy does not return morphological features, resulting in incorrect evaluation numbers for spaCy on Feats and AllTags.
> system("python evaluation_script/conll17_ud_eval.py -v gold.conllu predictions_udpipe.conllu")
Metrics | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 99.69 | 99.77 | 99.73 |
Sentences | 95.50 | 97.90 | 96.69 |
Words | 99.52 | 99.69 | 99.60 |
UPOS | 96.35 | 96.51 | 96.43 | 96.81
XPOS | 72.73 | 72.86 | 72.79 | 73.08
Feats | 93.35 | 93.51 | 93.43 | 93.80
AllTags | 71.64 | 71.76 | 71.70 | 71.98
Lemmas | 96.79 | 96.95 | 96.87 | 97.26
UAS | 86.58 | 86.73 | 86.65 | 87.00
LAS | 83.04 | 83.18 | 83.11 | 83.44
CLAS | 77.27 | 76.70 | 76.98 | 77.06
> system("python evaluation_script/conll17_ud_eval.py -v gold.conllu predictions_spacy.conllu")
Metrics | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 95.32 | 98.10 | 96.69 |
Sentences | 87.50 | 93.92 | 90.60 |
Words | 90.32 | 86.21 | 88.22 |
UPOS | 82.41 | 78.65 | 80.48 | 91.23
XPOS | 60.34 | 57.59 | 58.94 | 66.81
Feats | 30.47 | 29.09 | 29.76 | 33.74
AllTags | 24.23 | 23.13 | 23.66 | 26.83
Lemmas | 74.53 | 71.13 | 72.79 | 82.51
UAS | 72.49 | 69.19 | 70.80 | 80.26
LAS | 68.08 | 64.97 | 66.49 | 75.37
CLAS | 65.28 | 68.14 | 66.68 | 69.30
Evaluation data from https://github.com/UniversalDependencies/UD_Italian release 2.0-test
Notes: None
> system("python evaluation_script/conll17_ud_eval.py -v gold.conllu predictions_udpipe.conllu")
Metrics | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 99.91 | 99.92 | 99.91 |
Sentences | 96.73 | 98.34 | 97.53 |
Words | 99.82 | 99.85 | 99.83 |
UPOS | 97.21 | 97.24 | 97.22 | 97.38
XPOS | 97.01 | 97.03 | 97.02 | 97.18
Feats | 96.99 | 97.01 | 97.00 | 97.16
AllTags | 96.10 | 96.13 | 96.12 | 96.28
Lemmas | 97.28 | 97.31 | 97.30 | 97.46
UAS | 88.90 | 88.92 | 88.91 | 89.06
LAS | 86.20 | 86.22 | 86.21 | 86.36
CLAS | 79.81 | 79.49 | 79.65 | 79.67
> system("python evaluation_script/conll17_ud_eval.py -v gold.conllu predictions_spacy.conllu")
Metrics | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 97.10 | 94.66 | 95.86 |
Sentences | 95.74 | 97.93 | 96.82 |
Words | 90.39 | 81.89 | 85.93 |
UPOS | 82.75 | 74.96 | 78.66 | 91.55
XPOS | 86.39 | 78.27 | 82.13 | 95.58
Feats | 86.96 | 78.78 | 82.66 | 96.20
AllTags | 81.78 | 74.09 | 77.75 | 90.48
Lemmas | 71.73 | 64.98 | 68.19 | 79.36
UAS | 70.98 | 64.30 | 67.47 | 78.52
LAS | 67.31 | 60.98 | 63.99 | 74.47
CLAS | 59.85 | 61.85 | 60.84 | 66.76
Evaluation data from https://github.com/UniversalDependencies/UD_English release 2.1
Notes:
- UDPipe was trained on UD_English while spaCy was trained on OntoNotes, a different treebank, which makes the comparison tricky
- spaCy does not return morphological features, and its dependency relations do not seem to follow the same format as universaldependencies.org, probably giving false evaluation metrics on UAS and LAS
- All of this indicates that probably only the UPOS and XPOS metrics are relevant for comparison
- The sentence newsgroup-groups.google.com_n3td3v_e874a1e5eb995654_ENG_20060120_052200-0011 was also removed from the test dataset as it contained non-UTF-8 characters (a sketch of how this can be done is shown below)
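
A minimal sketch of how such a sentence can be dropped from a CoNLL-U file in R is shown below; the input and output file names are assumptions and this is not necessarily the code used in udpipe-spacy-comparison.R.

```r
## Minimal sketch; file names are assumptions, not taken from udpipe-spacy-comparison.R
lines <- readLines("en-ud-test.conllu")

## CoNLL-U sentences are separated by blank lines: group the lines per sentence
## (each separating blank line travels with the sentence that follows it)
sentence_nr <- cumsum(lines == "")
sentences   <- split(lines, sentence_nr)

## Drop the sentence with the problematic sent_id; useBytes avoids errors
## caused by the invalid (non-UTF-8) bytes in that sentence
bad  <- "newsgroup-groups.google.com_n3td3v_e874a1e5eb995654_ENG_20060120_052200-0011"
keep <- !sapply(sentences, function(x) any(grepl(bad, x, fixed = TRUE, useBytes = TRUE)))
writeLines(unlist(sentences[keep]), "gold.conllu")
```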
> system("python evaluation_script/conll17_ud_eval.py -v gold.conllu predictions_udpipe.conllu")
Metrics | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 99.15 | 98.94 | 99.05 |
Sentences | 93.49 | 96.82 | 95.13 |
Words | 99.15 | 98.94 | 99.05 |
UPOS | 93.77 | 93.58 | 93.68 | 94.58
XPOS | 93.15 | 92.95 | 93.05 | 93.95
Feats | 94.59 | 94.40 | 94.50 | 95.41
AllTags | 91.81 | 91.61 | 91.71 | 92.59
Lemmas | 96.16 | 95.96 | 96.06 | 96.99
UAS | 83.09 | 82.91 | 83.00 | 83.80
LAS | 79.97 | 79.80 | 79.88 | 80.65
CLAS | 76.20 | 75.95 | 76.07 | 76.81
> system("python evaluation_script/conll17_ud_eval.py -v gold.conllu predictions_spacy.conllu")
Metrics | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 96.71 | 98.26 | 97.48 |
Sentences | 87.70 | 94.46 | 90.96 |
Words | 96.71 | 98.26 | 97.48 |
UPOS | 79.93 | 81.21 | 80.56 | 82.65
XPOS | 89.60 | 91.04 | 90.31 | 92.65
Feats | 32.51 | 33.03 | 32.77 | 33.62
AllTags | 27.65 | 28.09 | 27.87 | 28.59
Lemmas | 85.79 | 87.17 | 86.48 | 88.72
UAS | 56.00 | 56.90 | 56.44 | 57.90
LAS | 42.46 | 43.15 | 42.80 | 43.91
CLAS | 36.69 | 43.30 | 39.72 | 44.14
Not executed, as the spaCy model is built on a different treebank, which would give rise to the same remarks as encountered in the English evaluation.