-
BLEU
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: a method for automatic evaluation of machine translation. In Proceedings of meeting of the association for computational linguistics (pp. 311–318).
BLEU has frequently been reported as correlating well with human judgement, and it remains a benchmark for the assessment of any new evaluation metric. There are, however, a number of criticisms. Although in principle capable of evaluating translations of any language, BLEU cannot, in its present form, deal with languages lacking word boundaries. It has also been argued that, despite BLEU's significant advantages, there is no guarantee that an increase in BLEU score indicates improved translation quality.
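For intuition, here is a minimal sketch of corpus-level BLEU (clipped modified n-gram precision, a brevity penalty, and a geometric mean over n = 1..4), assuming one tokenized reference per candidate and no smoothing; it is an illustration, not the reference implementation used below:
from collections import Counter
import math

def ngram_counts(tokens, n):
    # count all n-grams of a token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(candidates, references, max_n=4):
    # candidates, references: lists of token lists, one reference per candidate
    log_precisions = []
    for n in range(1, max_n + 1):
        clipped, total = 0, 0
        for cand, ref in zip(candidates, references):
            cand_counts = ngram_counts(cand, n)
            ref_counts = ngram_counts(ref, n)
            # clip each candidate n-gram count by its count in the reference
            clipped += sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
            total += sum(cand_counts.values())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    cand_len = sum(len(c) for c in candidates)
    ref_len = sum(len(r) for r in references)
    # brevity penalty: penalize candidates shorter than the references
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return bp * math.exp(sum(log_precisions) / max_n)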
-
ROUGE-L
Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Proceedings of meeting of the association for computational linguistics (pp. 74–81).
ROUGE-L is based on Longest Common Subsequence (LCS) statistics. The longest common subsequence naturally takes sentence-level structure similarity into account and automatically identifies the longest in-sequence co-occurring n-grams.
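A minimal sketch of sentence-level ROUGE-L over tokens (LCS-based precision, recall, and F-measure; beta = 1.2 mirrors the value used in the coco-caption code and is treated here as an assumption):
def lcs_length(x, y):
    # classic dynamic-programming longest common subsequence length
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    # candidate, reference: lists of tokens; beta > 1 weights recall over precision
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)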
-
METEOR
Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of meeting of the association for computational linguistics (pp. 65–72).
The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. It also has several features that are not found in other metrics, such as stemming and synonymy matching, along with the standard exact word matching. The metric was designed to fix some of the problems found in the more popular BLEU metric, and also produce good correlation with human judgement at the sentence or segment level. This differs from the BLEU metric in that BLEU seeks correlation at the corpus level.
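Ignoring stemming and synonymy matching, the score combines unigram precision and recall with a fragmentation penalty roughly as in the sketch below (formulas as given in the 2005 paper; the word alignment that produces precision, recall, the number of matched unigrams, and the number of contiguous matched chunks is not shown):
def meteor_from_alignment(precision, recall, matches, chunks):
    # recall is weighted 9 times higher than precision in the harmonic mean
    if precision == 0.0 or recall == 0.0:
        return 0.0
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # fragmentation penalty: fewer, longer contiguous chunks of matches are rewarded
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)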
-
CIDEr
Vedantam, R., Zitnick, C. L., & Parikh, D. (2015). CIDEr: Consensus-based image description evaluation. In Proceedings of IEEE conference on computer vision and pattern recognition (pp. 4566–4575).
-
SPICE
Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). SPICE: Semantic propositional image caption evaluation. In Proceedings European conference on computer vision (pp. 382–398).
Please check salaniz/pycocoevalcap for installation of pycocotools and pycocoevalcap.
Evaluation code for MS COCO caption generation.
This repository provides Python 3 support for the caption evaluation metrics used for the MS COCO dataset.
The code is derived from the original repository that supports Python 2.7: https://github.com/tylin/coco-caption. Caption evaluation depends on the COCO API that natively supports Python 3.
- Java 1.8.0
- Python 3.6
To install pycocoevalcap and the pycocotools dependency (https://github.com/cocodataset/cocoapi), run:
pip install pycocoevalcap
- SPICE requires the download of Stanford CoreNLP 3.6.0 code and models. This will be done automatically the first time the SPICE evaluation is performed.
- Note: SPICE will try to create a cache of parsed sentences in ./spice/cache/. This dramatically speeds up repeated evaluations. The cache directory can be moved by setting 'CACHE_DIR' in ./spice. In the same file, caching can be turned off by removing the '-cache' argument to 'spice_cmd'.
This repo is mainly based on the code from pycocotools and pycocoevalcap, which is designed for the evaluation of MS COCO caption generation. Here the API is simplified so that the same evaluation tool can be used with other caption datasets, such as Flickr8k, Flickr30k, or any other dataset.
Two json files holding the reference captions and the candidate captions are required in example/. example/main.py reads these two json files, evaluates the scores automatically, and prints them. Example references.json and captions.json (candidate captions) files are shown in example/. To generate these files, check the demo below:
# Collect all references from the dataset as a dict: image_id -> list of reference captions
# Collect all captions generated by the model as a dict: image_id -> single-element list
references = {
    "1": ["this is a tree", "this is an apple"],
    "2": ["a man is sitting", "a man in the street"],
    # ...
}
captions = {
    "1": ["this is a big tree"],
    "2": ["a man is sitting"],
    # ...
}

# Save them as json files in the expected format
import json

new_cap = []
for k, v in captions.items():
    new_cap.append({'image_id': k, 'caption': v[0]})

new_ref = {'images': [], 'annotations': []}
for k, refs in references.items():
    new_ref['images'].append({'id': k})
    for ref in refs:
        new_ref['annotations'].append({'image_id': k, 'id': k, 'caption': ref})

with open('references.json', 'w') as fgts:
    json.dump(new_ref, fgts)
with open('captions.json', 'w') as fres:
    json.dump(new_cap, fres)
Then check whether the saved references.json and captions.json have the same format as the demo references_example.json and captions_example.json (a small sanity-check script is sketched after the examples):
references.json
{
"images": [
{"id": "0"},
{"id": "1"},
......
],
"annotations": [
{
"image_id": "0",
"id": "0",
"caption": "A man with a red helmet on a small moped on a dirt road. "
},
{
"image_id": "0",
"id": "0",
"caption": "Man riding a motor bike on a dirt road on the countryside."
},
{
"image_id": "0",
"id": "0",
"caption": "A man riding on the back of a motorcycle."
},
{
"image_id": "0",
"id": "0",
"caption": "A dirt path with a young person on a motor bike rests to the foreground of a verdant area with a bridge and a background of cloud-wreathed mountains. "
},
{
"image_id": "0",
"id": "0",
"caption": "A man in a red shirt and a red hat is on a motorcycle on a hill side."
},
{
"image_id": "1",
"id": "1",
"caption": "A woman wearing a net on her head cutting a cake. "
},
{
"image_id": "1",
"id": "1",
"caption": "A woman cutting a large white sheet cake."
},
{
"image_id": "1",
"id": "1",
"caption": "A woman wearing a hair net cutting a large sheet cake."
},
{
"image_id": "1",
"id": "1",
"caption": "there is a woman that is cutting a white cake"
},
{
"image_id": "1",
"id": "1",
"caption": "A woman marking a cake with the back of a chef's knife. "
},
......
]
}
captions.json
[
{
"image_id": "0",
"caption": "a man standing on the side of a road ."
},
{
"image_id": "1",
"caption": "a person standing in front of a mirror ."
},
......
]
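A quick sanity check of the generated files (a hypothetical helper, assuming both files sit in the current directory) could look like this:
import json

with open('references.json') as f:
    refs = json.load(f)
with open('captions.json') as f:
    caps = json.load(f)

# references.json must contain 'images' and 'annotations' lists
assert set(refs) >= {'images', 'annotations'}
assert all({'image_id', 'id', 'caption'} <= set(a) for a in refs['annotations'])
# captions.json must be a list with exactly one entry per image_id
assert all({'image_id', 'caption'} <= set(c) for c in caps)
assert len({c['image_id'] for c in caps}) == len(caps)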
Then, from inside example/, run main.py with the following command:
python main.py
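Under the hood, main.py presumably wraps the standard pycocoevalcap entry points roughly as follows (a sketch based on the upstream pycocoevalcap example; the json file names are the ones generated above):
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# load the reference annotations and the candidate captions
coco = COCO('references.json')
coco_result = coco.loadRes('captions.json')

# run all scorers (BLEU, METEOR, ROUGE_L, CIDEr, SPICE)
coco_eval = COCOEvalCap(coco, coco_result)
coco_eval.evaluate()

for metric, score in coco_eval.eval.items():
    print(f'{metric}: {score:.3f}')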
Terminal output:
>>
loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
Loading and preparing results...
DONE (t=0.00s)
creating index...
index created!
tokenization...
PTBTokenizer tokenized 72388 tokens at 846674.96 tokens per second.
PTBTokenizer tokenized 12514 tokens at 290819.68 tokens per second.
setting up scorers...
computing Bleu score...
{'testlen': 10476, 'reflen': 10274, 'guess': [10476, 9476, 8476, 7476], 'correct': [7043, 3379, 1518, 669]}
ratio: 1.0196612809031516
Bleu_1: 0.672
Bleu_2: 0.490
Bleu_3: 0.350
Bleu_4: 0.249
computing METEOR score...
METEOR: 0.201
computing Rouge score...
ROUGE_L: 0.472
computing CIDEr score...
CIDEr: 0.457
computing SPICE score...
Parsing reference captions
Initiating Stanford parsing pipeline
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[main] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ...
done [0.2 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [0.8 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.6 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.3 sec].
Threads( StanfordCoreNLP ) [01:03.436 minutes]
Parsing test captions
Threads( StanfordCoreNLP ) [3.322 seconds]
SPICE evaluation took: 1.182 min
SPICE: 0.137
Bleu_1: 0.672
Bleu_2: 0.490
Bleu_3: 0.350
Bleu_4: 0.249
METEOR: 0.201
ROUGE_L: 0.472
CIDEr: 0.457
SPICE: 0.137