NLP evaluation is in trouble! Many evaluation benchmarks have been found in pre-training datasets, compromising scientific results. The LM Contamination Index is a manually curated database of contamination evidence for LMs. Please refer to the blog post or the repository for more information. The table below uses the following labels:
- The dataset is Contaminated if evidence of contamination has been found, i.e., either:
  - the dataset was found in the pre-training data (in this case, the contamination percentage is also reported), or
  - a model trained on the corpus is able to generate dataset examples.
- The dataset is Suspicious if there are signs of contamination (the model is aware of some detail or structure of the dataset) but no clear evidence was found.
- The dataset is Clean if neither evidence nor signs of contamination have been found.
- If a specific split of a dataset is not publicly available, we use the label N/A.
- A missing label means that no experiment was performed.
The source column indicates whether the information comes from user reports in the repository or from a paper.
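The decision rules behind these labels can be sketched as a small function. This is only an illustration of the scheme described above; the function and type names are assumptions, not part of the LM Contamination Index itself.

```python
from enum import Enum


class Label(Enum):
    """Labels used in the table, as described above (sketch)."""
    CONTAMINATED = "Contaminated"
    SUSPICIOUS = "Suspicious"
    CLEAN = "Clean"
    NA = "N/A"           # split not publicly available
    NO_LABEL = None      # no experiment was performed


def label_dataset(split_public: bool,
                  experiment_done: bool,
                  found_in_pretraining: bool = False,
                  model_generates_examples: bool = False,
                  signs_of_contamination: bool = False) -> Label:
    """Apply the labeling rules from the text to one dataset split.

    Hypothetical helper: flag names are our own shorthand for the
    evidence types listed above.
    """
    if not split_public:
        return Label.NA
    if not experiment_done:
        return Label.NO_LABEL
    # Clear evidence: found in pre-training data, or the model can
    # generate dataset examples.
    if found_in_pretraining or model_generates_examples:
        return Label.CONTAMINATED
    # Signs only (e.g. the model knows the dataset's structure),
    # but no clear evidence.
    if signs_of_contamination:
        return Label.SUSPICIOUS
    return Label.CLEAN
```

For example, a publicly available split whose examples a model can regenerate would be labeled Contaminated, while one where the model merely knows the dataset's format would be Suspicious.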