NLP evaluation is in trouble! Many evaluation benchmarks have been found in pre-training datasets, compromising scientific results. The LM Contamination Index is a manually curated database of contamination evidence for LMs. Please refer to the blog post or the repository for more information. The table below uses the following labels:
- The dataset is Contaminated if evidence of contamination has been found, i.e., either:
  - the dataset was found in the pre-training data (in this case, the contamination percentage is also reported), or
  - a model trained on the corpus is able to generate dataset examples.
- The dataset is Suspicious if there are signs of contamination (the model is aware of some detail or structure of the dataset) but no clear evidence was found.
- The dataset is Clean if neither evidence nor signs of contamination have been found.
- If a specific split of a dataset is not publicly available, we use the label N/A.
- A missing label means that no experiment was performed.
The source column indicates whether the information comes from user reports in the repository or from a paper.
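The decision rules behind these labels can be sketched as a small function. This is only an illustration of the scheme described above; the function and type names are assumptions, not part of the LM Contamination Index itself.

```python
from enum import Enum


class Label(Enum):
    """Labels used in the table, as described above (sketch)."""
    CONTAMINATED = "Contaminated"
    SUSPICIOUS = "Suspicious"
    CLEAN = "Clean"
    NA = "N/A"           # split not publicly available
    NO_LABEL = None      # no experiment was performed


def label_dataset(split_public: bool,
                  experiment_done: bool,
                  found_in_pretraining: bool = False,
                  model_generates_examples: bool = False,
                  signs_of_contamination: bool = False) -> Label:
    """Apply the labeling rules from the text to one dataset split.

    Hypothetical helper: flag names are our own shorthand for the
    evidence types listed above.
    """
    if not split_public:
        return Label.NA
    if not experiment_done:
        return Label.NO_LABEL
    # Clear evidence: found in pre-training data, or the model can
    # generate dataset examples.
    if found_in_pretraining or model_generates_examples:
        return Label.CONTAMINATED
    # Signs only (e.g. the model knows the dataset's structure),
    # but no clear evidence.
    if signs_of_contamination:
        return Label.SUSPICIOUS
    return Label.CLEAN
```

For example, a publicly available split whose examples a model can regenerate would be labeled Contaminated, while one where the model merely knows the dataset's format would be Suspicious.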