Terms of access: this resource is permitted for use as a test set, but not as a training set, and should not be uploaded to the internet where web crawlers can access it (for example, as plain text on GitHub or in an academic PDF). Please ensure adherence to the terms detailed in the paper. If you are unsure about your specific case, don't hesitate to contact: [email protected].
CoverBench: A Challenging Benchmark for Complex Claim Verification
Link: https://arxiv.org/abs/2408.03325
Abstract: There is a growing line of research on verifying the correctness of language models' outputs. At the same time, LMs are being used to tackle complex queries that require reasoning. We introduce CoverBench, a challenging benchmark focused on verifying LM outputs in complex reasoning settings. Datasets that can be used for this purpose are often designed for other complex reasoning tasks (e.g., QA) targeting specific use-cases (e.g., financial tables), requiring transformations, negative sampling and selection of hard examples to collect such a benchmark. CoverBench provides a diversified evaluation for complex claim verification in a variety of domains, types of reasoning, relatively long inputs, and a variety of standardizations, such as multiple representations for tables where available, and a consistent schema. We manually vet the data for quality to ensure low levels of label noise. Finally, we report a variety of competitive baseline results to show CoverBench is challenging and has very significant headroom.
This dataset is derived from a collection of other datasets (see paper), for which we generated claims for verification using models. Where applicable, each example records which model generated it. The original data from each source dataset remains subject to that dataset's original license.
When citing our work, please cite the 9 source datasets we used as well!
Important Update
On 5 Sep 2024 the CoverBench data file was updated to reflect fixes. We were notified about an error in the PubMedQA part of the data: approximately 40 to 50 examples were affected by a simple bug in our preparation. As of the update, the data file should be correct. Sorry!
Usage
To load the dataset:
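```python
# Install the Hugging Face `datasets` library first, e.g. `pip install datasets`
# (in a notebook cell: `! pip install datasets`).
from datasets import load_dataset

# Load the benchmark and select its 'eval' split.
coverbench = load_dataset("google/coverbench")['eval']
```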
This is an evaluation benchmark. It should not be included in training data for NLP models.
Please do not redistribute any part of the dataset without sufficient protection against web-crawlers.
A 64-character identifier string is added to each instance in the dataset to assist in future detection of contamination in web-crawl corpora.
The CoverBench dataset's string is: CoverBench:hEBhLMcvwQFuAjcV94zZuPS5iWJp8zv1cEywyEwHKWfGrIKiXodDRcjRY4PtbgwZ
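One way the identifier string might be used is a plain substring scan over crawled documents. The following is only a minimal sketch under that assumption: the helper names `contains_canary` and `find_contaminated` are hypothetical and are not part of the dataset or of any library.

```python
# Identifier string copied verbatim from this dataset card.
CANARY = "CoverBench:hEBhLMcvwQFuAjcV94zZuPS5iWJp8zv1cEywyEwHKWfGrIKiXodDRcjRY4PtbgwZ"


def contains_canary(text: str) -> bool:
    """Return True if the CoverBench identifier string appears in `text`."""
    return CANARY in text


def find_contaminated(docs):
    """Return the indices of documents that contain the identifier string.

    `docs` is any iterable of strings (hypothetical input: e.g. pages from
    a web-crawl corpus that is being screened before training).
    """
    return [i for i, doc in enumerate(docs) if contains_canary(doc)]


if __name__ == "__main__":
    sample = ["an unrelated page", "leaked example " + CANARY, "another page"]
    print(find_contaminated(sample))
```

An exact substring match only catches verbatim copies of the string; text that was re-tokenized, truncated, or otherwise altered would need a looser check.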