This list keeps track of which tasks' implementations have been ported to YAML / v2.0 of the Eval Harness.
Boxes should be checked iff a task is implemented in the refactor and tested for regression. A task should be struck through once it has been checked against the implementation from the paper that introduced it or against a popularizing implementation. (WIP) denotes that a PR already exists for the task or that someone is already working on it.
- Glue
- SuperGlue
- CoQA
- DROP
- Lambada
- Lambada (Cloze variants)
- Lambada (Multilingual)
- Wikitext
- PiQA
- PROST
- MCTACO
- Pubmed QA
- SciQ
- QASPER
- QA4MRE
- TriviaQA
- AI2 ARC
- LogiQA
- HellaSwag
- SWAG
- OpenBookQA
- SQuADv2 (Lintang)
- RACE
- HeadQA
- MathQA
- WebQs
- WSC273
- Winogrande
- ANLI
- Hendrycks Ethics (missing some tasks/metrics, see PR #660 for more info)
- TruthfulQA (mc1)
- TruthfulQA (mc2)
- TruthfulQA (gen)
- MuTual
- Hendrycks Math (Hailey)
- Asdiv
- GSM8k
- Arithmetic
- MMMLU (Hailey)
- Translation (WMT) suite
- Unscramble
- Pile (perplexity)
- BLiMP
- ToxiGen
- StoryCloze
- NaturalQs (Hailey)
- CrowS-Pairs
- XCopa
- BIG-Bench (Hailey)
- XStoryCloze
- XWinograd
- PAWS-X
- XNLI
- MGSM
- SCROLLS
- Babi
- Belebele
Tasks added in the revamped harness that were not previously available. As above, a strikethrough denotes that the task has been checked against its original implementation or against the published results that introduced it.
- TheoremQA
- Theorem Proving evaluations
- Chain of Thought
- Self-consistency; Least-to-Most prompting, etc.
- Summarization Tasks
- Anthropic Model-Written Evals