v1.0 Tasks

This list keeps track of which tasks' implementations have been ported to YAML / v2.0 of the Eval Harness.

A box should be checked if and only if the task is implemented in the refactor and tested for regression. A task should be struck through once it has been checked against the implementation from the paper that introduced it or the implementation that popularized it. (WIP) denotes that a PR exists or someone is already working on the task.

  • Glue
  • SuperGlue
  • CoQA
  • DROP
  • Lambada
  • Lambada (Cloze variants)
  • Lambada (Multilingual)
  • Wikitext
  • PiQA
  • PROST
  • MCTACO
  • Pubmed QA
  • SciQ
  • QASPER
  • QA4MRE
  • TriviaQA
  • AI2 ARC
  • LogiQA
  • HellaSwag
  • SWAG
  • OpenBookQA
  • SQuADv2 (Lintang)
  • RACE
  • HeadQA
  • MathQA
  • WebQs
  • WSC273
  • Winogrande
  • ANLI
  • Hendrycks Ethics (missing some tasks/metrics, see PR #660 for more info)
  • TruthfulQA (mc1)
  • TruthfulQA (mc2)
  • TruthfulQA (gen)
  • MuTual
  • Hendrycks Math (Hailey)
  • Asdiv
  • GSM8k
  • Arithmetic
  • MMLU (Hailey)
  • Translation (WMT) suite
  • Unscramble
  • Pile (perplexity)
  • BLiMP
  • ToxiGen
  • StoryCloze
  • NaturalQs (Hailey)
  • CrowS-Pairs
  • XCopa
  • BIG-Bench (Hailey)
  • XStoryCloze
  • XWinograd
  • PAWS-X
  • XNLI
  • MGSM
  • SCROLLS
  • Babi
  • Belebele

Novel Tasks

Tasks added in the revamped harness that were not previously available. As above, a strikethrough denotes that the task has been checked against its original implementation or against the published results that introduced it.

Task Wishlist

  • TheoremQA
  • Theorem Proving evaluations
  • Chain of Thought
  • Self-consistency, Least-to-Most prompting, etc.
  • Summarization Tasks
  • Anthropic Model-Written Evals