GitHub - XChen-Zero/OneEval: OneEval: Open EvalScope evaluation artifacts for LLMs — subset breakdowns, pass@k curves, and reproducible evaluation protocols.

Motivation

Open LLM evaluation results are frequently hard to audit and difficult to reproduce. Two recurring gaps appear across benchmark reports and evaluation framework issue threads:

the exact evaluation protocol (sampling knobs, run repetitions, and runtime assumptions) is often underspecified
rich benchmarks are commonly reduced to headline aggregates, obscuring subset-level behavior (e.g., MMLU-Pro domains) or multi-sample behavior (e.g., pass@k curves)

OneEval addresses these gaps by publishing the evidence needed for inspection: the launch code used for the suite, a sanitized protocol summary, and the detailed result slices that explain how an overall number is composed.

Scope

This repository is not a leaderboard and does not publish a composite score. The emphasis is on artifact release and auditability rather than ranking.

What Is Released

site/: the static website (GitHub Pages), organized by benchmark type
published_results/: the public result tree, organized as models/<model>/<benchmark>/<mode>/<run_id>/
site/data/: site-side data bundles, derived from the public results
evaluation_code/: launch scripts and targeted monkey patches used for the runs
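
As a rough illustration of how the `models/<model>/<benchmark>/<mode>/<run_id>/` layout can be walked, the sketch below lists every run directory under a local checkout of published_results/. The `RunRef` type and `iter_runs` helper are hypothetical names introduced here, not part of the repository, and the sketch assumes nothing about the files stored inside each run directory.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class RunRef(NamedTuple):
    """One published run, identified by its four path components (hypothetical helper type)."""
    model: str
    benchmark: str
    mode: str
    run_id: str
    path: Path


def iter_runs(root: Path) -> Iterator[RunRef]:
    """Yield one RunRef per models/<model>/<benchmark>/<mode>/<run_id>/ directory under root."""
    models_dir = root / "models"
    if not models_dir.is_dir():
        return
    # Four wildcard levels: model, benchmark, mode, run_id.
    for run_dir in sorted(models_dir.glob("*/*/*/*")):
        if run_dir.is_dir():
            model, benchmark, mode, run_id = run_dir.relative_to(models_dir).parts
            yield RunRef(model, benchmark, mode, run_id, run_dir)


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for run in iter_runs(Path("published_results")):
        print(run.model, run.benchmark, run.mode, run.run_id)
```

Grouping the resulting tuples by `benchmark` or `model` is then enough to check which runs back a given table on the site.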

Reading Guide (Website)

The website organizes benchmarks into four reading tracks with benchmark-specific views:

Knowledge
Agentic
IF (Instruction Following)
Reasoning

The tables prioritize academic readability:

QA-style benchmarks expose Correct / Incorrect / Abstain explicitly
subset-heavy benchmarks use an overall table with subset drilldowns
pass@k benchmarks provide milestone summaries (k=1/8/32/64) and an interactive curve view
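
For readers unfamiliar with pass@k, the sketch below shows one common way to obtain the milestone values from repeated samples: the standard unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn for a problem and c the number that were correct. Whether the published curves use exactly this estimator is an assumption; the launch code in evaluation_code/ is the authoritative reference. The function names are illustrative only.

```python
from math import comb
from typing import Dict, Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def milestone_summary(n: int, c: int, ks: Sequence[int] = (1, 8, 32, 64)) -> Dict[int, float]:
    """pass@k at the milestone values of k, skipping any k larger than n."""
    return {k: pass_at_k(n, c, k) for k in ks if k <= n}


if __name__ == "__main__":
    # Hypothetical example: 64 samples for one problem, 16 of them correct.
    for k, value in milestone_summary(n=64, c=16).items():
        print(f"pass@{k} = {value:.4f}")
```

A benchmark-level curve would average these per-problem estimates over all problems at each k.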

Benchmarks In This Release

Current benchmark set (as published in the site data):

Knowledge: chinese_simpleqa, gpqa_diamond, mmlu_pro, simple_qa, super_gpqa
Agentic: bfcl_v3
IF: ifeval
Reasoning: aime24, aime25, hmmt25, zebralogicbench

Evaluation Stack

The evaluation stack used for the published artifacts:

EvalScope: 1.4.1
BFCL Eval: 2026.2.9
Qwen3 series and Llama family local serving: sglang 0.5.6
Qwen3 series and Llama family agent tooling: qwen_agent 0.0.31
Qwen3.5 series inference path: DashScope-compatible API
Qwen3.5 series agent tooling: qwen_agent 0.0.34

Operational notes (as reflected in the published protocol summary on the site):

Qwen3 family models, including DeepSeek-R1-Qwen3 variants, are evaluated under the unified sampling protocol documented on the site.
Llama-family runs use a fixed single repeat wherever the sampling configuration is deterministic.
BFCL v3 agentic evaluation uses a YaRN-extended context window of 131072 tokens where applicable.
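
To make the 131072 figure concrete, the sketch below derives a YaRN scaling factor from a target and a native context length. The native length of 32768 is an assumption taken from typical Qwen3 model cards, not from this repository, and the dictionary keys follow the Hugging Face-style `rope_scaling` convention; the actual serving arguments used for the runs are those in evaluation_code/.

```python
from typing import Dict, Union

TARGET_CONTEXT = 131072  # from the operational note above
NATIVE_CONTEXT = 32768   # ASSUMPTION: native window of the served checkpoint


def yarn_rope_scaling(target: int, native: int) -> Dict[str, Union[str, float, int]]:
    """Build a Hugging Face-style rope_scaling dict for YaRN (illustrative helper)."""
    if native <= 0 or target < native:
        raise ValueError("target must be at least the positive native context length")
    return {
        "rope_type": "yarn",
        # Ratio between the extended and the native context length.
        "factor": target / native,
        "original_max_position_embeddings": native,
    }


if __name__ == "__main__":
    print(yarn_rope_scaling(TARGET_CONTEXT, NATIVE_CONTEXT))
```

Under these assumptions the factor is 131072 / 32768 = 4.0.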

Notes On Use

This repository ships a materialized public release (published_results/ and site/data/). The public repository is intended for inspection, browsing, and reuse of the released artifacts.

To preview the site locally, start a static file server from the repository root:

.venv/bin/python -m http.server 8000

Then open http://localhost:8000/site/.

Citation

If you use OneEval in a report or derivative analysis, please cite the repository:

@misc{oneeval,
  title        = {OneEval: Open-model evaluation artifacts},
  author       = {Chen, Xuan and Chen, Qiuxuan and Liu, Bo},
  year         = {2026},
  howpublished = {GitHub repository},
  note         = {Accessed: 2026-03-03},
  url          = {https://github.com/XChen-Zero/OneEval/}
}
