Releases: modelscope/evalscope

v1.5.1

23 Mar 07:29

Benchmark Datasets

  • Added AIME 2026 math competition benchmark
  • Added MMMLU (Multilingual Massive Multitask Language Understanding) benchmark
  • Added LongBench v2 for long-context understanding evaluation

Feature Enhancements

  • Performance Testing: Added get benchmark endpoint and fixed test connection parameter configuration
  • Performance Testing: Optimized SLA auto-tune functionality
  • Evaluation Service: Added support for returning results in table format and fixed analysis statistics issues
  • Judge Model: Added model_args support for judge LLM configuration
  • Request Tracking: Added request ID printing for better request tracing and debugging

Bug Fixes

  • Fixed security issue by replacing eval() with ast.literal_eval() for string argument parsing
  • Fixed perf tokenize and empty subset related issues
  • Fixed dataset shuffling reproducibility by using seeded random.Random
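
Two of these fixes follow standard-library patterns; a minimal sketch of both (helper names here are illustrative, not evalscope's actual API):

```python
import ast
import random

def parse_arg(value: str):
    """Safely parse a string argument into a Python literal.

    ast.literal_eval only accepts literals (numbers, strings, tuples,
    lists, dicts, sets, booleans, None), so untrusted input cannot
    execute arbitrary code the way eval() can.
    """
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value  # fall back to the raw string

def shuffled(dataset, seed=42):
    """Return a reproducibly shuffled copy of the dataset.

    A dedicated random.Random(seed) instance keeps the order
    deterministic without disturbing the global random state.
    """
    rng = random.Random(seed)
    items = list(dataset)
    rng.shuffle(items)
    return items
```

Calling `shuffled` twice with the same seed yields the same order, which is what makes an evaluation run repeatable.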

Full Changelog: v1.5.0...v1.5.1

v1.5.0

10 Mar 02:01

Benchmark Datasets

  • Math Evaluation: Added HMMT25 math benchmark
  • Code Evaluation: Added CL-bench (Tencent) benchmark
  • Fixed LiveCodeBench code extraction to use the last fenced code block

Feature Enhancements

  • Judge LLM Type: Added support for specifying judge LLM type in evaluation
  • Volcengine Sandbox: Added Volcengine sandbox environment support
  • Anthropic API: Added Anthropic API integration support
  • Performance Testing Progress Bar: Added tqdm progress display for perf dataset processing
  • Unified Subset: Updated unified-subset handling logic
  • Server Demo Update: Optimized server demo presentation

Documentation

  • Added benchmark detail documentation
  • Updated eval_type related documentation
  • Updated doc sync script

Bug Fixes

  • Fixed empty dataset skipping issue
  • Fixed rate type update issue
  • Fixed input_audio wrong prefix issue (Issue #1152)

Full Changelog: v1.4.2...v1.5.0

v1.4.2

19 Jan 05:45

Benchmark Datasets

  • Code Evaluation: Added HumanEvalPlus and MBPPPlus for code capability assessment

Feature Enhancements

  • Performance Testing: Added support for performance evaluation of Embedding and Rerank models

Documentation

  • Added general_fc best practice documentation
  • Updated collection documentation with support for custom index construction
  • Updated performance testing documentation with Embedding and Rerank model evaluation instructions

Bug Fixes

  • Fixed performance testing log output issues
  • Fixed SimpleVQA image loading issues

Full Changelog: v1.4.1...v1.4.2

v1.4.1

05 Jan 07:28

Benchmark Datasets

  • Named Entity Recognition: Added 12 NER (Named Entity Recognition) datasets
  • Speech Recognition: Added TORGO dataset for dysarthric speech recognition with SemScore evaluation
  • Multimodal Evaluation: Added RefCOCO referring expression comprehension benchmark
  • Code Evaluation: Added Terminal-bench for terminal command capability assessment

Feature Enhancements

  • Performance Testing: Added SLA auto-tuning functionality to optimize performance testing experience
  • Service Mode: Added asynchronous service support and Gradio UI interface
  • Data Loading: Optimized local JSONL dataset loading functionality

Bug Fixes

  • Fixed HallusionBench data loading issues
  • Fixed SSE chunk handling in streaming response parsing
  • Fixed SemScore computation errors
  • Fixed eval_config loading related issues

Full Changelog: v1.4.0...v1.4.1

v1.4.0

16 Dec 09:17

Benchmark Datasets

  • General Evaluation: Added EQ-Bench, ZebraLogicBench for reasoning and logic evaluation
  • Code Evaluation: Added MultiplE and MBPP for code capability assessment
  • Speech Evaluation: Added FLEURS, LibriSpeech for speech recognition benchmarks

Feature Enhancements

  • Performance Visualization: Added ClearML visualization support for performance (perf) monitoring
  • Service API: Added service api functionality for more flexible service invocation
  • Lazy Model Loading: Added lazy model support to optimize model loading mechanism
  • Retry Mechanism: Added retry function to improve evaluation stability
  • Sandbox Optimization: Updated sandbox with connection pool support and MultiplE multilingual code evaluation
  • Random Algorithm: Updated performance testing random algorithm for improved accuracy
  • UI Enhancement: Dashboard now supports HTTP params configuration
  • Progress Bar: Updated tqdm progress display mechanism
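
The retry mechanism mentioned above is, in general form, a backoff wrapper around a flaky call; a minimal sketch of that pattern (decorator name and parameters here are illustrative, not evalscope's actual API):

```python
import time
import functools

def retry(max_attempts=3, base_delay=1.0, exceptions=(Exception,)):
    """Retry a flaky call, doubling the delay between attempts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts, surface the error
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator
```

Wrapping a model API call this way smooths over transient network or rate-limit failures without masking persistent errors.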

Documentation

  • Updated custom VQA documentation
  • Updated parameter configuration documentation
  • Updated benchmarks documentation
  • Updated service documentation
  • Updated MTEB related links

Bug Fixes

  • Fixed command-line parameter issues (--analysis-report, --dataset-dir, etc.)
  • Fixed token throughput calculation at concurrency 1
  • Fixed benchmark loading issues (ChartQA, TAU2, OmniDocBench, etc.)
  • Fixed SWE-bench image build and MRCR leading newline support
  • Fixed NLTK resource checking issues

Full Changelog: v1.3.0...v1.4.0

v1.3.0

28 Nov 07:51

Benchmark Datasets

  • Multimodal Evaluation: Added A_OKVQA, CMMU, CMMMU, ScienceQA, V*Bench, MicroVQA and other multimodal benchmarks
  • Code Evaluation: Added SWE-bench_Verified, SWE-bench_Lite, SWE-bench_Verified_mini, SciCode for code capability assessment
  • General Evaluation: Added GSM8K-V, MGSM, IFBench, OpenAI MRCR and other benchmarks

Feature Enhancements

  • Custom Evaluation: Added support for custom function-call evaluation
  • Custom VQA: Added support for custom Visual Question Answering (VQA) evaluation
  • Parameter Extension: Added extra_param_spec functionality for more flexible parameter configuration
  • Aggregate Scoring: Updated aggregation (agg) parameters to optimize scoring aggregation mechanism
  • Performance Testing: Optimized performance (perf) related parameter configuration

Documentation

  • Updated eval_type related documentation
  • Updated collection documentation with support for custom evaluation index construction

Bug Fixes

  • Fixed perf completion endpoint streaming issues
  • Fixed error log display for judge model
  • Fixed --no-test-connection parameter action issue
  • Fixed error handling for function-call test cases (Issue #1005)
  • Fixed model args related issues

Full Changelog: v1.2.0...v1.3.0

v1.2.0

11 Nov 04:58

Benchmark Datasets

  • Added multiple MCQA (Multiple Choice Question Answering) datasets
  • Added Drivelology benchmark
  • Updated BFCL-v3 and added support for BFCL-v4 benchmark
  • Updated tau-bench and added support for tau2-bench
  • Added support for WMT machine translation evaluation and related metrics

Feature Enhancements

  • Optimized answer extraction mechanism - making the answer extraction process more explicit and controllable
  • Added support for batch metric computation, such as Bertscore
  • Updated aggregate scoring functionality - added metric aggregations including pass@k, vote@k, pass^k, etc.
  • Updated OpenAI API parameters - optimized API call parameter configuration
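
The pass@k aggregation named above is commonly computed with the unbiased estimator from the HumanEval paper; a minimal sketch, assuming n generated samples per problem of which c are correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one
    of k samples drawn (without replacement) from n total samples,
    c of them correct, is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a draw of k
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=10 samples and c=3 correct, pass@1 estimates 0.3 while pass@5 rises to roughly 0.92.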

Data Source Updates

  • Updated SimpleQA data source - using the latest SimpleQA data
  • Aligned AIME to AA standard - unified evaluation standards
  • Updated MMLU-Pro - using the latest MMLU-Pro data

Bug Fixes

  • Fixed the issue with DROP dataset when few_shot_num=3
  • Fixed buffer decoding error - resolved decode buffer related issues

Full Changelog: v1.1.1...v1.2.0

v1.1.1

27 Oct 09:11

Updates

  1. Benchmark Extensions
  • Vision/Multimodal Evaluation: HallusionBench, POPE, PloyMath, MathVerse, MathVision, SimpleVQA, SeedBench2_plus
  • Document Understanding: OmniDocBench
  • NLP Tasks: CoNLL2003, NER Task Collection (9 tasks), AA-LCR
  • Logic Reasoning: VisuLogic, ZeroBench
  2. Feature Enhancements
  • Optimized perf functionality to achieve results comparable to vLLM benchmarking, see documentation
  • Enhanced sandbox environment usage in code evaluation, supporting both local and remote execution modes, see documentation
  3. Performance and Stability Improvements
  • Fixed prompt tokens calculation issues in datasets
  • Added heartbeat detection mechanism during evaluation process
  • Fixed GSM8K accuracy calculation and enhanced logging
  4. System Requirements Update
  • Python Version Requirement: Raised to ≥3.10 (no dependency updates)

Full Changelog: v1.1.0...v1.1.1

v1.1.0

14 Oct 09:20

Update

  • The platform now supports OCRBench, OCRBench-v2, DocVQA, InfoVQA, ChartQA, BLINK, and other multimodal evaluation benchmarks; refer to the documentation for the full list of supported datasets.
  • Developed best practice guidelines for evaluating models with Qwen3-Omni and Qwen3-VL.
  • Installation via pyproject.toml is now supported.

Full Changelog: v1.0.2...v1.1.0

v1.0.2

23 Sep 09:30

New Features

  • Code evaluation benchmarks (HumanEval, LiveCodeBench) now support execution in a sandbox environment. To utilize this feature, you must first install ms-enclave.
  • Added support for various image-text multimodal evaluation benchmarks such as RealWorldQA, AI2D, MMStar, MMBench, OmniBench, as well as pure text evaluation benchmarks like Multi-IF, HealthBench, and AMC.

Full Changelog: v1.0.1...v1.0.2