Releases: modelscope/evalscope

v1.5.1

23 Mar 07:29

Benchmark Datasets

  • Added AIME 2026 math competition benchmark
  • Added MMMLU (Multilingual Massive Multitask Language Understanding) benchmark
  • Added LongBench v2 for long-context understanding evaluation

Feature Enhancements

  • Performance Testing: Added get benchmark endpoint and fixed test connection parameter configuration
  • Performance Testing: Optimized SLA auto-tune functionality
  • Evaluation Service: Added support for returning results in table format and fixed analysis statistics issues
  • Judge Model: Added model_args support for judge LLM configuration
  • Request Tracking: Added request ID printing for better request tracing and debugging

Bug Fixes

  • Fixed security issue by replacing eval() with ast.literal_eval() for string argument parsing
  • Fixed perf tokenize and empty subset related issues
  • Fixed dataset shuffling reproducibility by using seeded random.Random
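
Two of these fixes follow standard-library patterns; a minimal sketch of both (helper names here are illustrative, not evalscope's actual API):

```python
import ast
import random

def parse_arg(value: str):
    """Safely parse a string argument into a Python literal.

    ast.literal_eval only accepts literals (numbers, strings, tuples,
    lists, dicts, sets, booleans, None), so untrusted input cannot
    execute arbitrary code the way eval() can.
    """
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value  # fall back to the raw string

def shuffled(dataset, seed=42):
    """Return a reproducibly shuffled copy of the dataset.

    A dedicated random.Random(seed) instance keeps the order
    deterministic without disturbing the global random state.
    """
    rng = random.Random(seed)
    items = list(dataset)
    rng.shuffle(items)
    return items
```

Calling `shuffled` twice with the same seed yields the same order, which is what makes an evaluation run repeatable.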

Full Changelog: v1.5.0...v1.5.1

v1.5.0

10 Mar 02:01

Benchmark Datasets

  • Math Evaluation: Added HMMT25 math benchmark
  • Code Evaluation: Added CL-bench (Tencent) benchmark
  • Fixed LiveCodeBench code extraction to use the last fenced code block

Feature Enhancements

  • Judge LLM Type: Added support for specifying judge LLM type in evaluation
  • Volcengine Sandbox: Added Volcengine sandbox environment support
  • Anthropic API: Added Anthropic API integration support
  • Performance Testing Progress Bar: Added tqdm progress display for perf dataset processing
  • Unified Subset: Updated unified-subset handling logic
  • Server Demo Update: Optimized server demo presentation

Documentation

  • Added benchmark detail documentation
  • Updated eval_type related documentation
  • Updated doc sync script

Bug Fixes

  • Fixed empty dataset skipping issue
  • Fixed rate type update issue
  • Fixed input_audio wrong prefix issue (Issue #1152)

Full Changelog: v1.4.2...v1.5.0

v1.4.2

19 Jan 05:45

Benchmark Datasets

  • Code Evaluation: Added HumanEvalPlus and MBPPPlus for code capability assessment

Feature Enhancements

  • Performance Testing: Added support for performance evaluation of Embedding and Rerank models

Documentation

  • Added general_fc best practice documentation
  • Updated collection documentation with support for custom index construction
  • Updated performance testing documentation with Embedding and Rerank model evaluation instructions

Bug Fixes

  • Fixed performance testing log output issues
  • Fixed SimpleVQA image loading issues

Full Changelog: v1.4.1...v1.4.2

v1.4.1

05 Jan 07:28

Benchmark Datasets

  • Named Entity Recognition: Added 12 NER (Named Entity Recognition) datasets
  • Speech Recognition: Added TORGO dataset for dysarthric speech recognition with SemScore evaluation
  • Multimodal Evaluation: Added RefCOCO referring expression comprehension benchmark
  • Code Evaluation: Added Terminal-bench for terminal command capability assessment

Feature Enhancements

  • Performance Testing: Added SLA auto-tuning functionality to optimize performance testing experience
  • Service Mode: Added asynchronous service support and Gradio UI interface
  • Data Loading: Optimized local JSONL dataset loading functionality

Bug Fixes

  • Fixed HallusionBench data loading issues
  • Fixed SSE chunk handling in streaming response parsing
  • Fixed SemScore computation errors
  • Fixed eval_config loading related issues

Full Changelog: v1.4.0...v1.4.1

v1.4.0

16 Dec 09:17

Benchmark Datasets

  • General Evaluation: Added EQ-Bench, ZebraLogicBench for reasoning and logic evaluation
  • Code Evaluation: Added MultiplE and MBPP for code capability assessment
  • Speech Evaluation: Added FLEURS, LibriSpeech for speech recognition benchmarks

Feature Enhancements

  • Performance Visualization: Added ClearML visualization support for performance (perf) monitoring
  • Service API: Added service api functionality for more flexible service invocation
  • Lazy Model Loading: Added lazy model support to optimize model loading mechanism
  • Retry Mechanism: Added retry function to improve evaluation stability
  • Sandbox Optimization: Updated sandbox with connection pool support and MultiplE multilingual code evaluation
  • Random Algorithm: Updated performance testing random algorithm for improved accuracy
  • UI Enhancement: Dashboard now supports HTTP params configuration
  • Progress Bar: Updated tqdm progress display mechanism
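
The retry mechanism mentioned above is, in general form, a backoff wrapper around a flaky call; a minimal sketch of that pattern (decorator name and parameters here are illustrative, not evalscope's actual API):

```python
import time
import functools

def retry(max_attempts=3, base_delay=1.0, exceptions=(Exception,)):
    """Retry a flaky call, doubling the delay between attempts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts, surface the error
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator
```

Wrapping a model API call this way smooths over transient network or rate-limit failures without masking persistent errors.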

Documentation

  • Updated custom VQA documentation
  • Updated parameter configuration documentation
  • Updated benchmarks documentation
  • Updated service documentation
  • Updated MTEB related links

Bug Fixes

  • Fixed command-line parameter issues (--analysis-report, --dataset-dir, etc.)
  • Fixed token throughput calculation at concurrency 1
  • Fixed benchmark loading issues (ChartQA, TAU2, OmniDocBench, etc.)
  • Fixed SWE-bench image build and MRCR leading newline support
  • Fixed NLTK resource checking issues

Full Changelog: v1.3.0...v1.4.0

v1.3.0

28 Nov 07:51

Benchmark Datasets

  • Multimodal Evaluation: Added A_OKVQA, CMMU, CMMMU, ScienceQA, V*Bench, MicroVQA and other multimodal benchmarks
  • Code Evaluation: Added SWE-bench_Verified, SWE-bench_Lite, SWE-bench_Verified_mini, SciCode for code capability assessment
  • General Evaluation: Added GSM8K-V, MGSM, IFBench, OpenAI MRCR and other benchmarks

Feature Enhancements

  • Custom Evaluation: Added support for custom function-call evaluation
  • Custom VQA: Added support for custom Visual Question Answering (VQA) evaluation
  • Parameter Extension: Added extra_param_spec functionality for more flexible parameter configuration
  • Aggregate Scoring: Updated aggregation (agg) parameters to optimize scoring aggregation mechanism
  • Performance Testing: Optimized performance (perf) related parameter configuration

Documentation

  • Updated eval_type related documentation
  • Updated collection documentation with support for custom evaluation index construction

Bug Fixes

  • Fixed perf completion endpoint streaming issues
  • Fixed error log display for judge model
  • Fixed --no-test-connection parameter action issue
  • Fixed error handling for function-call test cases (Issue #1005)
  • Fixed model args related issues

Full Changelog: v1.2.0...v1.3.0

v1.2.0

11 Nov 04:58

Benchmark Datasets

  • Added multiple MCQA (Multiple Choice Question Answering) datasets
  • Added Drivelology benchmark
  • Updated BFCL-v3 and added support for BFCL-v4 benchmark
  • Updated tau-bench and added support for tau2-bench
  • Added support for WMT machine translation evaluation and related metrics

Feature Enhancements

  • Optimized answer extraction mechanism - making the answer extraction process more explicit and controllable
  • Added support for batch metric computation, such as Bertscore
  • Updated aggregate scoring functionality - added metric aggregations including pass@k, vote@k, pass^k, etc.
  • Updated OpenAI API parameters - optimized API call parameter configuration
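
The pass@k aggregation named above is commonly computed with the unbiased estimator from the HumanEval paper; a minimal sketch, assuming n generated samples per problem of which c are correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one
    of k samples drawn (without replacement) from n total samples,
    c of them correct, is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a draw of k
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=10 samples and c=3 correct, pass@1 estimates 0.3 while pass@5 rises to roughly 0.92.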

Data Source Updates

  • Updated SimpleQA data source - using the latest SimpleQA data
  • Aligned AIME to AA standard - unified evaluation standards
  • Updated MMLU-Pro - using the latest MMLU-Pro data

Bug Fixes

  • Fixed the issue with DROP dataset when few_shot_num=3
  • Fixed buffer decoding error - resolved decode buffer related issues

Full Changelog: v1.1.1...v1.2.0

v1.1.1

27 Oct 09:11

Updates

  1. Benchmark Extensions
  • Vision/Multimodal Evaluation: HallusionBench, POPE, PloyMath, MathVerse, MathVision, SimpleVQA, SeedBench2_plus
  • Document Understanding: OmniDocBench
  • NLP Tasks: CoNLL2003, NER Task Collection (9 tasks), AA-LCR
  • Logic Reasoning: VisuLogic, ZeroBench
  2. Feature Enhancements
  • Optimized perf functionality to achieve results comparable to vLLM benchmarking, see documentation
  • Enhanced sandbox environment usage in code evaluation, supporting both local and remote execution modes, see documentation
  3. Performance and Stability Improvements
  • Fixed prompt tokens calculation issues in datasets
  • Added heartbeat detection mechanism during evaluation process
  • Fixed GSM8K accuracy calculation and enhanced logging
  4. System Requirements Update
  • Python Version Requirement: Raised to ≥3.10 (no dependency updates)

Full Changelog: v1.1.0...v1.1.1

v1.1.0

14 Oct 09:20

Update

  • The platform now supports OCRBench, OCRBench-v2, DocVQA, InfoVQA, ChartQA, BLINK, and other multimodal evaluation benchmarks; refer to the documentation for the full list of supported datasets.
  • Developed best practice guidelines for evaluating models with Qwen3-Omni and Qwen3-VL.
  • Installation via pyproject.toml is now supported.

Full Changelog: v1.0.2...v1.1.0

v1.0.2

23 Sep 09:30

New Features

  • Code evaluation benchmarks (HumanEval, LiveCodeBench) now support execution in a sandbox environment. To utilize this feature, you must first install ms-enclave.
  • Added support for various image-text multimodal evaluation benchmarks such as RealWorldQA, AI2D, MMStar, MMBench, OmniBench, as well as pure text evaluation benchmarks like Multi-IF, HealthBench, and AMC.

Full Changelog: v1.0.1...v1.0.2