
Releases: EleutherAI/lm-evaluation-harness

v0.4.6

25 Nov 13:38
9d36354

lm-eval v0.4.6 Release Notes

This release brings important changes to chat template handling, expands our task library with new multilingual and multimodal benchmarks, and includes various bug fixes.

Backwards Incompatibilities

Chat Template Delimiter Handling

An important modification has been made to how delimiters are handled when applying chat templates in request construction, particularly affecting multiple-choice tasks. This change ensures better compatibility with chat models by respecting their native formatting conventions.

📝 For detailed documentation, please refer to docs/chat-template-readme.md

New Benchmarks & Tasks

Multilingual Expansion

  • Spanish Bench: Enhanced benchmark with additional tasks by @zxcvuser in #2390
  • Japanese Leaderboard: New comprehensive Japanese language benchmark by @sitfoxfly in #2439

New Task Collections

  • Multimodal Unitxt: Added support for multimodal tasks available in Unitxt by @elronbandel in #2364
  • Metabench: New benchmark contributed by @kozzy97 in #2357

As well as several slight fixes or changes to existing tasks (as noted via the incrementing of versions).

Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)

What's Changed

New Contributors

Full Changelog: v0.4.5...v0.4.6

v0.4.5

08 Oct 21:06
0845b58

lm-eval v0.4.5 Release Notes

New Additions

Prototype Support for Vision Language Models (VLMs)

We're excited to introduce prototype support for Vision Language Models (VLMs) in this release, using model types hf-multimodal and vllm-vlm. This allows for evaluation of models that can process text and image inputs and produce text outputs. Currently we have added support for the MMMU (mmmu_val) task and we welcome contributions and feedback from the community!

New VLM-Specific Arguments

VLMs can be configured with several new arguments within --model_args to support their specific requirements:

  • max_images (int): Set the maximum number of images for each prompt.
  • interleave (bool): Determines the positioning of image inputs. When True (the default), images are interleaved with the text; when False, all images are placed at the front of the text. This is model-dependent.

hf-multimodal specific args:

  • image_token_id (int) or image_string (str): Specifies a custom token or string for image placeholders. For example, Llava models expect an "<image>" string to indicate the location of images in the input, while Qwen2-VL models expect an "<|image_pad|>" sentinel string instead. This will be inferred from model configuration files whenever possible, but we recommend confirming whether an override is needed when testing a new model family.
  • convert_img_format (bool): Whether to convert the images to RGB format.

Example usage:

  • lm_eval --model hf-multimodal --model_args pretrained=llava-hf/llava-1.5-7b-hf,attn_implementation=flash_attention_2,max_images=1,interleave=True,image_string=<image> --tasks mmmu_val --apply_chat_template

  • lm_eval --model vllm-vlm --model_args pretrained=llava-hf/llava-1.5-7b-hf,max_images=1,interleave=True --tasks mmmu_val --apply_chat_template
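For programmatic use, the same evaluation can also be launched through the Python API. The sketch below mirrors the first CLI example above; the exact keyword names (model_args as a comma-separated string, apply_chat_template as a boolean) are assumptions based on the flags shown and may need adjusting for your installed version.

import lm_eval

# Minimal sketch of an equivalent Python API call for the hf-multimodal backend.
# Model and task names are taken from the CLI example above; keyword names are assumed.
results = lm_eval.simple_evaluate(
    model="hf-multimodal",
    model_args="pretrained=llava-hf/llava-1.5-7b-hf,max_images=1,interleave=True,image_string=<image>",
    tasks=["mmmu_val"],
    apply_chat_template=True,  # most VLMs need their chat template applied
)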

Important considerations

  1. Chat Template: Most VLMs require the --apply_chat_template flag to ensure proper input formatting according to the model's expected chat template.
  2. Image Limits: Some VLMs are limited to processing a single image per prompt; for these models, always set max_images=1. Additionally, certain models expect image placeholders not to be interleaved with the text, requiring interleave=False.
  3. Performance and Compatibility: When working with VLMs, be mindful of potential memory constraints and processing times, especially when handling multiple images or complex tasks.

Tested VLM Models

We have most notably tested the implementation with the following models:

  • llava-hf/llava-1.5-7b-hf
  • llava-hf/llava-v1.6-mistral-7b-hf
  • Qwen/Qwen2-VL-2B-Instruct
  • HuggingFaceM4/idefics2 (requires the latest transformers from source)

New Tasks

Several new tasks have been contributed to the library for this version!

New tasks as of v0.4.5 include:

As well as several slight fixes or changes to existing tasks (as noted via the incrementing of versions).

Backwards Incompatibilities

Finalizing group versus tag split

We've now fully deprecated the use of group keys directly within a task's configuration file. The appropriate key to use for this purpose is now solely tag. See the v0.4.4 patch notes for more information on migration if you maintain a set of task YAMLs outside the Eval Harness repository.

Handling of Causal vs. Seq2seq backend in HFLM

In HFLM, logic specific to handling inputs for Seq2seq (encoder-decoder models like T5) versus Causal (decoder-only autoregressive models, the vast majority of current LMs) models previously hinged on checking whether self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM. Some users may want to use causal model behavior but set self.AUTO_MODEL_CLASS to a different factory class, such as transformers.AutoModelForVision2Seq.

As a result, those users who subclass HFLM but do not call HFLM.__init__() may now also need to set the self.backend attribute to either "causal" or "seq2seq" during initialization themselves.

While this should not affect a large majority of users, for those who subclass HFLM in potentially advanced ways, see #2353 for the full set of changes.
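For illustration, a minimal sketch of what such a subclass might now look like (the class name and __init__ body are hypothetical; only the self.backend assignment reflects the new requirement):

import transformers
from lm_eval.models.huggingface import HFLM

class MyVision2SeqLM(HFLM):
    # Hypothetical subclass that performs its own setup instead of calling HFLM.__init__().
    def __init__(self, pretrained: str, **kwargs):
        # ... custom model/processor loading would go here ...
        self.AUTO_MODEL_CLASS = transformers.AutoModelForVision2Seq
        # New in v0.4.5: causal-vs-seq2seq input handling is selected via self.backend
        # rather than inferred from AUTO_MODEL_CLASS, so set it explicitly.
        self.backend = "causal"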

Future Plans

We intend to further expand our multimodal support to a wider set of vision-language tasks, as well as a broader set of model types, and are actively seeking user feedback!

Thanks, the LM Eval Harness team (@baberabb @haileyschoelkopf @lintangsutawika)

What's Changed

New Contributors

Full Changelog: https://github.com/Eleu...


v0.4.4

05 Sep 15:13
543617f

lm-eval v0.4.4 Release Notes

New Additions

  • This release includes the Open LLM Leaderboard 2 official task implementations! These can be run by using --tasks leaderboard. Thank you to the HF team (@clefourrier, @NathanHB, @KonradSzafer, @lozovskaya) for contributing these; you can read more about their Open LLM Leaderboard 2 release here.

  • API support is overhauled! Now: support for concurrent requests, chat templates, tokenization, batching, and improved customization. This makes API support more generalizable to new providers and should dramatically speed up API model inference.

    • The URL can be specified by passing base_url to --model_args (for example, base_url=http://localhost:8000/v1/completions); concurrent requests are controlled with the num_concurrent argument, and tokenization with tokenized_requests. A combined example appears after this list.
    • Other arguments (such as top_p, top_k, etc.) can be passed to the API using --gen_kwargs as usual.
    • Note: Instruct-tuned models, not just base models, can be used with local-completions using --apply_chat_template (either with or without tokenized_requests).
      • They can also be used with local-chat-completions (e.g., with an OpenAI Chat API endpoint), but only the former supports loglikelihood tasks (e.g. multiple-choice). This is because ChatCompletion-style APIs generally do not provide access to logits for prompt/input tokens, preventing easy measurement of log probabilities for multi-token continuations.
    • example with OpenAI completions API (using vllm serve):
      • lm_eval --model local-completions --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,num_concurrent=10,tokenized_requests=True,tokenizer_backend=huggingface,max_length=4096 --apply_chat_template --batch_size 1 --tasks mmlu
    • example with chat API:
      • lm_eval --model local-chat-completions --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,num_concurrent=10 --apply_chat_template --tasks gsm8k
    • We recommend evaluating Llama-3.1-405B models by serving them with vLLM and then running under local-completions!
  • We've reworked the Task Grouping system to make it clearer when an aggregated average score across multiple subtasks should and should not be reported. See the Backwards Incompatibilities section below for more information on changes and migration instructions.

  • A combination of data-parallel and model-parallel (using HF's device_map functionality for "naive" pipeline parallel) inference using --model hf is now supported, thank you to @NathanHB and team!
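To make the new API arguments concrete, here is one possible local-completions invocation combining base_url, num_concurrent, and tokenized_requests (the model name and endpoint are placeholders and assume a vLLM server is already running at localhost:8000):

  • lm_eval --model local-completions --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v1/completions,num_concurrent=10,tokenized_requests=True,tokenizer_backend=huggingface --tasks gsm8k --apply_chat_template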

Other new additions include a number of miscellaneous bugfixes and much more. Thank you to all contributors who helped out on this release!

New Tasks

A number of new tasks have been contributed to the library.

As a further discoverability improvement, lm_eval --tasks list now shows all tasks, tags, and groups in a prettier format, along with (if applicable) where to find the associated config file for a task or group! Thank you to @anthony-dipofi for working on this.

New tasks as of v0.4.4 include:

Backwards Incompatibilities

tags versus groups, and how to migrate

Previously, we supported the ability to group a set of tasks together, generally for two purposes: 1) to have an easy-to-call shortcut for a set of tasks one might want to frequently run simultaneously, and 2) to allow for "parent" tasks like mmlu to aggregate and report a unified score across a set of component "subtasks".

There were two ways to add a task to a given group name: 1) to provide (a list of) values to the group field in a given subtask's config file:

# this is a *task* yaml file.
group: group_name1
task: my_task1
# rest of task config goes here...

or 2) to define a "group config file" and specify a group along with its constituent subtasks:

# this is a group's yaml file
group: group_name1
task:
  - subtask_name1
  - subtask_name2
  # ...

These would both have the same effect of reporting an averaged metric for group_name1 when calling lm_eval --tasks group_name1. However, in use-case 1) (simply registering a shorthand for a list of tasks one is interested in), reporting an aggregate score can be undesirable or ill-defined.

We've now separated out these two use-cases ("shorthand" groupings and hierarchical subtask collections) into a tag and group property separately!

To register a shorthand (now called a tag), simply change the group field name within your task's config to tag (group_alias keys will no longer be supported in task configs):

# this is a *task* yaml file.
tag: tag_name1
task: my_task1
# rest of task config goes here...

Group config files may remain as is if aggregation is not desired. To opt-in to reporting aggregated scores across a group's subtasks, add the following to your group config file:

# this is a group's yaml file
group: group_name1
task:
  - subtask_name1
  - subtask_name2
  # ...
### New! Needed to turn on aggregation ###
aggregate_metric_list:
  - metric: acc # placeholder. Note that all subtasks in this group must report an `acc` metric key
    weight_by_size: True # whether to report *micro*- or *macro*-averaged scores across subtasks. Defaults to `True`.

Please see our documentation here for more information. We apologize for any headaches this migration may create; however, we believe separating these two functionalities will make it less likely for users to encounter confusion or errors caused by unwanted aggregation.

Future Plans

We're planning to make more planning documents public and standardize on (likely) 1 new PyPI release per month! Stay tuned.

Thanks, the LM Eval Harness team (@haileyschoelkopf @lintangsutawika @baberabb)

What's Changed


v0.4.3

01 Jul 14:00
3fa4fd7

lm-eval v0.4.3 Release Notes

We're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future.

New Additions

The big new feature is the often-requested Chat Templating, contributed by @KonradSzafer @clefourrier @NathanHB and also worked on by a number of other awesome contributors!

You can now run using a chat template with --apply_chat_template and a system prompt of your choosing using --system_instruction "my sysprompt here". The --fewshot_as_multiturn flag can control whether each few-shot example in context is a new conversational turn or not.

This feature is currently only supported for model types hf and vllm, but we intend to gather feedback on improvements and also extend it to other relevant models, such as APIs.
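As a concrete illustration (the model name and task below are placeholders, not an official recommendation), the new flags can be combined as follows:

  • lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct --tasks gsm8k --num_fewshot 5 --apply_chat_template --fewshot_as_multiturn --system_instruction "You are a helpful assistant."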

There's a lot more to check out, including:

  • Logging results to the HF Hub if desired using --hf_hub_log_args, by @KonradSzafer and team!

  • NeMo model support by @sergiopperez !

  • Anthropic Chat API support by @tryuman !

  • DeepSparse and SparseML model types by @mgoin !

  • Handling of delta-weights in HF models, by @KonradSzafer !

  • LoRA support for VLLM, by @bcicc !

  • Fixes to PEFT modules which add new tokens to the embedding layers, by @mapmeld !

  • Fixes to handling of BOS tokens in multiple-choice loglikelihood settings, by @djstrong !

  • The use of custom Sampler subclasses in tasks, by @LSinev !

  • The ability to specify "hardcoded" few-shot examples more cleanly, by @clefourrier !

  • Support for Ascend NPUs (--device npu) by @statelesshz, @zhabuye, @jiaqiw09 and others!

  • Logging of higher_is_better in results tables for clearer understanding of eval metrics by @zafstojano !

  • Extra info logged about models, including tokenizers, chat templating, and more, by @artemorloff @djstrong and others!

  • Miscellaneous bug fixes! And many more great contributions we weren't able to list here.

New Tasks

We had a number of new tasks contributed. A listing of subfolders and a brief description of the tasks contained in them can now be found at lm_eval/tasks/README.md. We hope this makes it easier to locate the definitions of relevant tasks: visit that page first, then consult the README.md within the appropriate lm_eval/tasks subfolder for further information on each task it contains. Thank you to @anthonydipofi @Harryalways317 @nairbv @sepiatone and others for working on this and giving feedback!

Without further ado, the tasks:

  • ACLUE, a benchmark for Ancient Chinese understanding, by @haonan-li
  • BasqueGlue and EusExams, two Basque-language tasks by @juletx
  • TMMLU+, an evaluation for Traditional Chinese, contributed by @ZoneTwelve
  • XNLIeu, a Basque version of XNLI, by @juletx
  • Pile-10K, a perplexity eval taken from a subset of the Pile's validation set, contributed by @mukobi
  • FDA, SWDE, and Squad-Completion zero-shot tasks by @simran-arora and team
  • Added back the hendrycks_math task, the MATH task using the prompt and answer parsing from the original Hendrycks et al. MATH paper rather than Minerva's prompt and parsing
  • COPAL-ID, a natively-Indonesian commonsense benchmark, contributed by @Erland366
  • tinyBenchmarks variants of the Open LLM Leaderboard 1 tasks, by @LucWeber and team!
  • Glianorex, a benchmark for testing performance on fictional medical questions, by @maximegmd
  • New FLD (formal logic) task variants by @MorishT
  • Improved translations of Lambada Multilingual tasks, added by @zafstojano
  • NoticIA, a Spanish summarization dataset by @ikergarcia1996
  • The Paloma perplexity benchmark, added by @zafstojano
  • We've removed the AMMLU dataset due to concerns about auto-translation quality.
  • Added the localized, not translated, ArabicMMLU dataset, contributed by @Yazeed7 !
  • BertaQA, a Basque cultural knowledge benchmark, by @juletx
  • New machine-translated ARC-C datasets by @jonabur !
  • CommonsenseQA, in a prompt format following Llama, by @murphybrendan
  • ...

Backwards Incompatibilities

The save format for logged results has now changed.

Output files will now be written to:

  • {output_path}/{sanitized_model_name}/results_YYYY-MM-DDTHH-MM-SS.xxxxx.json if --output_path is set, and
  • {output_path}/{sanitized_model_name}/samples_{task_name}_YYYY-MM-DDTHH-MM-SS.xxxxx.jsonl for each task's samples if --log_samples is set.

e.g. outputs/gpt2/results_2024-06-28T00-00-00.00001.json and outputs/gpt2/samples_lambada_openai_2024-06-28T00-00-00.00001.jsonl.

See #1926 for utilities which may help to work with these new filenames.

Future Plans

In general, we'll be doing our best to keep up with the strong interest and large number of contributions we've seen coming in!

  • The official Open LLM Leaderboard 2 tasks will be landing soon in the Eval Harness main branch and subsequently in v0.4.4 on PyPI!

  • The fact that groups of tasks attempt, by default, to report an aggregated score across their constituent subtasks has been a sharp edge. We are finishing up some internal reworking to distinguish between groups that do report aggregate scores (think mmlu) and tags, which are simply a convenient shortcut for calling a set of tasks one might want to run at once (think the pythia grouping: a collection of tasks one might want individual results on, but where averaging doesn't make sense).

  • We'd also like to improve the API model support in the Eval Harness from its current state.

  • More to come!

Thank you to everyone who's contributed to or used the library!

Thanks, @haileyschoelkopf @lintangsutawika

What's Changed


v0.4.2

18 Mar 13:07
4600d6b

lm-eval v0.4.2 Release Notes

We are releasing a new minor version of lm-eval for PyPI users! We've been very happy to see continued usage of the lm-evaluation-harness, including as a standard testbench to propel new architecture design (https://arxiv.org/abs/2402.18668), to ease new benchmark creation (https://arxiv.org/abs/2402.11548, https://arxiv.org/abs/2402.00786, https://arxiv.org/abs/2403.01469), to enable controlled experimentation on LLM evaluation (https://arxiv.org/abs/2402.01781), and more!

New Additions

  • Request Caching by @inf3rnus - speedups on startup via caching the construction of documents/requests’ contexts
  • Weights and Biases logging by @ayulockin - evals can now be logged to both WandB and Zeno!
  • New Tasks
  • Re-introduction of TemplateLM base class for lower-code new LM class implementations by @anjor
  • Run the library with the metrics/scoring stage skipped via --predict_only, by @baberabb (see the example after this list)
  • Many more miscellaneous improvements by a lot of great contributors!
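For example, a sketch of a --predict_only run (the model and task are placeholders; --log_samples and --output_path are included so the generated predictions are written to disk):

  • lm_eval --model hf --model_args pretrained=gpt2 --tasks lambada_openai --predict_only --log_samples --output_path outputs/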

Backwards Incompatibilities

There were a few breaking changes to lm-eval's general API or logic we'd like to highlight:

TaskManager API

Previously, users had to call lm_eval.tasks.initialize_tasks() to register the library's default tasks, or lm_eval.tasks.include_path() to include a custom directory of task YAML configs.

Old usage:

import lm_eval

lm_eval.tasks.initialize_tasks() 
# or:
lm_eval.tasks.include_path("/path/to/my/custom/tasks")

lm_eval.simple_evaluate(model=lm, tasks=["arc_easy"])

New intended usage:

import lm_eval

# optional -- only need to instantiate separately if you want to pass a custom path!
task_manager = lm_eval.tasks.TaskManager()  # pass include_path="/path/to/my/custom/tasks" if desired

lm_eval.simple_evaluate(model=lm, tasks=["arc_easy"], task_manager=task_manager)

get_task_dict() now also optionally takes a TaskManager object when loading custom tasks.
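For example, a minimal sketch assuming the get_task_dict(task_name_list, task_manager) signature (the custom-task directory below is hypothetical):

from lm_eval.tasks import TaskManager, get_task_dict

# include_path is optional; point it at your own task YAML directory if you have one
task_manager = TaskManager(include_path="/path/to/my/custom/tasks")
task_dict = get_task_dict(["arc_easy"], task_manager)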

This should allow for much faster library startup times due to lazily loading requested tasks or groups.

Updated Stderr Aggregation

Previous versions of the library reported erroneously large stderr scores for groups of tasks such as MMLU.

We've since updated the formula to correctly aggregate standard error scores for groups of tasks whose accuracies are aggregated via their mean across the combined dataset; see #1390 and #1427 for more information.
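For intuition only (the exact estimator is the one implemented in #1390 and #1427, which may differ in detail), if a group score is the size-weighted mean of K independent subtask scores with sizes n_i and standard errors s_i, a standard aggregation of the group standard error is:

\mathrm{SE}_{\text{group}} = \frac{1}{N}\sqrt{\sum_{i=1}^{K} n_i^{2}\, s_i^{2}}, \qquad N = \sum_{i=1}^{K} n_i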

As always, please feel free to give us feedback or request new features! We're grateful for the community's support.

What's Changed


v0.4.1

31 Jan 15:29
a0a2fec

Release Notes

This release contains all changes since the release of v0.4.0, and is partly a test of our release automation, provided by @anjor.

At a high level, some of the changes include:

  • Data-parallel inference using vLLM (contributed by @baberabb )
  • A major fix to Hugging Face model generation: previously, in v0.4.0, generations were sometimes cut off too early due to a bug in stop-sequence handling.
  • Miscellaneous documentation updates
  • A number of new tasks, and bugfixes to old tasks!
  • Support for OpenAI-like API models using local-completions or local-chat-completions (thanks to @veekaybee, @mgoin, @anjor, and others for this)!
  • Integration with tools for visualizing results, such as Zeno, with WandB support coming soon!

More frequent (minor) version releases may be done in the future, to make it easier for PyPI users!

We're very pleased by the uptick in interest in LM Evaluation Harness recently, and we hope to continue to improve the library as time goes on. We're grateful to everyone who's contributed, and are excited by how many new contributors this version brings! If you have feedback for us, or would like to help out developing the library, please let us know.

In the next version release, we hope to include:

  • Chat Templating + System Prompt support, for locally-run models
  • Improved Answer Extraction for many generative tasks, making them easier to run zero-shot and less dependent on model output formatting
  • General speedups and QoL fixes to the non-inference portions of LM-Evaluation-Harness, including drastically reduced startup times and faster non-inference processing steps, especially when num_fewshot is large!
  • A new TaskManager object and the deprecation of lm_eval.tasks.initialize_tasks(), making it easier to register many tasks and configure new groups of tasks

What's Changed


v0.4.0

04 Dec 15:08
c9bbec6

What's Changed


v0.3.0

08 Dec 08:34

HuggingFace Datasets Integration

This release integrates HuggingFace datasets as the core dataset management interface, removing previous custom downloaders.

What's Changed

New Contributors

Full Changelog: v0.2.0...v0.3.0

v0.2.0

07 Mar 02:12

Major changes since 0.1.0:

  • added blimp (#237)
  • added qasper (#264)
  • added asdiv (#244)
  • added truthfulqa (#219)
  • added gsm (#260)
  • implemented description dict and deprecated provide_description (#226)
  • new --check_integrity flag to run integrity unit tests at eval time (#290)
  • positional arguments to evaluate and simple_evaluate are now deprecated
  • _CITATION attribute on task modules (#292)
  • lots of bug fixes and task fixes (always remember to report task versions for comparability!)

v0.0.1

02 Sep 02:28
Rename package