Releases: EleutherAI/lm-evaluation-harness
v0.4.6
lm-eval v0.4.6 Release Notes
This release brings important changes to chat template handling, expands our task library with new multilingual and multimodal benchmarks, and includes various bug fixes.
Backwards Incompatibilities
Chat Template Delimiter Handling
An important modification has been made to how delimiters are handled when applying chat templates in request construction, particularly affecting multiple-choice tasks. This change ensures better compatibility with chat models by respecting their native formatting conventions.
📝 For detailed documentation, please refer to docs/chat-template-readme.md
New Benchmarks & Tasks
Multilingual Expansion
- Spanish Bench: Enhanced benchmark with additional tasks by @zxcvuser in #2390
- Japanese Leaderboard: New comprehensive Japanese language benchmark by @sitfoxfly in #2439
New Task Collections
- Multimodal Unitxt: Added support for multimodal tasks available in Unitxt by @elronbandel in #2364
- Metabench: New benchmark contributed by @kozzy97 in #2357
As well as several slight fixes or changes to existing tasks (as noted via the incrementing of versions).
Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)
What's Changed
- Add Unitxt Multimodality Support by @elronbandel in #2364
- Add new tasks to spanish_bench and fix duplicates by @zxcvuser in #2390
- fix typo bug for minerva_math by @renjie-ranger in #2404
- Fix: Turkish MMLU Regex Pattern by @ArdaYueksel in #2393
- fix storycloze datanames by @t1101675 in #2409
- Update NoticIA prompt by @ikergarcia1996 in #2421
- [Fix] Replace generic exception classes with more specific ones by @LSinev in #1989
- Support for IBM watsonx_llm by @Medokins in #2397
- Fix package extras for watsonx support by @kiersten-stokes in #2426
- Fix lora requests when dp with vllm by @ckgresla in #2433
- Add xquad task by @zxcvuser in #2435
- Add verify_certificate argument to local-completion by @sjmonson in #2440
- Add GPTQModel support for evaluating GPTQ models by @Qubitium in #2217
- Add missing task links by @Sypherd in #2449
- Update CODEOWNERS by @haileyschoelkopf in #2453
- Add real process_docs example by @Sypherd in #2456
- Modify label errors in catcola and paws-x by @zxcvuser in #2434
- Add Japanese Leaderboard by @sitfoxfly in #2439
- Typos: Fix 'loglikelihood' misspellings in api_models.py by @RobGeada in #2459
- use global `multi_choice_filter` for mmlu_flan by @baberabb in #2461
- typo by @baberabb in #2465
- pass device_map other than auto for parallelize by @baberabb in #2457
- OpenAI ChatCompletions: switch `max_tokens` by @baberabb in #2443
- Ifeval: Download `punkt_tab` on rank 0 by @baberabb in #2267
- Fix chat template; fix leaderboard math by @baberabb in #2475
- change warning to debug by @baberabb in #2481
- Updated wandb logger to use `new_printer()` instead of `get_printer(...)` by @alex-titterton in #2484
- IBM watsonx_llm fixes & refactor by @Medokins in #2464
- Fix revision parameter to vllm get_tokenizer by @OyvindTafjord in #2492
- update pre-commit hooks and git actions by @baberabb in #2497
- kbl-v0.1.1 by @whwang299 in #2493
- Add mamba hf to `mamba_ssm` by @baberabb in #2496
- remove duplicate `arc_ca` tag by @baberabb in #2499
- Add metabench task to LM Evaluation Harness by @kozzy97 in #2357
- Nits by @baberabb in #2500
- [API models] parse tokenizer_backend=None properly by @baberabb in #2509
New Contributors
- @renjie-ranger made their first contribution in #2404
- @t1101675 made their first contribution in #2409
- @Medokins made their first contribution in #2397
- @kiersten-stokes made their first contribution in #2426
- @ckgresla made their first contribution in #2433
- @sjmonson made their first contribution in #2440
- @Qubitium made their first contribution in #2217
- @Sypherd made their first contribution in #2449
- @sitfoxfly made their first contribution in #2439
- @RobGeada made their first contribution in #2459
- @alex-titterton made their first contribution in #2484
- @OyvindTafjord made their first contribution in #2492
- @whwang299 made their first contribution in #2493
- @kozzy97 made their first contribution in #2357
Full Changelog: v0.4.5...v0.4.6
v0.4.5
lm-eval v0.4.5 Release Notes
New Additions
Prototype Support for Vision Language Models (VLMs)
We're excited to introduce prototype support for Vision Language Models (VLMs) in this release, using the model types `hf-multimodal` and `vllm-vlm`. This allows for evaluation of models that can process text and image inputs and produce text outputs. Currently we have added support for the MMMU (`mmmu_val`) task, and we welcome contributions and feedback from the community!
New VLM-Specific Arguments
VLM models can be configured with several new arguments within `--model_args` to support their specific requirements:
- `max_images` (int): Set the maximum number of images for each prompt.
- `interleave` (bool): Determines the positioning of image inputs. When `True` (default), images are interleaved with the text. When `False`, all images are placed at the front of the text. This is model dependent.

`hf-multimodal`-specific args:
- `image_token_id` (int) or `image_string` (str): Specifies a custom token or string for image placeholders. For example, Llava models expect an `"<image>"` string to indicate the location of images in the input, while Qwen2-VL models expect an `"<|image_pad|>"` sentinel string instead. This will be inferred from model configuration files whenever possible, but we recommend confirming whether an override is needed when testing a new model family.
- `convert_img_format` (bool): Whether to convert the images to RGB format.
Example usage:
- `lm_eval --model hf-multimodal --model_args pretrained=llava-hf/llava-1.5-7b-hf,attn_implementation=flash_attention_2,max_images=1,interleave=True,image_string=<image> --tasks mmmu_val --apply_chat_template`
- `lm_eval --model vllm-vlm --model_args pretrained=llava-hf/llava-1.5-7b-hf,max_images=1,interleave=True --tasks mmmu_val --apply_chat_template`
Important considerations
- Chat Template: Most VLMs require the `--apply_chat_template` flag to ensure proper input formatting according to the model's expected chat template.
- Some VLM models are limited to processing a single image per prompt. For these models, always set `max_images=1`. Additionally, certain models expect image placeholders to be non-interleaved with the text, requiring `interleave=False`.
- Performance and Compatibility: When working with VLMs, be mindful of potential memory constraints and processing times, especially when handling multiple images or complex tasks.
Tested VLM Models
So far, we have most notably tested the implementation with the following models:
- llava-hf/llava-1.5-7b-hf
- llava-hf/llava-v1.6-mistral-7b-hf
- Qwen/Qwen2-VL-2B-Instruct
- HuggingFaceM4/idefics2 (requires the latest `transformers` from source)
New Tasks
Several new tasks have been contributed to the library for this version!
New tasks as of v0.4.5 include:
- Open Arabic LLM Leaderboard tasks, contributed by @shahrzads @Malikeh97 in #2232
- MMMU (validation set), by @haileyschoelkopf @baberabb @lintangsutawika in #2243
- TurkishMMLU by @ArdaYueksel in #2283
- PortugueseBench, SpanishBench, GalicianBench, BasqueBench, and CatalanBench aggregate multilingual tasks in #2153 #2154 #2155 #2156 #2157 by @zxcvuser and others
As well as several slight fixes or changes to existing tasks (as noted via the incrementing of versions).
Backwards Incompatibilities
Finalizing the `group` versus `tag` split
We've now fully deprecated the use of `group` keys directly within a task's configuration file. In most cases, the appropriate key to use is now solely `tag`. See the v0.4.4 patch notes for more info on migration if you have a set of task YAMLs maintained outside the Eval Harness repository.
Handling of Causal vs. Seq2seq backend in HFLM
In HFLM, logic specific to handling inputs for Seq2seq (encoder-decoder models like T5) versus Causal (decoder-only autoregressive models, and the vast majority of current LMs) models previously hinged on a check for `self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM`. Some users may want to use causal model behavior, but set `self.AUTO_MODEL_CLASS` to a different factory class, such as `transformers.AutoModelForVision2Seq`.
As a result, users who subclass HFLM but do not call `HFLM.__init__()` may now also need to set the `self.backend` attribute to either `"causal"` or `"seq2seq"` during initialization themselves.
While this should not affect a large majority of users, for those who subclass HFLM in potentially advanced ways, see #2353 for the full set of changes.
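For illustration, here is a minimal sketch (not taken from the codebase; the subclass name and constructor arguments are hypothetical) of a wrapper that bypasses `HFLM.__init__()` and therefore must declare its backend itself:

```python
# Hypothetical sketch: a custom HFLM subclass that skips HFLM.__init__() and
# so must set self.backend ("causal" or "seq2seq") explicitly.
from lm_eval.models.huggingface import HFLM


class MyVision2SeqLM(HFLM):  # hypothetical subclass name
    def __init__(self, pretrained: str, **kwargs):
        # Deliberately NOT calling super().__init__(); custom setup goes here.
        # As of v0.4.5, input handling is selected via self.backend rather than
        # a check on self.AUTO_MODEL_CLASS, so declare it explicitly:
        self.backend = "causal"  # use "seq2seq" for encoder-decoder models
        # ... custom model / tokenizer initialization ...
```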
Future Plans
We intend to further expand our multimodal support to a wider set of vision-language tasks, as well as a broader set of model types, and are actively seeking user feedback!
Thanks, the LM Eval Harness team (@baberabb @haileyschoelkopf @lintangsutawika)
What's Changed
- Add Open Arabic LLM Leaderboard Benchmarks (Full and Light Version) by @Malikeh97 in #2232
- Multimodal prototyping by @lintangsutawika in #2243
- Update README.md by @SYusupov in #2297
- remove comma by @baberabb in #2315
- Update neuron backend by @dacorvo in #2314
- Fixed dummy model by @Am1n3e in #2339
- Add a note for missing dependencies by @eldarkurtic in #2336
- squad v2: load metric with `evaluate` by @baberabb in #2351
- fix writeout script by @baberabb in #2350
- Treat tags in python tasks the same as yaml tasks by @giuliolovisotto in #2288
- change group to tags in `eus_exams` task configs by @baberabb in #2320
- change glianorex to test split by @baberabb in #2332
- mmlu-pro: add newlines to task descriptions (not leaderboard) by @baberabb in #2334
- Added TurkishMMLU to LM Evaluation Harness by @ArdaYueksel in #2283
- add mmlu readme by @baberabb in #2282
- openai: better error messages; fix greedy matching by @baberabb in #2327
- fix some bugs of mmlu by @eyuansu62 in #2299
- Add new benchmark: Portuguese bench by @zxcvuser in #2156
- Fix missing key in custom task loading. by @giuliolovisotto in #2304
- Add new benchmark: Spanish bench by @zxcvuser in #2157
- Add new benchmark: Galician bench by @zxcvuser in #2155
- Add new benchmark: Basque bench by @zxcvuser in #2153
- Add new benchmark: Catalan bench by @zxcvuser in #2154
- fix tests by @baberabb in #2380
- Hotfix! by @baberabb in #2383
- Solution for CSAT-QA tasks evaluation by @KyujinHan in #2385
- LingOly - Fixing scoring bugs for smaller models by @am-bean in #2376
- Fix float limit override by @cjluo-omniml in #2325
- [API] tokenizer: add trust-remote-code by @baberabb in #2372
- HF: switch conditional checks to `self.backend` from `AUTO_MODEL_CLASS` by @baberabb in #2353
- max_images are passed on to vllm's `limit_mm_per_prompt` by @baberabb in #2387
- Fix Llava-1.5-hf; Update to version 0.4.5 by @haileyschoelkopf in #2388
- Bump version to v0.4.5 by @haileyschoelkopf in #2389
New Contributors
- @Malikeh97 made their first contribution in #2232
- @SYusupov made their first contribution in #2297
- @dacorvo made their first contribution in #2314
- @eldarkurtic made their first contribution in #2336
- @giuliolovisotto made their first contribution in #2288
- @ArdaYueksel made their first contribution in #2283
- @zxcvuser made their first contribution in #2156
- @KyujinHan made their first contribution in #2385
- @cjluo-omniml made their first contribution in #2325
Full Changelog: https://github.com/Eleu...
v0.4.4
lm-eval v0.4.4 Release Notes
New Additions
- This release includes the Open LLM Leaderboard 2 official task implementations! These can be run by using `--tasks leaderboard`. Thank you to the HF team (@clefourrier, @NathanHB, @KonradSzafer, @lozovskaya) for contributing these -- you can read more about their Open LLM Leaderboard 2 release here.
- API support is overhauled! Now: support for concurrent requests, chat templates, tokenization, batching and improved customization. This makes API support both more generalizable to new providers and should dramatically speed up API model inference.
  - The URL can be specified by passing `base_url` to `--model_args`, for example, `base_url=http://localhost:8000/v1/completions`; concurrent requests are controlled with the `num_concurrent` argument; tokenization is controlled with `tokenized_requests`.
  - Other arguments (such as top_p, top_k, etc.) can be passed to the API using `--gen_kwargs` as usual.
  - Note: Instruct-tuned models, not just base models, can be used with `local-completions` using `--apply_chat_template` (either with or without `tokenized_requests`).
    - They can also be used with `local-chat-completions` (e.g. with an OpenAI Chat API endpoint), but only the former supports loglikelihood tasks (e.g. multiple-choice). This is because ChatCompletion-style APIs generally do not provide access to logits on prompt/input tokens, preventing easy measurement of multi-token continuations' log probabilities.
  - Example with the OpenAI completions API (using vllm serve): `lm_eval --model local-completions --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,num_concurrent=10,tokenized_requests=True,tokenizer_backend=huggingface,max_length=4096 --apply_chat_template --batch_size 1 --tasks mmlu`
  - Example with the chat API: `lm_eval --model local-chat-completions --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,num_concurrent=10 --apply_chat_template --tasks gsm8k`
  - We recommend evaluating Llama-3.1-405B models by serving them with vllm and then running under `local-completions`!
- We've reworked the Task Grouping system to make it clearer when and when not to report an aggregated average score across multiple subtasks. See the Backwards Incompatibilities section below for more information on changes and migration instructions.
- A combination of data-parallel and model-parallel (using HF's `device_map` functionality for "naive" pipeline parallelism) inference using `--model hf` is now supported, thank you to @NathanHB and team!
Other new additions include a number of miscellaneous bugfixes and much more. Thank you to all contributors who helped out on this release!
New Tasks
A number of new tasks have been contributed to the library.
As a further discoverability improvement, `lm_eval --tasks list` now shows all tasks, tags, and groups in a prettier format, along with (if applicable) where to find the associated config file for a task or group! Thank you to @anthony-dipofi for working on this.
New tasks as of v0.4.4 include:
- Open LLM Leaderboard 2 tasks--see above!
- Inverse Scaling tasks, contributed by @h-albert-lee in #1589
- Unitxt tasks reworked by @elronbandel in #1933
- MMLU-SR, contributed by @SkySuperCat in #2032
- IrokoBench, contributed by @JessicaOjo @IsraelAbebe in #2042
- MedConceptQA, contributed by @Ofir408 in #2010
- MMLU Pro, contributed by @ysjprojects in #1961
- GSM-Plus, contributed by @ysjprojects in #2103
- Lingoly, contributed by @am-bean in #2198
- GSM8k and Asdiv settings matching the Llama 3.1 evaluation settings, contributed by @Cameron7195 in #2215 #2236
- TMLU, contributed by @adamlin120 in #2093
- Mela, contributed by @Geralt-Targaryen in #1970
Backwards Incompatibilities
`tag`s versus `group`s, and how to migrate
Previously, we supported the ability to group a set of tasks together, generally for two purposes: 1) to have an easy-to-call shortcut for a set of tasks one might want to frequently run simultaneously, and 2) to allow for "parent" tasks like `mmlu` to aggregate and report a unified score across a set of component "subtasks".
There were two ways to add a task to a given `group` name: 1) to provide (a list of) values to the `group` field in a given subtask's config file:
```yaml
# this is a *task* yaml file.
group: group_name1
task: my_task1
# rest of task config goes here...
```
or 2) to define a "group config file" and specify a group along with its constituent subtasks:
```yaml
# this is a group's yaml file
group: group_name1
task:
  - subtask_name1
  - subtask_name2
# ...
```
These would both have the same effect of reporting an averaged metric for `group_name1` when calling `lm_eval --tasks group_name1`. However, in use-case 1) (simply registering a shorthand for a list of tasks one is interested in), reporting an aggregate score can be undesirable or ill-defined.
We've now separated out these two use-cases ("shorthand" groupings and hierarchical subtask collections) into a `tag` and a `group` property, respectively!
To register a shorthand (now called a `tag`), simply change the `group` field name within your task's config to `tag` (`group_alias` keys will no longer be supported in task configs):
```yaml
# this is a *task* yaml file.
tag: tag_name1
task: my_task1
# rest of task config goes here...
```
Group config files may remain as is if aggregation is not desired. To opt-in to reporting aggregated scores across a group's subtasks, add the following to your group config file:
```yaml
# this is a group's yaml file
group: group_name1
task:
  - subtask_name1
  - subtask_name2
# ...
### New! Needed to turn on aggregation ###
aggregate_metric_list:
  - metric: acc # placeholder. Note that all subtasks in this group must report an `acc` metric key
    weight_by_size: True # whether to report *micro*- or *macro*-averaged scores across subtasks. Defaults to `True`.
```
Please see our documentation here for more information. We apologize for any headaches this migration may create--however, we believe separating out these two functionalities will make it less likely for users to encounter confusion or errors related to mistaken undesired aggregation.
Future Plans
We're planning to make more planning documents public and standardize on (likely) 1 new PyPI release per month! Stay tuned.
Thanks, the LM Eval Harness team (@haileyschoelkopf @lintangsutawika @baberabb)
What's Changed
- fix wandb logger module import in example by @ToluClassics in #2041
- Fix strip whitespace filter by @NathanHB in #2048
- Gemma-2 also needs default `add_bos_token=True` by @haileyschoelkopf in #2049
- Update `trust_remote_code` for Hellaswag by @haileyschoelkopf in #2029
- Adds Open LLM Leaderboard Tasks by @NathanHB in #2047
- #1442 inverse scaling tasks implementation by @h-albert-lee in #1589
- Fix TypeError in samplers.py by converting int to str by @uni2237 in #2074
- Group agg rework by @lintangsutawika in #1741
- Fix printout tests (N/A expected for stderrs) by @haileyschoelkopf in #2080
- Easier unitxt tasks loading and removal of unitxt library dependancy by @elronbandel in #1933
- Allow gating EvaluationTracker HF Hub results; customizability by @NathanHB in #2051
- Minor doc fix: leaderboard README.md missing mmlu-pro group and task by @pankajarm in #2075
- Revert missing utf-8 encoding for logged sample files (#2027) by @haileyschoelkopf in #2082
- Update utils.py by @lintangsutawika in #2085
- batch_size may be str if 'auto' is specified by @meg-huggingface in #2084
- Prettify lm_eval --tasks list by @anthony-dipofi in #1929
- Suppress noisy RougeScorer logs in `truthfulqa_gen` by @haileyschoelkopf in #2090
- Update default.yaml by @waneon in #2092
- Add new dataset MMLU-SR tasks by @SkySuperCat in #2032
- Irokobench: Benchmark Dataset for African languages by @JessicaOjo in #2042
- docs: remove trailing sentence from contribution doc by @nathan-weinberg in #2098
- Added MedConceptsQA Benchmark by @Ofir408 in #2010
- Also force BOS for `"recurrent_gemma"` and other Gemma model types by @haileyschoelkopf in #2105
- formatting by @lintangsutawika in #2104
- docs: align local test command to match CI by @nathan-weinberg in https://gith...
v0.4.3
lm-eval v0.4.3 Release Notes
We're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future.
New Additions
The big new feature is the often-requested Chat Templating, contributed by @KonradSzafer @clefourrier @NathanHB and also worked on by a number of other awesome contributors!
You can now run using a chat template with `--apply_chat_template` and a system prompt of your choosing using `--system_instruction "my sysprompt here"`. The `--fewshot_as_multiturn` flag can control whether each few-shot example in context is a new conversational turn or not.
This feature is currently only supported for the model types `hf` and `vllm`, but we intend to gather feedback on improvements and also extend this to other relevant models such as APIs.
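As a rough illustration of how these flags map onto the Python API, the sketch below uses `simple_evaluate`; it assumes the keyword arguments mirror the CLI flags (`apply_chat_template`, `system_instruction`, `fewshot_as_multiturn`), so verify against your installed version's signature before relying on it:

```python
import lm_eval

# Sketch only: assumes simple_evaluate exposes keyword arguments mirroring the
# CLI flags --apply_chat_template, --system_instruction and
# --fewshot_as_multiturn; the model and system prompt below are placeholders.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3-8B-Instruct",
    tasks=["gsm8k"],
    num_fewshot=5,
    apply_chat_template=True,
    system_instruction="You are a careful math tutor.",
    fewshot_as_multiturn=True,
)
```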
There's a lot more to check out, including:
- Logging results to the HF Hub if desired using `--hf_hub_log_args`, by @KonradSzafer and team!
- NeMo model support by @sergiopperez!
- Anthropic Chat API support by @tryuman!
- DeepSparse and SparseML model types by @mgoin!
- Handling of delta-weights in HF models, by @KonradSzafer!
- LoRA support for VLLM, by @bcicc!
- Fixes to PEFT modules which add new tokens to the embedding layers, by @mapmeld!
- Fixes to handling of BOS tokens in multiple-choice loglikelihood settings, by @djstrong!
- The use of custom `Sampler` subclasses in tasks, by @LSinev!
- The ability to specify "hardcoded" few-shot examples more cleanly, by @clefourrier!
- Support for Ascend NPUs (`--device npu`) by @statelesshz, @zhabuye, @jiaqiw09 and others!
- Logging of `higher_is_better` in results tables for clearer understanding of eval metrics by @zafstojano!
- Extra info logged about models, including info about tokenizers, chat templating, and more, by @artemorloff, @djstrong and others!
- Miscellaneous bug fixes! And many more great contributions we weren't able to list here.
New Tasks
We had a number of new tasks contributed. A listing of subfolders and a brief description of the tasks contained in them can now be found at `lm_eval/tasks/README.md`. Hopefully this will be a useful step to help users locate the definitions of relevant tasks more easily, by first visiting this page and then locating the appropriate README.md within a given `lm_eval/tasks` subfolder for further info on each task contained within a given folder. Thank you to @anthonydipofi @Harryalways317 @nairbv @sepiatone and others for working on this and giving feedback!
Without further ado, the tasks:
- ACLUE, a benchmark for Ancient Chinese understanding, by @haonan-li
- BasqueGlue and EusExams, two Basque-language tasks by @juletx
- TMMLU+, an evaluation for Traditional Chinese, contributed by @ZoneTwelve
- XNLIeu, a Basque version of XNLI, by @juletx
- Pile-10K, a perplexity eval taken from a subset of the Pile's validation set, contributed by @mukobi
- FDA, SWDE, and Squad-Completion zero-shot tasks by @simran-arora and team
- Added back the `hendrycks_math` task, the MATH task using the prompt and answer parsing from the original Hendrycks et al. MATH paper rather than Minerva's prompt and parsing
- COPAL-ID, a natively-Indonesian commonsense benchmark, contributed by @Erland366
- tinyBenchmarks variants of the Open LLM Leaderboard 1 tasks, by @LucWeber and team!
- Glianorex, a benchmark for testing performance on fictional medical questions, by @maximegmd
- New FLD (formal logic) task variants by @MorishT
- Improved translations of Lambada Multilingual tasks, added by @zafstojano
- NoticIA, a Spanish summarization dataset by @ikergarcia1996
- The Paloma perplexity benchmark, added by @zafstojano
- We've removed the AMMLU dataset due to concerns about auto-translation quality.
- Added the localized, not translated, ArabicMMLU dataset, contributed by @Yazeed7 !
- BertaQA, a Basque cultural knowledge benchmark, by @juletx
- New machine-translated ARC-C datasets by @jonabur !
- CommonsenseQA, in a prompt format following Llama, by @murphybrendan
- ...
Backwards Incompatibilities
The save format for logged results has now changed.
Output files will now be written to `{output_path}/{sanitized_model_name}/results_YYYY-MM-DDTHH-MM-SS.xxxxx.json` if `--output_path` is set, and `{output_path}/{sanitized_model_name}/samples_{task_name}_YYYY-MM-DDTHH-MM-SS.xxxxx.jsonl` for each task's samples if `--log_samples` is set.
E.g. `outputs/gpt2/results_2024-06-28T00-00-00.00001.json` and `outputs/gpt2/samples_lambada_openai_2024-06-28T00-00-00.00001.jsonl`.
See #1926 for utilities which may help to work with these new filenames.
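For example, a small helper (illustrative only, not part of the library) for locating the newest results file under the new layout might look like:

```python
# Illustrative helper (not part of lm-eval) for finding the newest results file
# written under the new {output_path}/{sanitized_model_name}/ layout.
from pathlib import Path


def latest_results_file(output_path: str, sanitized_model_name: str) -> Path:
    model_dir = Path(output_path) / sanitized_model_name
    # Filenames look like results_2024-06-28T00-00-00.00001.json; the timestamp
    # sorts lexicographically, so max() picks the most recent run.
    return max(model_dir.glob("results_*.json"))


print(latest_results_file("outputs", "gpt2"))
```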
Future Plans
In general, we'll be doing our best to keep up with the strong interest and large number of contributions we've seen coming in!
- The official Open LLM Leaderboard 2 tasks will be landing soon in the Eval Harness main branch and subsequently in `v0.4.4` on PyPI!
- The fact that `group`s of tasks by default attempt to report an aggregated score across constituent subtasks has been a sharp edge. We are finishing up some internal reworking to distinguish between `group`s of tasks that do report aggregate scores (think `mmlu`) versus `tag`s, which are simply a convenient shortcut to call a bunch of tasks one might want to run at once (think the `pythia` grouping, which merely represents a collection of tasks one might want to gather results on all at once but where averaging doesn't make sense).
- We'd also like to improve the API model support in the Eval Harness from its current state.
- More to come!
Thank you to everyone who's contributed to or used the library!
Thanks, @haileyschoelkopf @lintangsutawika
What's Changed
- use BOS token in loglikelihood by @djstrong in #1588
- Revert "Patch for Seq2Seq Model predictions" by @haileyschoelkopf in #1601
- fix gen_kwargs arg reading by @artemorloff in #1607
- fix until arg processing by @artemorloff in #1608
- Fixes to Loglikelihood prefix token / VLLM by @haileyschoelkopf in #1611
- Add ACLUE task by @haonan-li in #1614
- OpenAI Completions -- fix passing of unexpected 'until' arg by @haileyschoelkopf in #1612
- add logging of model args by @baberabb in #1619
- Add vLLM FAQs to README (#1625) by @haileyschoelkopf in #1633
- peft Version Assertion by @LameloBally in #1635
- Seq2seq fix by @lintangsutawika in #1604
- Integration of NeMo models into LM Evaluation Harness library by @sergiopperez in #1598
- Fix conditional import for Nemo LM class by @haileyschoelkopf in #1641
- Fix SuperGlue's ReCoRD task following regression in v0.4 refactoring by @orsharir in #1647
- Add Latxa paper evaluation tasks for Basque by @juletx in #1654
- Fix CLI --batch_size arg for openai-completions/local-completions by @mgoin in #1656
- Patch QQP prompt (#1648 ) by @haileyschoelkopf in #1661
- TMMLU+ implementation by @ZoneTwelve in #1394
- Anthropic Chat API by @tryumanshow in #1594
- correction bug #1664 by @nicho2 in #1670
- Signpost potential bugs / unsupported ops in MPS backend by @haileyschoelkopf in #1680
- Add delta weights model loading by @KonradSzafer in #1712
- Add `neuralmagic` models for `sparseml` and `deepsparse` by @mgoin in #1674
- Improvements to run NVIDIA NeMo models on LM Evaluation Harness by @sergiopperez in #1699
- Adding retries and rate limit to toxicity tasks by @sator-labs in #1620
- reference `--tasks list` in README by @nairbv in #1726
- Add XNLIeu: a dataset for cross-lingual NLI in Basque by @juletx in #1694
- Fix Parameter Propagation for Tasks that have `include` by @lintangsutawika in #1749
- Support individual scrolls datasets by @giorgossideris in #1740
- Add filter registry decorator by @lozhn in #1750
- remove duplicated `num_fewshot: 0` by @chujiezheng in #1769
- Pile 10k new task by @mukobi in #1758
- Fix m_arc choices by @jordane95 in #1760
- upload new tasks by @simran-arora in https://github.com/EleutherAI/lm-eva...
v0.4.2
lm-eval v0.4.2 Release Notes
We are releasing a new minor version of lm-eval for PyPI users! We've been very happy to see continued usage of the lm-evaluation-harness, including as a standard testbench to propel new architecture design (https://arxiv.org/abs/2402.18668), to ease new benchmark creation (https://arxiv.org/abs/2402.11548, https://arxiv.org/abs/2402.00786, https://arxiv.org/abs/2403.01469), enabling controlled experimentation on LLM evaluation (https://arxiv.org/abs/2402.01781), and more!
New Additions
- Request Caching by @inf3rnus - speedups on startup via caching the construction of documents/requests’ contexts
- Weights and Biases logging by @ayulockin - evals can now be logged to both WandB and Zeno!
- New Tasks
- KMMLU, a localized - not (auto) translated! - dataset for testing Korean knowledge by @h-albert-lee @guijinSON
- GPQA by @uanu2002
- French Bench by @ManuelFay
- EQ-Bench by @pbevan1 and @sqrkl
- HAERAE-Bench, readded by @h-albert-lee
- Updates to answer parsing on many generative tasks (GSM8k, MGSM, BBH zeroshot) by @thinknbtfly!
- Okapi (translated) Open LLM Leaderboard tasks by @uanu2002 and @giux78
- Arabic MMLU and aEXAMS by @khalil-Hennara
- And more!
- Re-introduction of `TemplateLM` base class for lower-code new LM class implementations by @anjor
- Run the library with metrics/scoring stage skipped via `--predict_only` by @baberabb
- Many more miscellaneous improvements by a lot of great contributors!
Backwards Incompatibilities
There were a few breaking changes to lm-eval's general API or logic we'd like to highlight:
`TaskManager` API
Previously, users had to call `lm_eval.tasks.initialize_tasks()` to register the library's default tasks, or `lm_eval.tasks.include_path()` to include a custom directory of task YAML configs.
Old usage:
```python
import lm_eval

lm_eval.tasks.initialize_tasks()
# or:
lm_eval.tasks.include_path("/path/to/my/custom/tasks")

lm_eval.simple_evaluate(model=lm, tasks=["arc_easy"])
```
New intended usage:
```python
import lm_eval

# optional--only need to instantiate separately if you want to pass a custom path!
task_manager = lm_eval.tasks.TaskManager()  # pass include_path="/path/to/my/custom/tasks" if desired

lm_eval.simple_evaluate(model=lm, tasks=["arc_easy"], task_manager=task_manager)
```
`get_task_dict()` now also optionally takes a TaskManager object, when wanting to load custom tasks.
This should allow for much faster library startup times due to lazily loading requested tasks or groups.
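For instance (a sketch; the custom task name and include path below are hypothetical), loading task objects for both built-in and custom tasks might look like:

```python
from lm_eval.tasks import TaskManager, get_task_dict

# Sketch: register custom task configs alongside the built-in ones, then build
# the task objects. "my_custom_task" and the include_path are placeholders.
task_manager = TaskManager(include_path="/path/to/my/custom/tasks")
task_dict = get_task_dict(["arc_easy", "my_custom_task"], task_manager)
```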
Updated Stderr Aggregation
Previous versions of the library reported erroneously large `stderr` scores for groups of tasks such as MMLU.
We've since updated the formula to correctly aggregate Standard Error scores for groups of tasks reporting accuracies aggregated via their mean across the dataset -- see #1390 #1427 for more information.
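For intuition only (this is the textbook pooled-variance formulation, not a transcription of the implementation; see #1390 and #1427 for the exact code): with subtask sizes $n_i$, per-subtask sample standard deviations $s_i$ (so each subtask's stderr is $s_i/\sqrt{n_i}$), and $N = \sum_i n_i$,

$$
s_p^2 = \frac{\sum_i (n_i - 1)\, s_i^2}{\sum_i (n_i - 1)}, \qquad \mathrm{SE}_{\text{group}} \approx \frac{s_p}{\sqrt{N}}
$$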
As always, please feel free to give us feedback or request new features! We're grateful for the community's support.
What's Changed
- Add support for RWKV models with World tokenizer by @PicoCreator in #1374
- add bypass metric by @baberabb in #1156
- Expand docs, update CITATION.bib by @haileyschoelkopf in #1227
- Hf: minor egde cases by @baberabb in #1380
- Enable override of printed `n-shot` in table by @haileyschoelkopf in #1379
- Faster Task and Group Loading, Allow Recursive Groups by @lintangsutawika in #1321
- Fix for #1383 by @pminervini in #1384
- fix on --task list by @lintangsutawika in #1387
- Support for Inf2 optimum class [WIP] by @michaelfeil in #1364
- Update README.md by @mycoalchen in #1398
- Fix confusing `write_out.py` instructions in README by @haileyschoelkopf in #1371
- Use Pooled rather than Combined Variance for calculating stderr of task groupings by @haileyschoelkopf in #1390
- adding hf_transfer by @michaelfeil in #1400
- `batch_size` with `auto` defaults to 1 if `No executable batch size found` is raised by @pminervini in #1405
- Fix printing bug in #1390 by @haileyschoelkopf in #1414
- Fixes #1416 by @pminervini in #1418
- Fix watchdog timeout by @JeevanBhoot in #1404
- Evaluate by @baberabb in #1385
- Add multilingual ARC task by @uanu2002 in #1419
- Add multilingual TruthfulQA task by @uanu2002 in #1420
- [m_mmul] added multilingual evaluation from alexandrainst/m_mmlu by @giux78 in #1358
- Added seeds to `evaluator.simple_evaluate` signature by @Am1n3e in #1412
- Fix: task weighting by subtask size; update Pooled Stderr formula slightly by @haileyschoelkopf in #1427
- Refactor utilities into a separate model utils file. by @baberabb in #1429
- Nit fix: Updated OpenBookQA Readme by @adavidho in #1430
- improve hf_transfer activation by @michaelfeil in #1438
- Correct typo in task name in ARC documentation by @larekrow in #1443
- update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) by @thnkinbtfly in #1356
- Add a new task HaeRae-Bench by @h-albert-lee in #1445
- Group reqs by context by @baberabb in #1425
- Add a new task GPQA (the part without CoT) by @uanu2002 in #1434
- Added KMMLU evaluation method and changed ReadMe by @h-albert-lee in #1447
- Add TemplateLM boilerplate LM class by @anjor in #1279
- Log which subtasks were called with which groups by @haileyschoelkopf in #1456
- PR fixing the issue #1391 (wrong contexts in the mgsm task) by @leocnj in #1440
- feat: Add Weights and Biases support by @ayulockin in #1339
- Fixed generation args issue affection OpenAI completion model by @Am1n3e in #1458
- update parsing logic of mgsm following gsm8k (mgsm en 0 -> 50%) by @thnkinbtfly in #1462
- Adding documentation for Weights and Biases CLI interface by @veekaybee in #1466
- Add environment and transformers version logging in results dump by @LSinev in #1464
- Apply code autoformatting with Ruff to tasks/*.py an *init.py by @LSinev in #1469
- Setting trust_remote_code to `True` for HuggingFace datasets compatibility by @veekaybee in #1467
- add arabic mmlu by @khalil-Hennara in #1402
- Add Gemma support (Add flag to control BOS token usage) by @haileyschoelkopf in #1465
- Revert "Setting trust_remote_code to
True
for HuggingFace datasets compatibility" by @haileyschoelkopf in #1474 - Create a means for caching task registration and request building. Ad… by @inf3rnus in #1372
- Cont metrics by @lintangsutawika in #1475
- Refactor `evaluater.evaluate` by @baberabb in #1441
- add multilingual mmlu eval by @jordane95 in #1484
- Update TruthfulQA val split name by @haileyschoelkopf in #1488
- Fix AttributeError in huggingface.py When 'model_type' is Missing by @richwardle in #1489
- Fix duplicated kwargs in some model init by @lchu-ibm in #1495
- Add multilingual truthfulqa targets by @jordane95 in #1499
- Always include EOS token as stop sequence by @haileyschoelkopf in...
v0.4.1
Release Notes
This release contains all changes so far since the release of v0.4.0, and is partially a test of our release automation, provided by @anjor.
At a high level, some of the changes include:
- Data-parallel inference using vLLM (contributed by @baberabb )
- A major fix to Huggingface model generation--previously, in v0.4.0, due to a bug with stop sequence handling, generations were sometimes cut off too early.
- Miscellaneous documentation updates
- A number of new tasks, and bugfixes to old tasks!
- Support for OpenAI-like API models using `local-completions` or `local-chat-completions` (thanks to @veekaybee @mgoin @anjor and others for this)!
- Integration with tools for visualization of results, such as with Zeno, and WandB coming soon!
More frequent (minor) version releases may be done in the future, to make it easier for PyPI users!
We're very pleased by the uptick in interest in LM Evaluation Harness recently, and we hope to continue to improve the library as time goes on. We're grateful to everyone who's contributed, and are excited by how many new contributors this version brings! If you have feedback for us, or would like to help out developing the library, please let us know.
In the next version release, we hope to include
- Chat Templating + System Prompt support, for locally-run models
- Improved Answer Extraction for many generative tasks, making them more easily run zero-shot and less dependent on model output formatting
- General speedups and QoL fixes to the non-inference portions of LM-Evaluation-Harness, including drastically reduced startup times / faster non-inference processing steps especially when num_fewshot is large!
- A new `TaskManager` object and the deprecation of `lm_eval.tasks.initialize_tasks()`, for easier registration of many tasks and configuration of new groups of tasks
What's Changed
- Announce v0.4.0 in README by @haileyschoelkopf in #1061
- remove commented planned samplers in `lm_eval/api/samplers.py` by @haileyschoelkopf in #1062
- Confirming links in docs work (WIP) by @haileyschoelkopf in #1065
- Set actual version to v0.4.0 by @haileyschoelkopf in #1064
- Updating docs hyperlinks by @haileyschoelkopf in #1066
- Fiddling with READMEs, Reenable CI tests on `main` by @haileyschoelkopf in #1063
- Update _cot_fewshot_template_yaml by @lintangsutawika in #1074
- Patch scrolls by @lintangsutawika in #1077
- Update template of qqp dataset by @shiweijiezero in #1097
- Change the sub-task name from sst to sst2 in glue by @shiweijiezero in #1099
- Add kmmlu evaluation to tasks by @h-albert-lee in #1089
- Fix stderr by @lintangsutawika in #1106
- Simplified `evaluator.py` by @lintangsutawika in #1104
- [Refactor] vllm data parallel by @baberabb in #1035
- Unpack group in `write_out` by @baberabb in #1113
- Revert "Simplified `evaluator.py`" by @lintangsutawika in #1116
- `qqp`, `mnli_mismatch`: remove unlabeled test sets by @baberabb in #1114
- fix: bug of BBH_cot_fewshot by @Momo-Tori in #1118
- Bump BBH version by @haileyschoelkopf in #1120
- Refactor `hf` modeling code by @haileyschoelkopf in #1096
- Additional process for doc_to_choice by @lintangsutawika in #1093
- doc_to_decontamination_query can use function by @lintangsutawika in #1082
- Fix vllm `batch_size` type by @xTayEx in #1128
- fix: passing max_length to vllm engine args by @NanoCode012 in #1124
- Fix Loading Local Dataset by @lintangsutawika in #1127
- place model onto `mps` by @baberabb in #1133
- Add benchmark FLD by @MorishT in #1122
- fix typo in README.md by @lennijusten in #1136
- add correct openai api key to README.md by @lennijusten in #1138
- Update Linter CI Job by @haileyschoelkopf in #1130
- add utils.clear_torch_cache() to model_comparator by @baberabb in #1142
- Enabling OpenAI completions via gooseai by @veekaybee in #1141
- vllm clean up tqdm by @baberabb in #1144
- openai nits by @baberabb in #1139
- Add IFEval / Instruction-Following Eval by @wiskojo in #1087
- set `--gen_kwargs` arg to None by @baberabb in #1145
- Add shorthand flags by @baberabb in #1149
- fld bugfix by @baberabb in #1150
- Remove GooseAI docs and change no-commit-to-branch precommit hook by @veekaybee in #1154
- Add docs on adding a multiple choice metric by @polm-stability in #1147
- Simplify evaluator by @lintangsutawika in #1126
- Generalize Qwen tokenizer fix by @haileyschoelkopf in #1146
- self.device in huggingface.py line 210 treated as torch.device but might be a string by @pminervini in #1172
- Fix Column Naming and Dataset Naming Conventions in K-MMLU Evaluation by @seungduk-yanolja in #1171
- feat: add option to upload results to Zeno by @Sparkier in #990
- Switch Linting to `ruff` by @baberabb in #1166
- Error in --num_fewshot option for K-MMLU Evaluation Harness by @guijinSON in #1178
- Implementing local OpenAI API-style chat completions on any given inference server by @veekaybee in #1174
- Update README.md by @anjor in #1184
- Update README.md by @anjor in #1183
- Add tokenizer backend by @anjor in #1186
- Correctly Print Task Versioning by @haileyschoelkopf in #1173
- update Zeno example and reference in README by @Sparkier in #1190
- Remove tokenizer for openai chat completions by @anjor in #1191
- Update README.md by @anjor in #1181
- disable `mypy` by @baberabb in #1193
- Generic decorator for handling rate limit errors by @zachschillaci27 in #1109
- Refer in README to main branch by @BramVanroy in #1200
- Hardcode 0-shot for fewshot Minerva Math tasks by @haileyschoelkopf in #1189
- Upstream Mamba Support (`mamba_ssm`) by @haileyschoelkopf in #1110
- Update cuda handling by @anjor in #1180
- Fix documentation in API table by @haileyschoelkopf in #1203
- Consolidate batching by @baberabb in #1197
- Add remove_whitespace to FLD benchmark by @MorishT in #1206
- Fix the argument order in `utils.divide` doc by @xTayEx in #1208
- [Fix #1211] pin vllm at < 0.2.6 by @haileyschoelkopf in #1212
- fix unbounded local variable by @onnoo in #1218
- nits + fix siqa by @baberabb in #1216
- add length of strings and answer options to Zeno met...
v0.4.0
What's Changed
- Replace stale `triviaqa` dataset link by @jon-tow in #364
- Update `actions/setup-python` in CI workflows by @jon-tow in #365
- Bump `triviaqa` version by @jon-tow in #366
- Update `lambada_openai` multilingual data source by @jon-tow in #370
- Update Pile Test/Val Download URLs by @fattorib in #373
- Added ToxiGen task by @Thartvigsen in #377
- Added CrowSPairs by @aflah02 in #379
- Add accuracy metric to crows-pairs by @haileyschoelkopf in #380
- hotfix(gpt2): Remove vocab-size logits slice by @jon-tow in #384
- Enable "low_cpu_mem_usage" to reduce the memory usage of HF models by @sxjscience in #390
- Upstream `hf-causal` and `hf-seq2seq` model implementations by @haileyschoelkopf in #381
- Hosting arithmetic dataset on HuggingFace by @fattorib in #391
- Hosting wikitext on HuggingFace by @fattorib in #396
- Change device parameter to cuda:0 to avoid runtime error by @Jeffwan in #403
- Update README installation instructions by @haileyschoelkopf in #407
- feat: evaluation using peft models with CLM by @zanussbaum in #414
- Update setup.py dependencies by @ret2libc in #416
- fix: add seq2seq peft by @zanussbaum in #418
- Add support for load_in_8bit and trust_remote_code model params by @philwee in #422
- Hotfix: patch issues with the `huggingface.py` model classes by @haileyschoelkopf in #427
- Continuing work on refactor [WIP] by @haileyschoelkopf in #425
- Document task name wildcard support in README by @haileyschoelkopf in #435
- Add non-programmatic BIG-bench-hard tasks by @yurodiviy in #406
- Updated handling for device in lm_eval/models/gpt2.py by @nikhilpinnaparaju in #447
- [WIP, Refactor] Staging more changes by @haileyschoelkopf in #465
- [Refactor, WIP] Multiple Choice + loglikelihood_rolling support for YAML tasks by @haileyschoelkopf in #467
- Configurable-Tasks by @lintangsutawika in #438
- single GPU automatic batching logic by @fattorib in #394
- Fix bugs introduced in #394 #406 and max length bug by @juletx in #472
- Sort task names to keep the same order always by @juletx in #474
- Set PAD token to EOS token by @nikhilpinnaparaju in #448
- [Refactor] Add decorator for registering YAMLs as tasks by @haileyschoelkopf in #486
- fix adaptive batch crash when there are no new requests by @jquesnelle in #490
- Add multilingual datasets (XCOPA, XStoryCloze, XWinograd, PAWS-X, XNLI, MGSM) by @juletx in #426
- Create output path directory if necessary by @janEbert in #483
- Add results of various models in json and md format by @juletx in #477
- Update config by @lintangsutawika in #501
- P3 prompt task by @lintangsutawika in #493
- Evaluation Against Portion of Benchmark Data by @kenhktsui in #480
- Add option to dump prompts and completions to a JSON file by @juletx in #492
- Add perplexity task on arbitrary JSON data by @janEbert in #481
- Update config by @lintangsutawika in #520
- Data Parallelism by @fattorib in #488
- Fix mgpt fewshot by @lintangsutawika in #522
- Extend `dtype` command line flag to `HFLM` by @haileyschoelkopf in #523
- Add support for loading GPTQ models via AutoGPTQ by @gakada in #519
- Change type signature of `quantized` and its default value for python < 3.11 compatibility by @passaglia in #532
- Fix LLaMA tokenization issue by @gakada in #531
- [Refactor] Make promptsource an extra / not required for installation by @haileyschoelkopf in #542
- Move spaces from context to continuation by @gakada in #546
- Use max_length in AutoSeq2SeqLM by @gakada in #551
- Fix typo by @kwikiel in #557
- Add load_in_4bit and fix peft loading by @gakada in #556
- Update task_guide.md by @haileyschoelkopf in #564
- [Refactor] Non-greedy generation ; WIP GSM8k yaml by @haileyschoelkopf in #559
- Dataset metric log [WIP] by @lintangsutawika in #560
- Add Anthropic support by @zphang in #562
- Add MultipleChoiceExactTask by @gakada in #537
- Revert "Add MultipleChoiceExactTask" by @StellaAthena in #568
- [Refactor] [WIP] New YAML advanced docs by @haileyschoelkopf in #567
- Remove the registration of "GPT2" as a model type by @StellaAthena in #574
- [Refactor] Docs update by @haileyschoelkopf in #577
- Better docs by @lintangsutawika in #576
- Update evaluator.py cache_db argument str if model is not str by @poedator in #575
- Add --max_batch_size and --batch_size auto:N by @gakada in #572
- [Refactor] ALL_TASKS now maintained (not static) by @haileyschoelkopf in #581
- Fix seqlen issues for bloom, remove extraneous OPT tokenizer check by @haileyschoelkopf in #582
- Fix non-callable attributes in CachingLM by @gakada in #584
- Add error handling for calling `.to(device)` by @haileyschoelkopf in #585
- fixes some minor issues on tasks. by @lintangsutawika in #580
- Add - 4bit-related args by @SONG-WONHO in #579
- Fix triviaqa task by @seopbo in #525
- [Refactor] Addressing Feedback on new docs pages by @haileyschoelkopf in #578
- Logging Samples by @farzanehnakhaee70 in #563
- Merge master into big-refactor by @gakada in #590
- [Refactor] Package YAMLs alongside pip installations of lm-eval by @haileyschoelkopf in #596
- fixes for multiple_choice by @lintangsutawika in #598
- add openbookqa config by @farzanehnakhaee70 in #600
- [Refactor] Model guide docs by @haileyschoelkopf in #606
- [Refactor] More MCQA fixes by @haileyschoelkopf in #599
- [Refactor] Hellaswag by @nopperl in #608
- [Refactor] Seq2Seq Models with Multi-Device Support ...
v0.3.0
HuggingFace Datasets Integration
This release integrates HuggingFace `datasets` as the core dataset management interface, removing previous custom downloaders.
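In practice this means tasks pull their data through the `datasets` library rather than bespoke download scripts; a rough illustration (the dataset name and config are just an example, not a prescribed task setup):

```python
# Illustrative only: tasks now source data via the HuggingFace datasets library.
from datasets import load_dataset

docs = load_dataset("trivia_qa", "rc.nocontext", split="validation")
print(docs[0]["question"])
```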
What's Changed
- Refactor `Task` downloading to use `HuggingFace.datasets` by @jon-tow in #300
- Add templates and update docs by @jon-tow in #308
- Add dataset features to `TriviaQA` by @jon-tow in #305
- Add `SWAG` by @jon-tow in #306
- Fixes for using lm_eval as a library by @dirkgr in #309
- Researcher2 by @researcher2 in #261
- Suggested updates for the task guide by @StephenHogg in #301
- Add pre-commit by @Mistobaan in #317
- Decontam import fix by @jon-tow in #321
- Add bootstrap_iters kwarg by @Muennighoff in #322
- Update decontamination.md by @researcher2 in #331
- Fix key access in squad evaluation metrics by @konstantinschulz in #333
- Fix make_disjoint_window for tail case by @richhankins in #336
- Manually concat tokenizer revision with subfolder by @jon-tow in #343
- [deps] Use minimum versioning for `numexpr` by @jon-tow in #352
- Remove custom datasets that are in HF by @jon-tow in #330
- Add `TextSynth` API by @jon-tow in #299
- Add the original `LAMBADA` dataset by @jon-tow in #357
New Contributors
- @dirkgr made their first contribution in #309
- @Mistobaan made their first contribution in #317
- @konstantinschulz made their first contribution in #333
- @richhankins made their first contribution in #336
Full Changelog: v0.2.0...v0.3.0
v0.2.0
Major changes since 0.1.0:
- added blimp (#237)
- added qasper (#264)
- added asdiv (#244)
- added truthfulqa (#219)
- added gsm (#260)
- implemented description dict and deprecated provide_description (#226)
- new `--check_integrity` flag to run integrity unit tests at eval time (#290)
- positional arguments to `evaluate` and `simple_evaluate` are now deprecated
- `_CITATION` attribute on task modules (#292)
- lots of bug fixes and task fixes (always remember to report task versions for comparability!)