INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
Abstract
The performance differential of large language models (LLMs) across languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (i.e., multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other than English. Moreover, current practices in multilingual benchmark construction often translate English resources, ignoring the regional and cultural knowledge of the environments in which multilingual systems would be used. In this work, we construct an evaluation suite of 197,243 QA pairs from local exam sources to measure the capabilities of multilingual LLMs in a variety of regional contexts. Our novel resource, INCLUDE, is a comprehensive knowledge- and reasoning-centric benchmark across 44 written languages that evaluates multilingual LLMs on the actual language environments in which they would be deployed.
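To illustrate how a knowledge-centric multiple-choice suite like this is typically consumed, below is a minimal evaluation sketch: pick, for each QA pair, the answer option to which the model assigns the highest log-likelihood. The model name, the question/options/answer schema, and the example pair are illustrative assumptions for the sketch, not the paper's prescribed evaluation protocol or the dataset's confirmed format.

```python
# Minimal sketch: multiple-choice scoring by answer-option log-likelihood.
# The model name, QA schema, and example pair are assumptions, not taken
# from the INCLUDE release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: small stand-in; a multilingual LLM would be used in practice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Sum of token log-probabilities of `option` conditioned on the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Next-token log-probabilities for every position except the last.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    # Score only the tokens belonging to the answer option (approximate split).
    answer_len = full_ids.shape[1] - prompt_ids.shape[1]
    scores = log_probs[-answer_len:].gather(1, targets[-answer_len:].unsqueeze(1))
    return scores.sum().item()

def predict(question: str, options: list[str]) -> int:
    """Return the index of the option with the highest log-likelihood."""
    return max(range(len(options)), key=lambda i: option_logprob(question, options[i]))

# Hypothetical local-exam style QA pair (not from the dataset).
example = {
    "question": "Which river flows through Vienna?",
    "options": ["Danube", "Rhine", "Seine", "Elbe"],
    "answer": 0,
}
pred = predict(example["question"], example["options"])
print("correct" if pred == example["answer"] else "incorrect")
```

Benchmark-level accuracy is then simply the fraction of pairs predicted correctly; the paper's own setup may differ in prompting and answer-extraction details.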
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment (2024)
- ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding (2024)
- MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment (2024)
- Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs (2024)
- Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language (2024)
- MILU: A Multi-task Indic Language Understanding Benchmark (2024)
- Representing the Under-Represented: Cultural and Core Capability Benchmarks for Developing Thai Large Language Models (2024)