- [2024/06] LiveBench: A Challenging, Contamination-Free LLM Benchmark
- [2024/06] VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation
- [2024/06] Benchmark Data Contamination of Large Language Models: A Survey
- [2024/06] DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning
- [2024/04] Benchmarking Benchmark Leakage in Large Language Models
- [2024/03] Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models
- [2024/03] Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs
- [2024/02] Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs
- [2024/01] KoLA: Carefully Benchmarking World Knowledge of Large Language Models
- [2023/09] Proving Test Set Contamination in Black Box Language Models
- [2023/09] Time Travel in LLMs: Tracing Data Contamination in Large Language Models
- [2023/09] To the Cutoff... and Beyond? A Longitudinal Perspective on LLM Data Contamination
- [2023/09] DyVal: Graph-informed Dynamic Evaluation of Large Language Models