ca16 commented Aug 20, 2025

Related to the relevant Slack conversation.

Note: there are a couple of other things we want to do in this area:

  1. For a specific submission, use a different name than what this mapping would produce.
  2. Consolidate model name display adjustments; this is probably mostly about the logic at https://github.com/allenai/asta-bench-leaderboard/blob/1e64d2b6fd0cd9f089c32a652cc6b8df1c3d7cb0/leaderboard_transformer.py#L705.

This PR doesn't do either of those; I'll handle them separately.

Testing done:

With this change, some relevant entries after running lb view:

                              id                          Agent                                                                          Agent (with models)                                                                                                                          Agent description User/organization Submission date                                                                                                                    Logs                                                           Source                      Openness    Agent tooling                                                                  LLM base  Overall  Overall cost  lit score  lit cost  data score  data cost  code score  code cost  discovery score  discovery cost  Overall frontier  lit frontier  data frontier  code frontier  discovery frontier
...
2025-08-14 18:53:46.954659+00:00               Smolagents Coder                                                           Smolagents Coder (GPT-5 (2025-08))       HuggingFace Smolagents CodeAgent (v1.17.0). This variant uses OpenAI GPT-5 with default reasoning-effort (medium) as the base model.               Ai2      2025-08-14              hf://datasets/allenai/asta-bench-submissions/1.0.0-dev1/test/miked-ai_Smolagents-GPT-5_2025-08-14T18-53-46               https://github.com/allenai/asta-bench/tree/038c7bf  Open source & closed weights Custom interface                                                         [GPT-5 (2025-08)] 0.371127      0.127240   0.442604  0.117891    0.267289   0.077001    0.309304   0.095848         0.465313        0.218221              True         False          False           True                True
...
2025-08-14 18:53:01.026721+00:00               Smolagents Coder                                                      Smolagents Coder (GPT-5 Mini (2025-08))  HuggingFace Smolagents CodeAgent (v1.17.0). This variant uses OpenAI GPT-5-mini with default reasoning-effort (medium) as the base model.               Ai2      2025-08-14         hf://datasets/allenai/asta-bench-submissions/1.0.0-dev1/test/miked-ai_Smolagents-GPT-5-mini_2025-08-14T18-53-01               https://github.com/allenai/asta-bench/tree/038c7bf  Open source & closed weights Custom interface                                                    [GPT-5 Mini (2025-08)] 0.282295      0.063102   0.350159  0.015214    0.276804   0.071096    0.282674   0.090240         0.219545        0.075860              True          True          False           True               False

Without this change:

                              id                          Agent                                                                          Agent (with models)                                                                                                                          Agent description User/organization Submission date                                                                                                                    Logs                                                           Source                      Openness    Agent tooling                                                                  LLM base  Overall  Overall cost  lit score  lit cost  data score  data cost  code score  code cost  discovery score  discovery cost  Overall frontier  lit frontier  data frontier  code frontier  discovery frontier
...
2025-08-14 18:53:46.954659+00:00               Smolagents Coder                                                          Smolagents Coder (gpt-5-2025-08-07)       HuggingFace Smolagents CodeAgent (v1.17.0). This variant uses OpenAI GPT-5 with default reasoning-effort (medium) as the base model.               Ai2      2025-08-14              hf://datasets/allenai/asta-bench-submissions/1.0.0-dev1/test/miked-ai_Smolagents-GPT-5_2025-08-14T18-53-46               https://github.com/allenai/asta-bench/tree/038c7bf  Open source & closed weights Custom interface                                                        [gpt-5-2025-08-07] 0.371127      0.127240   0.442604  0.117891    0.267289   0.077001    0.309304   0.095848         0.465313        0.218221              True         False          False           True                True
...
2025-08-14 18:53:01.026721+00:00               Smolagents Coder                                                     Smolagents Coder (gpt-5-mini-2025-08-07)  HuggingFace Smolagents CodeAgent (v1.17.0). This variant uses OpenAI GPT-5-mini with default reasoning-effort (medium) as the base model.               Ai2      2025-08-14         hf://datasets/allenai/asta-bench-submissions/1.0.0-dev1/test/miked-ai_Smolagents-GPT-5-mini_2025-08-14T18-53-01               https://github.com/allenai/asta-bench/tree/038c7bf  Open source & closed weights Custom interface                                                   [gpt-5-mini-2025-08-07] 0.282295      0.063102   0.350159  0.015214    0.276804   0.071096    0.282674   0.090240         0.219545        0.075860              True          True          False           True               False
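
For readers who want the gist of the renaming visible in the two outputs above, here is a minimal sketch of a lookup with a fallback. This is not the code from the PR: the names `MODEL_DISPLAY_NAMES`, `display_name`, and `agent_with_models` are hypothetical, only the two mapping entries are taken from the output above, and the comma-joining of multiple models is an assumption.

```python
# Hypothetical sketch, not the PR's implementation.
# Only these two entries are taken from the `lb view` output above.
MODEL_DISPLAY_NAMES = {
    "gpt-5-2025-08-07": "GPT-5 (2025-08)",
    "gpt-5-mini-2025-08-07": "GPT-5 Mini (2025-08)",
}


def display_name(raw_model_name: str) -> str:
    """Return the display name for a raw model identifier.

    Unknown identifiers fall through unchanged, so a model missing
    from the mapping still shows up under its raw name.
    """
    return MODEL_DISPLAY_NAMES.get(raw_model_name, raw_model_name)


def agent_with_models(agent: str, raw_model_names: list[str]) -> str:
    """Build an "Agent (with models)"-style label.

    Joining multiple models with ", " is an assumption; the output
    above only shows single-model agents.
    """
    names = ", ".join(display_name(m) for m in raw_model_names)
    return f"{agent} ({names})"


if __name__ == "__main__":
    print(agent_with_models("Smolagents Coder", ["gpt-5-2025-08-07"]))
    print(agent_with_models("Smolagents Coder", ["gpt-5-mini-2025-08-07"]))
```

Falling back to the raw identifier keeps entries for unmapped models visible instead of failing or showing a blank label.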

Relevant bits:
image

ca16 requested a review from jbragg August 20, 2025 15:12
ca16 merged commit 5d7da67 into main Aug 20, 2025
4 checks passed
ca16 deleted the chloea-simple-llm-mapping-name branch August 20, 2025 17:03
ca16 commented Aug 20, 2025

Published new library version.
