Conversation

ca16 (Collaborator) commented Aug 21, 2025

Related to https://github.com/allenai/astabench-issues/issues/391.

We started with #63, but that won't work because it results in dependencies that aren't satisfiable in asta-bench. I think this fixes it though.

The idea builds on what was introduced in #63. We want to avoid automatically pulling in new cost info on the fly, so the code still expects LITELLM_LOCAL_MODEL_COST_MAP to be set to true. This means we'll look at the local cost file.

The local cost file can change depending on the version of litellm we have. #63 pinned the version of litellm so that we would always be using the same version of the local cost file. But we can't do that because of asta-bench's requirements (specifically, one of the dependencies wants a lower version of litellm). This PR tries to address that by doing two things:

  • Use register_model() with a specific version of the local cost file (the one shipped with litellm 1.75.8). This means litellm will effectively merge whatever local cost file is installed with the 1.75.8 version of the file, with the 1.75.8 info winning when there are conflicts. The idea is to get cost information up to what 1.75.8 has (the merging process is a little more complicated in practice; see the litellm.model_cost diff notes further down, and the sketch after this list).
    • This does assume the following situation won't happen: a model we want to score only has cost info in a local cost file corresponding to an older version of litellm.
  • Loosen the litellm version requirement to litellm<=1.75.8 - this should make the dependencies satisfiable for asta-bench, and also prevent us from pulling in cost info for models not represented in 1.75.8's local cost file.
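
A minimal sketch of what that setup looks like in code (this isn't the exact agent-eval code; the URL for the pinned 1.75.8 cost file is an assumption about where it's hosted):

# Minimal sketch of the idea above, assuming the pinned cost file is litellm's
# model_prices_and_context_window.json at the v1.75.8 tag (URL is an assumption;
# the exact file agent-eval points at may differ).
import os

# Must be set before importing litellm so it reads its bundled local cost file
# instead of fetching the latest remote map.
os.environ["LITELLM_LOCAL_MODEL_COST_MAP"] = "True"

import litellm

PINNED_COST_MAP_URL = (
    "https://raw.githubusercontent.com/BerriAI/litellm/v1.75.8/"
    "model_prices_and_context_window.json"
)

# register_model() accepts a dict or a URL to a JSON cost map and merges its
# entries into litellm.model_cost, with the registered entries winning on conflict.
litellm.register_model(model_cost=PINNED_COST_MAP_URL)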

Testing done:
I tried scoring hf://allenai/asta-bench-internal-submissions/1.0.0-dev1/test/miked-ai_ReAct-GPT-5-mini_2025-08-11T16-35-18 with litellm versions 1.75.8 and 1.75.0. (1.75.0 is, I think, from before costs were introduced for gpt-5, which that submission uses; testing in #63 established that using just the local cost file from 1.75.0 results in null costs.)

With litellm version 1.75.8:

Model costs hash 440711902e484e4eec0b5c3409d3b953fb2164338d84269ac4ecf54694b30aeb.
litellm version: 1.75.8

Summary stats:

  "stats": {
    "overall": {
      "score": 0.3148783028247495,
      "score_stderr": null,
      "cost": 0.03517972139633324,
      "cost_stderr": null
    },
    "tag/lit": {
      "score": 0.359073892718989,
      "score_stderr": null,
      "cost": 0.0432097658085206,
      "cost_stderr": null
    },
    "tag/data": {
      "score": 0.2692091729748633,
      "score_stderr": null,
      "cost": 0.011127473221757321,
      "cost_stderr": null
    },
    "tag/code": {
      "score": 0.5052960102960103,
      "score_stderr": null,
      "cost": 0.05169765155505505,
      "cost_stderr": null
    },
    "tag/discovery": {
      "score": 0.12593413530913533,
      "score_stderr": null,
      "cost": 0.034683994999999995,
      "cost_stderr": null
    },
    "task/paper_finder_test": {
      "score": 0.1680882137565247,
      "score_stderr": 0.01736167991412297,
      "cost": 0.0384160127340824,
      "cost_stderr": 0.004496229162562603
    },
    "task/paper_finder_litqa2_test": {
      "score": 0.6133333333333333,
      "score_stderr": 0.05661099544085763,
      "cost": 0.11436817,
      "cost_stderr": 0.017359728699967197
    },
    "task/sqa_test": {
      "score": 0.2672615497644688,
      "score_stderr": 0.03773623587370571,
      "cost": 0.0270534035,
      "cost_stderr": 0.0019422575674411598
    },
    "task/arxivdigestables_test": {
      "score": 0.3209458073549626,
      "score_stderr": 0.016795287448771803,
      "cost": 0.012757592,
      "cost_stderr": 0.000598238094979927
    },
    "task/litqa2_test": {
      "score": 0.7466666666666667,
      "score_stderr": 0.05055844297598726,
      "cost": 0.07485594,
      "cost_stderr": 0.014531272631798089
    },
    "task/discoverybench_test": {
      "score": 0.2692091729748633,
      "score_stderr": 0.024402794451474107,
      "cost": 0.011127473221757321,
      "cost_stderr": 0.0006222390294466313
    },
    "task/core_bench_test": {
      "score": 0.4594594594594595,
      "score_stderr": 0.08305895907471071,
      "cost": 0.04720825405405405,
      "cost_stderr": 0.007319669323511871
    },
    "task/ds1000_test": {
      "score": 0.71,
      "score_stderr": 0.015133811749341808,
      "cost": 0.002989897277777778,
      "cost_stderr": 0.0000549948574693956
    },
    "task/e2e_discovery_test": {
      "score": 0.09482323232323234,
      "score_stderr": 0.03868709687867958,
      "cost": 0.02972574375,
      "cost_stderr": 0.0029628950153995407
    },
    "task/e2e_discovery_hard_test": {
      "score": 0.1570450382950383,
      "score_stderr": 0.04217558424596078,
      "cost": 0.03964224625,
      "cost_stderr": 0.004099116765299048
    },
    "task/super_test": {
      "score": 0.3464285714285715,
      "score_stderr": 0.0673848147211148,
      "cost": 0.10489480333333333,
      "cost_stderr": 0.023534508573437658
    }
  }
}

With litellm version 1.75.0:

Model costs hash 58077933fa776c069575d1dae827ad3f8422739cf26c7f65d97be342e82a06d4.
litellm version: 1.75.0

Summary stats:

    "overall": {
      "score": 0.3148783028247495,
      "score_stderr": null,
      "cost": 0.03517972139633324,
      "cost_stderr": null
    },
    "tag/lit": {
      "score": 0.359073892718989,
      "score_stderr": null,
      "cost": 0.0432097658085206,
      "cost_stderr": null
    },
    "tag/data": {
      "score": 0.2692091729748633,
      "score_stderr": null,
      "cost": 0.011127473221757321,
      "cost_stderr": null
    },
    "tag/code": {
      "score": 0.5052960102960103,
      "score_stderr": null,
      "cost": 0.05169765155505505,
      "cost_stderr": null
    },
    "tag/discovery": {
      "score": 0.12593413530913533,
      "score_stderr": null,
      "cost": 0.034683994999999995,
      "cost_stderr": null
    },
    "task/paper_finder_test": {
      "score": 0.1680882137565247,
      "score_stderr": 0.01736167991412297,
      "cost": 0.0384160127340824,
      "cost_stderr": 0.004496229162562603
    },
    "task/paper_finder_litqa2_test": {
      "score": 0.6133333333333333,
      "score_stderr": 0.05661099544085763,
      "cost": 0.11436817,
      "cost_stderr": 0.017359728699967197
    },
    "task/sqa_test": {
      "score": 0.2672615497644688,
      "score_stderr": 0.03773623587370571,
      "cost": 0.0270534035,
      "cost_stderr": 0.0019422575674411598
    },
    "task/arxivdigestables_test": {
      "score": 0.3209458073549626,
      "score_stderr": 0.016795287448771803,
      "cost": 0.012757592,
      "cost_stderr": 0.000598238094979927
    },
    "task/litqa2_test": {
      "score": 0.7466666666666667,
      "score_stderr": 0.05055844297598726,
      "cost": 0.07485594,
      "cost_stderr": 0.014531272631798089
    },
    "task/discoverybench_test": {
      "score": 0.2692091729748633,
      "score_stderr": 0.024402794451474107,
      "cost": 0.011127473221757321,
      "cost_stderr": 0.0006222390294466313
    },
    "task/core_bench_test": {
      "score": 0.4594594594594595,
      "score_stderr": 0.08305895907471071,
      "cost": 0.04720825405405405,
      "cost_stderr": 0.007319669323511871
    },
    "task/ds1000_test": {
      "score": 0.71,
      "score_stderr": 0.015133811749341808,
      "cost": 0.002989897277777778,
      "cost_stderr": 0.0000549948574693956
    },
    "task/e2e_discovery_test": {
      "score": 0.09482323232323234,
      "score_stderr": 0.03868709687867958,
      "cost": 0.02972574375,
      "cost_stderr": 0.0029628950153995407
    },
    "task/e2e_discovery_hard_test": {
      "score": 0.1570450382950383,
      "score_stderr": 0.04217558424596078,
      "cost": 0.03964224625,
      "cost_stderr": 0.004099116765299048
    },
    "task/super_test": {
      "score": 0.3464285714285715,
      "score_stderr": 0.0673848147211148,
      "cost": 0.10489480333333333,
      "cost_stderr": 0.023534508573437658
    }
  }
}

I also looked at litellm.model_cost in both cases (roughly along the lines of the sketch after the list below)... It's not the same:

  • Some of the differences are fields explicitly listed with None values. My best guess, from reading the register_model() code, is that this happens when both the local cost file and the cost file we're pointing at have an entry for the same key: we start with the dict form of ModelInfo for the local cost file entry (which I think is where the fields with null values get introduced), and update it with everything in the entry from the cost file we're pointing at. I think this is probably fine.
  • supported_openai_params is sometimes listed. Here too, my best guess is that this happens when both the local cost file and the cost file we're pointing at have the same key. I don't see supported_openai_params in either local cost file, but I think it gets dumped when we start from the dict form of ModelInfo for the local cost file entry when both places have an entry for the same key. Probably okay too, assuming these don't really change?
  • Some entries get a new 'key' field. I think this is a similar situation to the other two: the field exists in ModelInfo and so gets pulled in when entries from both places share a key, but it isn't explicitly listed in the local cost files.
  • For "vertex_ai/claude-opus-4", we drop input_cost_per_token_batches and output_cost_per_token_batches. In the merged model costs for 1.75.0 these have values input_cost_per_token_batches=7.5e-06 and output_cost_per_token_batches=3.75e-05; they have null values in the merged model costs for 1.75.8, and they don't appear at all in the local files for either 1.75.0 or 1.75.8. Not sure what's going on there...
  • vertex_ai/claude-opus-4-1 has an entry in the merged costs for 1.75.8 but not in the merged costs for 1.75.0. A little context: when two model cost dicts get merged, the dict key isn't always what determines which entries get merged together, see this... The local file for 1.75.0 has nothing for vertex_ai/claude-opus-4-1, while the local file for 1.75.8 has both vertex_ai/claude-opus-4-1 and vertex_ai/claude-opus-4-1@20250805. What maybe happened here is that in 1.75.0, vertex_ai/claude-opus-4-1 and vertex_ai/claude-opus-4-1@20250805 both got mapped to the same key, and the second one won out.
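
For reference, the litellm.model_cost comparison above can be reproduced with something like the sketch below (not part of this PR; file names are illustrative):

# Rough sketch of how the litellm.model_cost comparison was done: dump the merged
# cost map in each environment, then diff the two dumps.
import json

import litellm

# Run this part once per environment (e.g. once with litellm 1.75.0, once with 1.75.8).
with open("model_cost_merged.json", "w") as f:
    json.dump(litellm.model_cost, f, indent=2, sort_keys=True, default=str)


def diff_cost_maps(path_a: str, path_b: str) -> None:
    """Print keys and fields that differ between two dumped cost maps."""
    with open(path_a) as fa, open(path_b) as fb:
        map_a, map_b = json.load(fa), json.load(fb)
    print("only in a:", sorted(set(map_a) - set(map_b)))
    print("only in b:", sorted(set(map_b) - set(map_a)))
    for key in sorted(set(map_a) & set(map_b)):
        if map_a[key] != map_b[key]:
            fields = {
                f
                for f in set(map_a[key]) | set(map_b[key])
                if map_a[key].get(f) != map_b[key].get(f)
            }
            print(f"{key}: differing fields: {sorted(fields)}")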

So basically... I think this is imperfect, but I'm not sure we can do much better, given that going through the merging process seems unavoidable unless you either allow using the remote file, or pin litellm and don't allow using the remote file (neither of which works for us currently). So I think the best we can do right now is something like this, plus printing out the litellm version the code is running with. Between that and the agent-eval commit, we'll have the litellm version and the cost file we're pointing at for register_model, which should be enough to reconstruct the cost file used (though not necessarily enough to automatically tell us whether two results with different values here are still comparable cost-wise). Ideally we'd save the litellm version somewhere alongside the scores, but maybe we can deal with that in the next iteration...
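
A minimal sketch of printing that provenance info (how agent-eval actually computes the "Model costs hash" shown above may differ, e.g. it may hash the raw cost file rather than the merged map):

# Hedged sketch of recording provenance: the installed litellm version plus a
# fingerprint of the merged cost map.
import hashlib
import json
from importlib.metadata import version

import litellm


def model_cost_fingerprint() -> str:
    # Stable hash over the merged cost map currently in memory.
    payload = json.dumps(litellm.model_cost, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


print(f"litellm version: {version('litellm')}")
print(f"Model costs hash {model_cost_fingerprint()}.")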

pyproject.toml (Outdated)
  # pin litellm so that we know what model costs we're using
  # see https://github.com/allenai/astabench-issues/issues/391 before changing
- "litellm==1.75.8",
+ "litellm<=1.75.8",
ca16 (Collaborator, Author) Aug 21, 2025

Don't go above this, to avoid silently supporting new models (see this scenario).

mdarcy220 (Contributor)

Does the dep conflict in astabench come from the sqa deps? If so, the fundamental issue is that we're picking up their pinned version and will likely encounter errors in the future whenever they update the version in their lib.

Not sure what to do about it... but I do think for now we could pin to ==1.68.0 to satisfy the resolver (and still be sure of what version we're getting; we don't want 0.2.0, for example).

@ca16 ca16 requested a review from mdarcy220 August 21, 2025 15:04

ca16 (Collaborator, Author) commented Aug 21, 2025

Double checking I'm following - you mean in addition to what's in this PR, pinning to 1.68.0?

mdarcy220 (Contributor) commented Aug 21, 2025

> Double checking I'm following - you mean in addition to what's in this PR, pinning to 1.68.0?

that's what I was thinking; I suppose either way could work but we'd probably lose the costs for gpt-5 if we downgraded with the original solution, so the other changes in this PR seem good to have regardless

ca16 (Collaborator, Author) commented Aug 21, 2025

> that's what I was thinking; I suppose either way could work but we'd probably lose the costs for gpt-5 if we downgraded with the original solution, so the other changes in this PR seem good to have regardless

Cool, yeah costs for gpt-5 are the main reason I also think we should keep what's in here.

jbragg (Collaborator) commented Aug 21, 2025

My concern with pinning is that it will create problems for any agents that depend on a specific litellm version. Ideally the pinned version is only required for scoring.

mdarcy220 (Contributor)

> My concern with pinning is that it will create problems for any agents that depend on a specific litellm version. Ideally the pinned version is only required for scoring.

I agree but would also note that this is already a problem; we established that we can't easily allow higher versions than the one we want to freeze to, so if sqa ever bumps their version it will break us. I think we should aim to resolve this in the long term but it doesn't seem urgent in the short term; basically, if we encounter conflicts then we will have to update the agent-eval litellm version at that time in order to fix them.

But as I think about it, it probably would be good to explain in the README somewhere that when you bump the version you should also update the cost map hash. Hopefully we can streamline the process at some point but this seems like it would work for now?

ca16 (Collaborator, Author) commented Aug 21, 2025

And it looks like pinning to 1.68.0 already creates problems for asta-bench:

#21 4.410   × No solution found when resolving dependencies for split (included:
#21 4.410   │ astabench[futurehouse], astabench[storm]; excluded: astabench[smolagents],
#21 4.410   │ astabench[sqa]):
#21 4.410   ╰─▶ Because futurehouse-client==0.3.19 depends on litellm==1.67.4.post1 and
#21 4.410       astabench[futurehouse] depends on futurehouse-client==0.3.19, we can
#21 4.410       conclude that astabench[futurehouse] depends on litellm==1.67.4.post1.
#21 4.410       And because your project depends on litellm==1.68.0 and your project
#21 4.410       requires astabench[futurehouse], we can conclude that your project's
#21 4.410       requirements are unsatisfiable.

How about this for now:
What's in this PR, but with a range pinned instead of a specific version, where the top of the range is whatever version corresponds to the cost file we're pointing at, and the bottom is the lowest version that's acceptable to us, if any (is there one?).

When we need or want to bump to a higher version of litellm, that would be the time to update the cost file we point at, as Mike says, and figure out whether we want to rescore things. I'll add something to the README about that.

Next steps could include:

  • better tracking and displaying of information around how something was rescored
  • maybe some docs with guidelines for when it's appropriate to rescore everything (I won't make this concrete in the docs I add for now - figuring out the guidelines doesn't seem like a launch blocker)
  • maybe refactor stuff such that we could pin litellm just for scoring

mdarcy220 (Contributor)

> we pin a range instead of a specific version

I guess I don't understand; a range just means the installer will choose a specific version that satisfies the requirements, but if such a version exists, can we just pin to it? (I suggested 1.68.0 since I thought that was the one we end up with, but might have misremembered)

I don't have a huge objection to a range, though; IIRC the 1.68.0 bump was significant in some way (though we've now established that my memory may not be trustworthy), so we should be careful about letting the lower bound fall below that.

ca16 (Collaborator, Author) commented Aug 21, 2025

Pinning to 1.67.4.post1 seems to work.

mdarcy220 (Contributor) left a comment

> Pinning to 1.67.4.post1 seems to work.

Awesome; that sounds good from my perspective.

jbragg (Collaborator) commented Aug 21, 2025

If @mdarcy220 doesn't have a huge objection to a range, isn't it better, so that astabench is compatible with more agents?

mdarcy220 (Contributor)

> If @mdarcy220 doesn't have a huge objection to a range, isn't it better, so that astabench is compatible with more agents?

The tradeoff being that the cost calculation may change in ways we don't recognize.

Incompatibility between agent-eval and the agent baselines is already a problem we'll have to address, e.g. if sqa ever upgrades.

My suggestion is to be as strict as possible right now and only back off to a range when a need is demonstrated (hopefully by then we can decouple agent-eval's litellm from the agent baselines' litellm).

But ultimately I defer to @ca16's judgement, and I think the specific version setting isn't the most pressing thing to perfect by Tuesday; a range is fine with me if it simplifies resolution.

ca16 (Collaborator, Author) commented Aug 21, 2025

I've written the docs to work with either scenario. For right now, I think I lean towards going with Mike's suggestion, at least until we figure out how we want to store relevant information along with computed costs...

ca16 (Collaborator, Author) commented Aug 21, 2025

summary after pinning to 1.67.4.post1:

      "score": 0.3148783028247495,
      "score_stderr": null,
      "cost": 0.03517972139633324,
      "cost_stderr": null
    },
    "tag/lit": {
      "score": 0.359073892718989,
      "score_stderr": null,
      "cost": 0.0432097658085206,
      "cost_stderr": null
    },
    "tag/data": {
      "score": 0.2692091729748633,
      "score_stderr": null,
      "cost": 0.011127473221757321,
      "cost_stderr": null
    },
    "tag/code": {
      "score": 0.5052960102960103,
      "score_stderr": null,
      "cost": 0.05169765155505505,
      "cost_stderr": null
    },
    "tag/discovery": {
      "score": 0.12593413530913533,
      "score_stderr": null,
      "cost": 0.034683994999999995,
      "cost_stderr": null
    },
    "task/paper_finder_test": {
      "score": 0.1680882137565247,
      "score_stderr": 0.01736167991412297,
      "cost": 0.0384160127340824,
      "cost_stderr": 0.004496229162562603
    },
    "task/paper_finder_litqa2_test": {
      "score": 0.6133333333333333,
      "score_stderr": 0.05661099544085763,
      "cost": 0.11436817,
      "cost_stderr": 0.017359728699967197
    },
    "task/sqa_test": {
      "score": 0.2672615497644688,
      "score_stderr": 0.03773623587370571,
      "cost": 0.0270534035,
      "cost_stderr": 0.0019422575674411598
    },
    "task/arxivdigestables_test": {
      "score": 0.3209458073549626,
      "score_stderr": 0.016795287448771803,
      "cost": 0.012757592,
      "cost_stderr": 0.000598238094979927
    },
    "task/litqa2_test": {
      "score": 0.7466666666666667,
      "score_stderr": 0.05055844297598726,
      "cost": 0.07485594,
      "cost_stderr": 0.014531272631798089
    },
    "task/discoverybench_test": {
      "score": 0.2692091729748633,
      "score_stderr": 0.024402794451474107,
      "cost": 0.011127473221757321,
      "cost_stderr": 0.0006222390294466313
    },
    "task/core_bench_test": {
      "score": 0.4594594594594595,
      "score_stderr": 0.08305895907471071,
      "cost": 0.04720825405405405,
      "cost_stderr": 0.007319669323511871
    },
    "task/ds1000_test": {
      "score": 0.71,
      "score_stderr": 0.015133811749341808,
      "cost": 0.002989897277777778,
      "cost_stderr": 0.0000549948574693956
    },
    "task/e2e_discovery_test": {
      "score": 0.09482323232323234,
      "score_stderr": 0.03868709687867958,
      "cost": 0.02972574375,
      "cost_stderr": 0.0029628950153995407
    },
    "task/e2e_discovery_hard_test": {
      "score": 0.1570450382950383,
      "score_stderr": 0.04217558424596078,
      "cost": 0.03964224625,
      "cost_stderr": 0.004099116765299048
    },
    "task/super_test": {
      "score": 0.3464285714285715,
      "score_stderr": 0.0673848147211148,
      "cost": 0.10489480333333333,
      "cost_stderr": 0.023534508573437658
    }
  }
}

# Warn when litellm's merged cost map has keys not present in the pinned cost map.
desired_model_costs_keys = set(desired_model_costs.keys())
in_current_not_in_desired = current_model_cost_keys - desired_model_costs_keys
if len(in_current_not_in_desired) > 0:
    click.echo(
        f"WARNING: Info for {in_current_not_in_desired} is available "
        "but not from the specified cost map!"
    )
ca16 (Collaborator, Author)

Note: made this a warning instead of an error because it looks like the assumption that later versions of the file wouldn't drop entries moving forward was wrong (though maybe it's also related to the key thing in the PR description). Anyway, the message looks like:

WARNING: Info for {'sambanova/Qwen2.5-72B-Instruct', 'accounts/fireworks/models/llama-v3p2-90b-vision-instruct', 'claude-instant-1.2', 'fireworks-ai-16.1b-to-80b', 'fireworks-ai-up-to-16b', 'sambanova/Meta-Llama-3.1-70B-Instruct', 'mistralai/mistral-small-3.1-24b-instruct', 'voyage/voyage-01', 'sambanova/Qwen2.5-Coder-32B-Instruct', 'claude-2', 'claude-2.1', 'claude-3-sonnet-20240229', 'claude-instant-1', 'cerebras/llama3.3-70b'} is available but not from the specified cost map!

(I think we've already established that we need to know the litellm version to reconstruct the right cost file, so I don't think this changes much.)

ca16 merged commit 4fdc118 into main Aug 21, 2025 (4 checks passed)
ca16 deleted the chloea-score-freezing-take-two branch August 21, 2025 23:27
ca16 (Collaborator, Author) commented Aug 21, 2025

Published a new library version.
