Cost freezing take two #66
Conversation
pyproject.toml
Outdated
```diff
 # pin litellm so that we know what model costs we're using
 # see https://github.com/allenai/astabench-issues/issues/391 before changing
-"litellm==1.75.8",
+"litellm<=1.75.8",
```
Don't go above this, to avoid silently supporting new models (see this scenario).
does the dep conflict in astabench come from the sqa deps? If so, the fundamental issue is that we're picking up their pinned version and will likely encounter errors in the future whenever they update the version in their lib.

not sure what to do about it... but I do think for now we could pin to `==1.68.0` to satisfy the constraints (while still being sure of what version we're getting; we don't want 0.2.0, for example)
Double checking that I'm following - you mean pinning to `==1.68.0` in addition to what's in this PR?
that's what I was thinking; I suppose either way could work, but we'd probably lose the costs for gpt-5 if we downgraded with the original solution, so the other changes in this PR seem good to have regardless
Cool, yeah costs for gpt-5 are the main reason I also think we should keep what's in here.
My concern with pinning is that it will create problems for any agents that depend on a specific litellm version. Ideally the pinned version is only required for scoring.
I agree, but would also note that this is already a problem; we established that we can't easily allow higher versions than the one we want to freeze to, so if sqa ever bumps their version it will break us. I think we should aim to resolve this in the long term, but it doesn't seem urgent in the short term; basically, if we encounter conflicts then we will have to update the agent-eval litellm version at that time in order to fix them. But as I think about it, it probably would be good to explain in the README somewhere that when you bump the version you should also update the cost map hash. Hopefully we can streamline the process at some point, but this seems like it would work for now?
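(Illustrative aside, not from this PR: the README step could be mechanized by fingerprinting the cost map litellm actually ends up with, so a version bump that changes costs is easy to spot. The helper below is a hypothetical sketch.)

```python
# Hypothetical sketch: hash the cost map litellm is actually using, so the
# README process can be "bump litellm -> re-run this -> update the hash".
import hashlib
import json
from importlib.metadata import version

import litellm


def cost_map_fingerprint() -> str:
    # Serialize deterministically so the hash only changes when costs do.
    blob = json.dumps(litellm.model_cost, sort_keys=True, default=str)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


print(version("litellm"), cost_map_fingerprint())
```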
And it looks like pinning to 1.68.0 already creates problems for asta-bench. How about this for now: when we need/want to bump to a higher version of litellm, that would be the time to update the cost file we point at, like Mike says, and figure out if we want to rescore stuff. I'll add something to the README about that. Next steps could include:
I guess I don't understand; a range just means the installer will choose a specific version that satisfies the requirements, but if such a version exists, can we just pin to it? (I suggested 1.68.0 since I thought that was the one we end up with, but I might have misremembered.) I don't have a huge objection to a range, though; IIRC the 1.68.0 bump was significant in some way (though we've now established that my memory may not be trustworthy), so we should be careful about letting the lower bound fall below that.
Pinning to `1.67.4.post1` seems to work.
mdarcy220 left a comment:
> Pinning to `1.67.4.post1` seems to work.
Awesome; that sounds good from my perspective
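(Side note, not from the PR: if we do pin, a fail-fast check at scoring time would catch environments that drifted from the pin. A minimal sketch, with the expected version taken from this thread as an assumption:)

```python
# Minimal sketch (not PR code): refuse to score if the installed litellm
# doesn't match the version the cost freezing was validated against.
from importlib.metadata import version

EXPECTED_LITELLM = "1.67.4.post1"  # assumption: the pin discussed above

installed = version("litellm")
if installed != EXPECTED_LITELLM:
    raise RuntimeError(
        f"litellm {installed} is installed, but cost freezing was "
        f"validated against {EXPECTED_LITELLM}"
    )
```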
If @mdarcy220 doesn't have a huge objection to a range, isn't it better, so that astabench is compatible with more agents?
The tradeoff being that cost calc may change in ways we don't recognize. Incompatibility between agenteval and agent baselines is already a problem we will have to address, e.g. if sqa ever upgrades. My suggestion is to be as strict as possible right now and only back off to a range when a need is demonstrated (hopefully by then we can decouple agent-eval litellm from the agent baselines litellm). But ultimately I defer to @ca16's judgement, and I think the specific version setting isn't the most pressing thing to perfect by Tuesday; a range is fine with me if it simplifies resolution.
I've written the docs to work with either scenario. For right now, I think I lean towards going with Mike's suggestion, at least until we figure out how we want to store relevant information along with computed costs...
Summary after pinning to 1.67.4.post1:
```python
desired_model_costs_keys = set(desired_model_costs.keys())
in_current_not_in_desired = current_model_cost_keys - desired_model_costs_keys
if len(in_current_not_in_desired) > 0:
    click.echo(
        f"WARNING: Info for {in_current_not_in_desired} is available "
        "but not from the specified cost map!"
    )
```
Note: made this a warning instead of an error because it looks like the assumption that later versions of the file wouldn't drop entries moving forward was wrong (though maybe it's also related to the key thing in the PR description). Anyway, the message looks like:

WARNING: Info for {'sambanova/Qwen2.5-72B-Instruct', 'accounts/fireworks/models/llama-v3p2-90b-vision-instruct', 'claude-instant-1.2', 'fireworks-ai-16.1b-to-80b', 'fireworks-ai-up-to-16b', 'sambanova/Meta-Llama-3.1-70B-Instruct', 'mistralai/mistral-small-3.1-24b-instruct', 'voyage/voyage-01', 'sambanova/Qwen2.5-Coder-32B-Instruct', 'claude-2', 'claude-2.1', 'claude-3-sonnet-20240229', 'claude-instant-1', 'cerebras/llama3.3-70b'} is available but not from the specified cost map!

(I think we've already established that we need to know the litellm version to be able to reconstruct the right cost file, so I don't think this changes much.)
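(For context, a standalone sketch of the comparison that warning comes from; the frozen-file path and variable names here are assumptions, not the PR's actual code:)

```python
# Sketch: diff the installed litellm cost map against a frozen copy.
# "data/model_prices_1_75_8.json" is a hypothetical vendored file.
import json

import litellm

with open("data/model_prices_1_75_8.json") as f:
    desired_model_costs = json.load(f)

current_keys = set(litellm.model_cost)
desired_keys = set(desired_model_costs)

# Keys litellm knows about that the frozen map lacks: warn (as above).
print("in current, not in desired:", sorted(current_keys - desired_keys))
# Keys the frozen map has that this litellm dropped: also worth surfacing.
print("in desired, not in current:", sorted(desired_keys - current_keys))
```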
Published a new library version.
Related to https://github.com/allenai/astabench-issues/issues/391.
We started with #63, but that won't work because it results in dependencies that aren't satisfiable in asta-bench. I think this fixes it though.
The idea builds on what was introduced in #63. We want to avoid automatically pulling new cost info in on the fly, so the code still expects `LITELLM_LOCAL_MODEL_COST_MAP` to be set to true. This means that we'll look at the local cost file. The local cost file can change depending on the version of litellm we have. #63 pinned the version of litellm so that we would always be using the same version of the local cost file. But we can't do that because of asta-bench's requirements (specifically, one of the dependencies wants a lower version of litellm). This PR tries to address that by doing two things (a sketch of the idea follows the list):

1. Call `register_model()` with a specific version of the local cost file (the version in litellm's version 1.75.8). This means litellm will sort of merge the local cost file with the 1.75.8 version of the local cost file, with the 1.75.8 version's info getting used when there are conflicts. (The merging process is a little more complicated than that; see the `litellm.model_cost` diff info further down.)
2. Set `litellm<=1.75.8`. This should allow for satisfiable dependencies for asta-bench, and also prevent us from pulling in cost info for models not represented in 1.75.8's local cost file.
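Roughly, the two pieces fit together like this. This is a minimal sketch, not the PR's actual code: the vendored file path is hypothetical, and the env var has to be set before litellm is imported since the cost map is loaded at import time.

```python
import json
import os

# (1a) Use litellm's bundled local cost map instead of fetching the remote
# one; litellm reads this at import time, so set it before importing.
os.environ["LITELLM_LOCAL_MODEL_COST_MAP"] = "True"

import litellm  # noqa: E402

# (1b) Merge a frozen copy of the 1.75.8 cost file over whatever local map
# the installed litellm shipped with. The path here is hypothetical.
with open("data/model_prices_1_75_8.json") as f:
    litellm.register_model(json.load(f))
```

(Piece 2 is just the `litellm<=1.75.8` bound in pyproject.toml, shown in the diff at the top of this page.)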
Testing done:

I tried scoring `hf://allenai/asta-bench-internal-submissions/1.0.0-dev1/test/miked-ai_ReAct-GPT-5-mini_2025-08-11T16-35-18` with litellm versions 1.75.8 and 1.75.0. (1.75.0 is, I think, from before costs were introduced for gpt-5, which that submission uses. Testing in #63 established that using just the local cost file from 1.75.0 results in null cost results.)

With litellm version 1.75.8:
summary stats:
With litellm version 1.75.0:
Summary stats:
I also tried looking at litellm.model_cost in both cases... It's not the same.
"vertex_ai/claude-opus-4", we dropinput_cost_per_token_batchesandoutput_cost_per_token_batches. In the merged model costs for 1.75.0, these have valuesinput_cost_per_token_batches=7.5e-06andoutput_cost_per_token_batches=3.75e-05. They have null values in the merged model costs for 1.75.8. They don't appear at all in the local files for either 1.75.0 or 1.75.8. Not sure what's going on there...vertex_ai/claude-opus-4-1has an entry in the merged costs for 1.75.8 but not in the merged costs for 1.75.0. A little context... When two model cost dicts get merged, it seems like the dict key isn't always what determines the key by which we figure out which entries to merge, see this... The local file for 1.75.0 has nothing forvertex_ai/claude-opus-4-1, and the local file for 1.75.8 hasvertex_ai/claude-opus-4-1andvertex_ai/claude-opus-4-1@20250805. I think what maybe happened here is that in 1.75.0, both thevertex_ai/claude-opus-4-1andvertex_ai/claude-opus-4-1@20250805got mapped to the same key, and the second one won out.So basically... I think this is imperfect but I'm not sure we can do much better given that going through the merging process seems unavoidable unless you either allow for using the remote file, or pin litellm and don't allow for using the remote file (both of which don't work for us currently). So I think maybe the best we can do rn is something like this and also print out the litellm version the code is on. Between that and the agent-eval commit, we'll have the litellm version and the cost file we're pointing at for register_model, which I think should be enough to reconstruct the cost file used (though not necessarily enough to automatically tell us whether two results with different values here are still comparable cost wise). Ideally we'd save the litellm version somewhere alongside the scores, but maybe we can deal with that in the next iteration...