Adaptive dynamic number of speculative tokens #34156
gante
left a comment
In addition to the comments: can you share the benchmark results here as well, for future reference?
A100 Heuristics: mean_inference_time=16.33ms. I will run the benchmark from https://huggingface.co/blog/dynamic_speculation_lookahead later.
shorter docstring Co-authored-by: Joao Gante
@jmamou yeah, let's please run more benchmarks before (iterating on the PR and) merging. In the odd chance it ends up being beneficial only in very specific circumstances, I'd rather not merge the technique, to avoid adding complexity (which usually reduces our team's ability to work on more projects 🤗)
@gante
An improvement is observed when |
@jmamou I'm convinced :D The benchmarks do show a consistent upgrade
gante
left a comment
LGTM, thank you for the thorough benchmark 🤗
@ArthurZucker could you please review it?
ArthurZucker
left a comment
LGTM! It's just missing a test!
@ArthurZucker |
Let's GOOOOOOO! 🚀
What does this PR do?
Following #33258 and #33657.
The assistant's confidence threshold is adjusted throughout the speculative iterations to reduce the number of unnecessary draft and target forward passes. The costs are estimated based on the ROC curve, which considers the probability of the draft token and its match with the target. A cost of 25% is assigned to false positives and 75% to false negatives.
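To make the cost weighting concrete, here is a minimal sketch of how a threshold could be picked from recorded draft-token probabilities and match outcomes. It is not the code from this PR: the function name `pick_confidence_threshold` and its parameters are hypothetical, and it scans candidate thresholds directly in pure Python rather than using a library ROC routine. It assumes a draft token counts as a "positive" prediction when its probability is at or above the threshold.

```python
from typing import Optional, Sequence


def pick_confidence_threshold(
    probs: Sequence[float],
    matches: Sequence[int],
    fp_cost: float = 0.25,  # weight on false positives, as stated in the description
    fn_cost: float = 0.75,  # weight on false negatives, as stated in the description
) -> Optional[float]:
    """Hypothetical helper: return the threshold minimizing the weighted cost.

    probs   -- probability the draft model assigned to each drafted token
    matches -- 1 if the target model accepted that token, else 0
    """
    if len(probs) != len(matches):
        raise ValueError("probs and matches must have the same length")

    positives = sum(1 for m in matches if m)
    negatives = len(matches) - positives
    if positives == 0 or negatives == 0:
        # The rates below are undefined without both classes.
        return None

    best_threshold, best_cost = None, float("inf")
    # Every observed probability is a candidate operating point on the ROC curve.
    for threshold in sorted(set(probs)):
        # False positive: token kept (prob >= threshold) but rejected by the target.
        fp = sum(1 for p, m in zip(probs, matches) if p >= threshold and not m)
        # False negative: token dropped (prob < threshold) though the target would accept it.
        fn = sum(1 for p, m in zip(probs, matches) if p < threshold and m)
        cost = fp_cost * fp / negatives + fn_cost * fn / positives
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold


probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
matches = [1, 1, 1, 0, 1, 0]
threshold = pick_confidence_threshold(probs, matches)
```

Because false negatives are weighted three times as heavily as false positives, the selected threshold leans low: under this weighting, stopping the draft too early on a token the target would have accepted is treated as more costly than drafting a token that ends up rejected.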
Who can review?
@gante @amyeroberts