Conversation

@jmamou jmamou commented Oct 14, 2024

What does this PR do?

Following #33258 and #33657.

The assistant's confidence threshold is adjusted throughout the speculative iterations to reduce the number of unnecessary draft and target forward passes. The costs are estimated based on the ROC curve, which considers the probability of the draft token and its match with the target. A cost of 25% is assigned to false positives and 75% to false negatives.
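As an illustrative sketch of the idea (an assumption about the mechanism, not the PR's actual implementation): given logged draft-token probabilities and whether the target model accepted each token, one can pick the confidence threshold that minimizes a weighted misclassification cost, with false positives (drafted token rejected by the target) weighted at 25% and false negatives (drafting stopped although the target would have accepted) at 75%:

```python
def best_threshold(probs, accepted, fp_cost=0.25, fn_cost=0.75):
    """Pick the confidence threshold minimizing the weighted FP/FN cost.

    probs: draft-model probability of each drafted token.
    accepted: whether the target model accepted that token.
    """
    best_t, best_cost = 0.0, float("inf")
    for t in sorted(set(probs)):
        # False positive: confidence above threshold, but target rejected.
        fp = sum(1 for p, a in zip(probs, accepted) if p >= t and not a)
        # False negative: confidence below threshold, but target would accept.
        fn = sum(1 for p, a in zip(probs, accepted) if p < t and a)
        cost = fp_cost * fp + fn_cost * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t
```

In the PR this trade-off is evaluated over the speculative iterations seen so far, so the threshold adapts to the model pair and decoding setup instead of staying fixed.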

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@gante @amyeroberts

@gante gante left a comment

In addition to the comments: can you share the benchmark results here as well, for future reference?


jmamou commented Oct 16, 2024

> In addition to the comments: can you share the benchmark results here as well, for future reference?

A100
target: starcoder
draft: tiny_starcoder
dataset: MBPP

- Heuristics: mean_inference_time = 16.33 ms
- Fixed threshold for dynamic SL (#33258): mean_inference_time = 14.03 ms
- Adaptive threshold for dynamic SL (current PR): mean_inference_time = 13.42 ms

I will later run the benchmarks from https://huggingface.co/blog/dynamic_speculation_lookahead

gante commented Oct 17, 2024

@jmamou yeah, let's please run more benchmarks before (iterating on the PR and) merging

In the odd chance it ends up being beneficial only in very specific circumstances, I'd rather not merge the technique to avoid adding complexity (which usually reduces our team's ability to work on more projects 🤗 )

jmamou commented Oct 31, 2024

> @jmamou yeah, let's please run more benchmarks before (iterating on the PR and) merging
>
> In the odd chance it ends up being beneficial only in very specific circumstances, I'd rather not merge the technique to avoid adding complexity (which usually reduces our team's ability to work on more projects 🤗 )

@gante
I have run benchmarks from https://huggingface.co/spaces/joaogante/assisted_generation_benchmarks
https://github.com/gante/huggingface-demos/tree/main/experiments/faster_generation
Evaluated metric: time per token in ms (lower is better)
Device: RTX 3090; dtype applies to both models

| Model | Assistant | dtype | task | sampling? | w/o assistant | disco 0.4 | adaptive disco | disco speedup | adaptive disco speedup |
|---|---|---|---|---|---|---|---|---|---|
| openai/whisper-large-v2 | openai/whisper-tiny | fp16 | automatic speech recognition | no | 20.02 | 14.59 | 13.81 | 1.37 | 1.45 |
| facebook/opt-6.7b | facebook/opt-125m | bf16 | summarization | no | 23.81 | 8.73 | 8.72 | 2.73 | 2.73 |
| facebook/opt-6.7b | facebook/opt-125m | bf16 | summarization | yes (t=0.6) | 24.21 | 12.01 | 10.55 | 2.02 | 2.29 |
| facebook/opt-6.7b | facebook/opt-125m | bf16 | open-ended generation | no | 22.14 | 14.19 | 14.14 | 1.56 | 1.57 |
| facebook/opt-6.7b | facebook/opt-125m | bf16 | open-ended generation | yes (t=0.7) | 22.13 | 14.16 | 14.09 | 1.56 | 1.57 |
| Salesforce/codegen-6B-mono | Salesforce/codegen-350M-mono | bf16 | code generation (python) | no | 30.88 | 26.8 | 26.95 | 1.15 | 1.15 |
| Salesforce/codegen-6B-mono | Salesforce/codegen-350M-mono | bf16 | code generation (python) | yes (t=0.4) | 37.02 | 35.88 | 33.79 | 1.03 | 1.1 |
| google/flan-t5-xl | google/flan-t5-small | bf16 | summarization | no | 24.76 | 20.11 | 20.1 | 1.23 | 1.23 |
| google/flan-t5-xl | google/flan-t5-small | bf16 | summarization | yes (t=0.6) | 24.44 | 26.78 | 25.15 | 0.91 | 0.97 |
| meta-llama/Llama-3.1-8B | meta-llama/Llama-3.2-1B | bf16 | summarization | no | 33.06 | 19.27 | 19.29 | 1.72 | 1.71 |
| meta-llama/Llama-3.1-8B | meta-llama/Llama-3.2-1B | bf16 | summarization | yes (t=0.6) | 33.6 | 24.35 | 21.69 | 1.38 | 1.55 |
| meta-llama/Llama-3.1-8B | meta-llama/Llama-3.2-1B | bf16 | open-ended generation | no | 31.25 | 33.2 | 33.1 | 0.94 | 0.94 |
| meta-llama/Llama-3.1-8B | meta-llama/Llama-3.2-1B | bf16 | open-ended generation | yes (t=0.7) | 31.35 | 42.29 | 39.02 | 0.74 | 0.8 |
| meta-llama/Llama-3.1-8B | meta-llama/Llama-3.2-1B | bf16 | code generation (python) | no | 27.98 | 19.49 | 19.72 | 1.44 | 1.42 |
| meta-llama/Llama-3.1-8B | meta-llama/Llama-3.2-1B | bf16 | code generation (python) | yes (t=0.4) | 28.6 | 24.23 | 20.85 | 1.18 | 1.37 |

The adaptive threshold shows its largest gains when do_sample=True, likely because the fixed threshold of 0.4 was tuned for greedy decoding. A lower threshold seems to be needed when sampling, which highlights the value of adapting the threshold as this PR proposes.

gante commented Nov 4, 2024

@jmamou I'm convinced :D The benchmarks do show a consistent upgrade

@gante gante left a comment


LGTM, thank you for the thorough benchmark 🤗

@gante gante requested a review from ArthurZucker November 4, 2024 10:34

jmamou commented Nov 20, 2024

@ArthurZucker could you please review it?

@ArthurZucker ArthurZucker left a comment


LGTM! It's just missing a test!



jmamou commented Dec 4, 2024

> LGTM! It's just missing a test!

@ArthurZucker
done!

@ArthurZucker

`from transformers.generation.candidate_generator import AssistedCandidateGenerator` needs to guard its import of torch! candidate_generator.py needs to check torch availability!

@ArthurZucker

Let's GOOOOOOO! 🚀

@ArthurZucker ArthurZucker merged commit e27465c into huggingface:main Dec 5, 2024
22 checks passed
@jmamou jmamou deleted the adaptive-SL branch May 28, 2025 13:00