[WIP] - Enable speculative decoding with batch size >1 #32189
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Awesome 🙌 plz ping me when you have questions or when the PR is ready! Don't forget to add tests, and, if possible, benchmarks in the PR for future reference 🙏
Very interested in this! Please ping me when it is done
Do you have something new?
I have an error
@ylacombe
cc @ylacombe regarding this PR
thanks
@ylacombe
GPT-2 models and the Whisper model don't work
Any update?
What does this PR do?
This PR aims to solve issue #32165.
I've started adapting the code to enable speculative decoding with `batch_size > 1`, reusing some of the work from the earlier PR #26875.
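For context, here is a minimal sketch of the batched usage this PR aims to enable; the checkpoints and the dummy audio inputs are illustrative assumptions, not part of the PR.

```python
import numpy as np
from transformers import AutoProcessor, WhisperForCausalLM, WhisperForConditionalGeneration

# Illustrative main/assistant pair for speculative decoding with Whisper.
processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
assistant_model = WhisperForCausalLM.from_pretrained("distil-whisper/distil-large-v2")

# A batch of several audio clips (dummy 1-second clips here), i.e. batch_size > 1.
audio_batch = [np.random.randn(16_000).astype(np.float32) for _ in range(4)]
inputs = processor(audio_batch, sampling_rate=16_000, return_tensors="pt")

# Assisted (speculative) generation over the whole batch: this is the call that
# currently only supports batch_size == 1 and that this PR works on generalising.
generated_ids = model.generate(**inputs, assistant_model=assistant_model)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```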
Main steps of the solution:
When batch size > 1:
1. Compute the number of similar tokens between the candidate tokens and the tokens obtained after a forward pass with the main model. This results in a tensor `n_matches` with the number of matches for each sequence in the batch.
2. We keep the matching tokens. For that, we keep all tokens from the main model's output with a sequence position lower than `n_matches.max() + 1`. In doing so, we also retain some potentially mismatched tokens, which we deal with in the next steps using padding tokens. The resulting tensor, `input_ids`, thus has shape (`batch_size`, `n_matches.max() + 1`).
3. We shift each sequence `i` in `input_ids` by `n_matches.max() - n_matches[i]`. The matching tokens are displaced to the right of `input_ids` and `n_matches.max() - n_matches[i]` padding tokens are added to the left.
4. Left cut: we cut all columns that contain only padding tokens. By design, these columns are at the left of `input_ids`. In this way we keep the smallest possible `input_ids` that contains all the information needed to continue assisted generation.

Steps 1 to 4 are the main addition to the original speculative decoding loop described in detail in this blog to enable assisted generation with BS > 1; a sketch of these steps is given below.
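Here is a minimal, self-contained sketch of steps 1 to 4. The function and argument names (`realign_after_verification`, `candidate_ids`, `main_model_ids`) and the padding convention are assumptions for illustration only; the actual change lives inside the assisted-generation loop of `generate`.

```python
import torch

def realign_after_verification(candidate_ids, main_model_ids, pad_token_id):
    """Hypothetical helper sketching steps 1-4 on the newly generated chunk.

    candidate_ids:  (batch_size, num_candidates) tokens proposed by the assistant model.
    main_model_ids: (batch_size, num_candidates + 1) tokens selected by the main model
                    at the same positions, plus one extra "bonus" token.
    """
    # Step 1: number of leading candidate tokens accepted by the main model, per sequence.
    matches = (candidate_ids == main_model_ids[:, : candidate_ids.shape[1]]).long()
    n_matches = matches.cumprod(dim=-1).sum(dim=-1)               # shape (batch_size,)

    # Step 2: keep every position up to n_matches.max() + 1; some rows now carry
    # mismatched leftovers, which the padding added in step 3 will neutralise.
    keep_len = int(n_matches.max()) + 1
    kept = main_model_ids[:, :keep_len]                           # (batch_size, n_matches.max() + 1)

    # Step 3: shift sequence i right by n_matches.max() - n_matches[i], filling the
    # freed left positions with padding tokens so the valid tokens are right-aligned.
    shifted = torch.full_like(kept, pad_token_id)
    for i in range(kept.shape[0]):
        shift = keep_len - (int(n_matches[i]) + 1)                # = n_matches.max() - n_matches[i]
        shifted[i, shift:] = kept[i, : keep_len - shift]

    # Step 4 (left cut): drop the leftmost columns that contain only padding tokens,
    # keeping the smallest tensor that still holds all the needed information.
    non_pad_cols = (shifted != pad_token_id).any(dim=0)
    first_kept = int(non_pad_cols.long().argmax())
    return shifted[:, first_kept:], n_matches
```

Note that on this isolated chunk the left cut is a no-op, since the row with the most matches has no left padding; presumably it pays off over the full `input_ids`, where padding can accumulate across assisted-generation iterations.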
To make this work, we also need to adapt the computation of the `attention_masks`, `past_key_values` and `position_ids` to take into account the shifted positions of the generated tokens.

To do:
- For now, I want to make this work with Whisper using this snippet.
- I've implemented steps 1 to 4 and adapted the computation of the `attention_masks` and `past_key_values` to handle the new padding tokens.
- I still need to make some adaptations to the `position_ids` to make this work properly. From what I can see:
  - For the main model (here `WhisperForConditionalGeneration`), `position_ids` are inferred directly from the `attention_mask`, as we can see here. So if we pass the right attention mask to `generate`, we should be good (see the sketch after this list).
  - For the assistant model (`WhisperForCausalLM` in our example), `position_ids` are currently neither computed nor passed to `prepare_inputs_for_generation`, and I'm not sure exactly why. I've made a first attempt at solving this, with no success so far.

cc @sanchit-gandhi @gante
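For reference, the attention-mask-based inference of `position_ids` mentioned above boils down to the cumulative-sum pattern used in several models of the library; below is a minimal sketch with a left-padded toy batch (the values are illustrative, and the exact Whisper call site is the one linked above).

```python
import torch

# Left-padded batch: 0 marks the padding inserted by the shift step, 1 marks real tokens.
attention_mask = torch.tensor([[0, 0, 1, 1, 1],
                               [1, 1, 1, 1, 1]])

# Positions count only the non-padding tokens; padded slots get a dummy value,
# which is harmless because those positions are masked out by the attention mask anyway.
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)

print(position_ids)
# tensor([[1, 1, 0, 1, 2],
#         [0, 1, 2, 3, 4]])
```

So as long as the attention mask marks the padding tokens introduced in step 3 with zeros, the main model should compute correct positions for the shifted tokens.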