enable StaticCache for assisted generation #34797
base: main
Conversation
@gante , could you please take a look? Thanks.
@yao-matrix hey, gante is currently on a long vacation, so I reviewed the PR for him. Thanks for adding support for this, super cool work!
I left a few comments, and we'll also need tests in the tests/generation/test_utils.py file. I take it static cache now works with all types of candidate generators, right?
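For illustration, here is a rough standalone sketch of the kind of check such a test could make. The Qwen2.5 checkpoints, variable names, and assertion are my own assumptions, not the test that was eventually added to test_utils.py; any StaticCache-capable model plus a smaller assistant sharing its tokenizer would work the same way.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoints; a real test in test_utils.py would use the
# library's tiny random test models instead.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
assistant = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

inputs = tokenizer("Assisted generation with a static cache", return_tensors="pt")

# Greedy assisted generation with the default (dynamic) cache ...
dynamic_out = model.generate(
    **inputs, assistant_model=assistant, max_new_tokens=10, do_sample=False
)
# ... and with the StaticCache path this PR enables.
static_out = model.generate(
    **inputs,
    assistant_model=assistant,
    max_new_tokens=10,
    do_sample=False,
    cache_implementation="static",
)

# The cache implementation must not change greedy results.
assert torch.equal(dynamic_out, static_out)
```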
src/transformers/generation/utils.py (outdated)
if assistant_model is not None:
    assistant_model._get_cache(
        cache_implementation=generation_config.cache_implementation,
        batch_size=max(generation_config.num_beams, generation_config.num_return_sequences) * batch_size,
        max_cache_len=max_cache_length,
        device=device,
        model_kwargs=model_kwargs,
    )
Hmm, I think it will be called on the assistant model when we call assistant.generate(), so there is no need for this here. We can simply remove self.generation_config.cache_implementation = None in the candidate generator.
The thing is: if we leave it to assistant_model.generate (called in get_candidates) to set up the cache, then on the first call max_new_tokens is set to max_new_tokens = min(int(self.num_assistant_tokens), self.generation_config.max_length - new_cur_len - 1), so the cache length would be int(self.num_assistant_tokens) + prompt_len, which is smaller than the actually required cache length max_token_length + prompt_length, and generation then asserts out. So the key point is that the assistant model's cache length must match the main model's. I also noticed that this function takes assistant_model as an argument without using it; I think it may be there exactly for cases like this. That's the rationale.
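To make the mismatch concrete, here is a tiny arithmetic sketch of the scenario described above; the numbers are made up, and the variable names mirror but are not the actual code.

```python
# Hypothetical numbers illustrating the cache-length mismatch described above.
prompt_len = 32            # tokens already in the prompt
max_new_tokens = 128       # requested on the *main* model's generate()
num_assistant_tokens = 5   # candidate tokens drafted per assistant step

# Cache length the assistant actually needs over the whole generation:
required_cache_len = prompt_len + max_new_tokens            # 160

# Cache length it would get if sized lazily on its first generate() call,
# where max_new_tokens is clamped to num_assistant_tokens:
first_call_cache_len = prompt_len + num_assistant_tokens    # 37

assert first_call_cache_len < required_cache_len
# Later drafting steps would overflow this undersized StaticCache, which is why
# the assistant cache is pre-allocated here with the main model's max_cache_len.
```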
Oh, I see, that makes sense. Then we can leave the cache init here.
LGTM! We need some tests, and then I'll request a review from the core maintainer; after that we can merge.
@zucchini-nlp , the test_utils CI pass rate is the same before and after this PR, as below, so no regressions are introduced. After:
Thanks for reviewing.
@yao-matrix no worries if some tests are failing and they're not related to the PR changes. They might just be flaky or will be fixed on …
@zucchini-nlp , any more comments for me to iterate on? Thanks.
@yao-matrix no, the only thing is the CI, which is failing now. I pointed out the relevant test in a previous comment; if you can add one more test in … At the end you need to run …
@parameterized.expand([(None, True), ("static", False)])
def test_assisted_decoding_with_num_logits_to_keep(self, cache_implementation, return_legacy_cache):
    if cache_implementation == "static":
        self.skipTest("Gemma2 has HybridCache which is not compatible with assisted decoding StaticCache")
    pass
Let's not skip entirely, but only the static_cache test, as we still need to check that assisted generation works in Gemma2 :) Maybe it will already be skipped by model._supports_static_cache as I commented above; if not, we can skip only test_assisted_decoding_with_num_logits_to_keep_1_static (it may be named a bit differently).
I switched to _supports_static_cache to skip the case. Gemma2 is a bit different: it uses HybridCache but claims _supports_static_cache = True, so I still skip it in the model test file. I will remove that skip after enabling HybridCache for assisted decoding, which I plan to do after this PR (pure StaticCache) is merged. Thanks.
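As a sketch of the guard being described (the helper name and test wiring below are illustrative; only the _supports_static_cache attribute itself comes from the library):

```python
# Hypothetical skip guard inside a generation test class; `model_class` is a placeholder.
def maybe_skip_static_cache_assisted_test(self, model_class):
    # Models that cannot allocate a plain StaticCache are skipped outright.
    if not getattr(model_class, "_supports_static_cache", False):
        self.skipTest(f"{model_class.__name__} does not support StaticCache")
    # Gemma2 advertises _supports_static_cache = True but actually uses
    # HybridCache, so it is still skipped in its own model test file until
    # HybridCache support for assisted decoding lands.
```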
Looks very nice, but we need to add a compile test to make sure this is compile compatible! The whole point of static cache is -> compile! 🤗
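A hedged sketch of what such a compile-compatibility check could look like; the checkpoints, compile mode, and prompt are illustrative assumptions, and a real test would use the library's tiny test models.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
assistant = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

# With a StaticCache the key/value tensors have fixed shapes, so the main
# model's forward pass can be compiled.
model.forward = torch.compile(model.forward, mode="reduce-overhead")

inputs = tokenizer("Static caches make torch.compile happy because", return_tensors="pt")

# Run assisted generation twice to make sure neither the compiled graphs nor
# the pre-allocated caches get stuck after the first run.
for _ in range(2):
    out = model.generate(
        **inputs,
        assistant_model=assistant,
        cache_implementation="static",
        max_new_tokens=16,
        do_sample=False,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```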
@ArthurZucker I added a compile test.
@ArthurZucker @zucchini-nlp , please let me know if there are any further comments, thanks. BTW, I checked the failed CI case; it is not related to my changes.
Thanks, re-triggered the tests; let's wait for the core maintainer.
@ArthurZucker , @zucchini-nlp , I am wondering whether we can land this PR in 2024 :)
@zucchini-nlp @ArthurZucker , any further comments on this?
@yao-matrix Arthur Zucker might be a bit busy, so tagging @gante to review if possible. We'll be able to merge after we get one more review.
Thanks. @gante , could you please help review? Thanks.
Looks good! IMO it would be better to update the test, and also to make sure we use a static cache for the assistant model when we use one for the parent model, plus auto-compile in those cases!
end = seq_length + 1
index = torch.arange(begin, end, device=self.key_cache[0].device)

self._seen_tokens = max_length
_seen_tokens should not be relied on! The static cache does not really rely on it.
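For context, a static cache can recover how many positions are filled from its own tensors, so no separate counter is needed. A hedged sketch of that idea follows; the helper and shapes are illustrative, not the library's exact implementation.

```python
import torch

def seen_tokens_from_static_cache(key_cache_layer: torch.Tensor) -> int:
    """Count filled positions in a pre-allocated [batch, heads, max_len, head_dim] key cache."""
    # One (batch, head) slice is enough, since all heads are written together:
    # a position counts as "seen" if any of its head_dim entries is non-zero.
    return int(key_cache_layer[0, 0].any(dim=-1).sum())

# Tiny illustration: room for 8 positions, 3 of them filled.
key_cache = torch.zeros(1, 2, 8, 4)
key_cache[:, :, :3] = torch.randn(1, 2, 3, 4)
print(seen_tokens_from_static_cache(key_cache))  # -> 3
```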
""" | ||
Tests that `.generate` is compatible with torch.compile without graph breaks, keeping the same results. Tests | ||
end-to-end compilation and forward pass compilation only. | ||
⚠️ Runs two sequential generations to ensure the cache doesn't get stuck after the first compiled run! ⚠️ | ||
""" |
In terms of performance, I think it would make more sense to test whether we can compile the forward of the model and the forward of the assistant model instead of this! Our recent focus has rather been on bridging the gap here, as compilation of generate is super, super slow!
I would thus also look at the changes introduced by #34247 to have something similar for the assistant model! 🤗
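A minimal sketch of the forward-only compilation this comment suggests, reusing the same illustrative Qwen2.5 pair as above; only the two forwards are compiled, never generate itself, and a test would compare the outputs against the uncompiled run.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
assistant = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

# Compile only the forward passes: much cheaper than compiling `generate`
# end to end, and StaticCache keeps the key/value shapes static on both sides.
model.forward = torch.compile(model.forward, mode="reduce-overhead")
assistant.forward = torch.compile(assistant.forward, mode="reduce-overhead")

inputs = tokenizer("Compiling both forwards for assisted generation", return_tensors="pt")
out = model.generate(
    **inputs,
    assistant_model=assistant,
    cache_implementation="static",
    max_new_tokens=16,
    do_sample=False,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```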
@gante , I implemented a version for this issue: #32946. Please help comment, and I can iterate. Thanks.