Why can't `inputs_embeds` be used during the first generation in a multimodal model?

While reading the multimodal model source code in transformers, I noticed a specific behavior: in the Idefics v2 and Idefics v3 models, if only `inputs_embeds` is provided without `input_ids` on the first call, the following error occurs:
***"When first calling the model, if inputs_embeds are passed, input_ids should not be None."***
For example, the code that raises this `ValueError` is at line 1384 of `modeling_idefics2.py`:
![image](https://github.com/user-attachments/assets/b037f6fe-7c18-42c9-8655-9a96941fd70e)
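
To make the behavior concrete, here is a minimal sketch of the kind of guard I am describing. It is not copied from the library: the function name `check_first_call` and the `past_seen_tokens` argument are my own assumptions, used only to model "first call" as "no tokens cached yet".

```python
from typing import Optional, Sequence

# Error text as quoted above.
ERROR_MESSAGE = (
    "When first calling the model, if inputs_embeds are passed, "
    "input_ids should not be None."
)


def check_first_call(
    input_ids: Optional[Sequence[int]],
    inputs_embeds: Optional[Sequence[Sequence[float]]],
    past_seen_tokens: int = 0,  # hypothetical: number of tokens already cached
) -> None:
    """Raise if only embeddings are supplied on the very first call."""
    is_first_call = past_seen_tokens == 0
    if inputs_embeds is not None and input_ids is None and is_first_call:
        raise ValueError(ERROR_MESSAGE)


# Only embeddings on the first call: rejected.
try:
    check_first_call(input_ids=None, inputs_embeds=[[0.1, 0.2]])
except ValueError as err:
    print(err)

# Embeddings together with ids, or embeddings on a later call: accepted.
check_first_call(input_ids=[1, 2], inputs_embeds=[[0.1, 0.2]])
check_first_call(input_ids=None, inputs_embeds=[[0.1, 0.2]], past_seen_tokens=2)
```

Under this sketch, passing only embeddings is rejected solely on the first call; later calls are not affected.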

However, in Idefics v1, this exception does not appear.

I initially thought this might be due to differences between cross-attention-based multimodal models and adapter-based architectures. However, other adapter-based multimodal models (e.g., LLaVA) do not raise this exception either.

So, why was this exception introduced in Idefics v2 and v3? If you could provide an explanation, I would greatly appreciate it!

Why can't `inputs_embeds` be used during the first generation in a multimodal model? #35131

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Why can't inputs_embeds be used during the first generation in a multimodal model? #35131

Description

Activity

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Why can't `inputs_embeds` be used during the first generation in a multimodal model? #35131