Why can't inputs_embeds be used during the first generation in a multimodal model? #35131

@aohenuo

While reading the transformers source code for multimodal models, I noticed a specific behavior: in the Idefics v2 and Idefics v3 models, if only inputs_embeds is provided, without input_ids, on the first call, the following error is raised:
"When first calling the model, if inputs_embeds are passed, input_ids should not be None."
For example, the ValueError is raised at line 1384 of modeling_idefics2.py.
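
To make the behavior concrete, here is a rough sketch of the condition as I understand it. This is not the library code: the function names and the `past_seen_tokens` parameter are hypothetical, and treating "first call" as "nothing cached yet" is my assumption.

```python
from typing import Optional, Sequence


def check_first_call_inputs(
    input_ids: Optional[Sequence[int]],
    inputs_embeds: Optional[Sequence[Sequence[float]]],
    past_seen_tokens: int = 0,
) -> None:
    """Hypothetical stand-in for the guard described above (not library code)."""
    # Assumption: "first call" means no tokens have been cached yet.
    is_first_call = past_seen_tokens == 0
    if inputs_embeds is not None and input_ids is None and is_first_call:
        raise ValueError(
            "When first calling the model, if inputs_embeds are passed, "
            "input_ids should not be None."
        )


def raises_on(input_ids, inputs_embeds, past_seen_tokens: int = 0) -> bool:
    """Return True if the guard rejects this combination of inputs."""
    try:
        check_first_call_inputs(input_ids, inputs_embeds, past_seen_tokens)
    except ValueError:
        return True
    return False
```

Under these assumptions, `raises_on(None, [[0.1]])` would be rejected, while passing input_ids alongside inputs_embeds, or calling after some tokens are already cached, would go through.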

However, in Idefics v1, this exception does not appear.

I initially thought this might be due to differences between cross-attention-based and adapter-based multimodal architectures. However, other adapter-based multimodal models (e.g., LLaVA) do not raise this exception either.

So, why was this exception introduced in Idefics v2 and v3? If you could provide an explanation, I would greatly appreciate it!
