Why can't inputs_embeds be used during the first generation in a multimodal model? #35131

Closed
@aohenuo

Description

While reading the transformers source code for multimodal models, I noticed a specific behavior: in the Idefics v2 and Idefics v3 models, if only inputs_embeds is provided, without input_ids, on the first call to the model, the following error is raised:
"When first calling the model, if inputs_embeds are passed, input_ids should not be None."
For example, the code that raises this ValueError is at line 1384 of modeling_idefics2.py.
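
To make the condition concrete, here is a minimal, self-contained sketch of the kind of guard I am describing. It is not the library's code: the function name `check_first_call_inputs` and the `past_seen_tokens` parameter are hypothetical, and I am assuming that "first calling the model" means no tokens have been cached yet.

```python
from typing import Optional, Sequence


def check_first_call_inputs(
    input_ids: Optional[Sequence[int]],
    inputs_embeds: Optional[Sequence[Sequence[float]]],
    past_seen_tokens: int = 0,  # assumption: 0 means this is the first call
) -> None:
    """Hypothetical stand-in for the guard; raises like the error quoted above."""
    if inputs_embeds is not None and input_ids is None and past_seen_tokens == 0:
        raise ValueError(
            "When first calling the model, if inputs_embeds are passed, "
            "input_ids should not be None."
        )


# Embeddings only, nothing cached yet: the guard rejects the call.
try:
    check_first_call_inputs(input_ids=None, inputs_embeds=[[0.1, 0.2]])
except ValueError as err:
    print(f"rejected: {err}")

# Embeddings only, but some tokens are already cached: the guard passes.
check_first_call_inputs(input_ids=None, inputs_embeds=[[0.1, 0.2]], past_seen_tokens=4)
```

Under this reading, the restriction applies only to the first forward pass; later decoding steps with embeddings alone would pass the check.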

However, in Idefics v1, this exception does not appear.

I initially thought this might be due to the difference between cross-attention-based and adapter-based multimodal architectures. However, other adapter-based multimodal models (e.g., LLaVA) do not raise this exception either.

So why was this exception introduced in Idefics v2 and v3? I would greatly appreciate an explanation!
