Description
While reading the transformers source code for multimodal models, I noticed a specific behavior: in the Idefics v2 and Idefics v3 models, if only inputs_embeds is passed, without input_ids, on the first call to the model, the following error is raised:
"When first calling the model, if inputs_embeds are passed, input_ids should not be None."
For example, in modeling_idefics2.py this ValueError is raised at line 1384.
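To make the condition concrete, here is a minimal sketch of the kind of guard I am describing. It is not copied from modeling_idefics2.py: the function name check_first_call_inputs and the past_seen_tokens parameter are hypothetical, and I am assuming that "first calling the model" means no tokens have been cached yet.

```python
from typing import Optional, Sequence


def check_first_call_inputs(
    input_ids: Optional[Sequence[int]],
    inputs_embeds: Optional[Sequence[Sequence[float]]],
    past_seen_tokens: int = 0,
) -> None:
    """Hypothetical stand-in for the guard described in this issue.

    Assumption: the "first call" is the one where nothing is cached yet,
    i.e. past_seen_tokens == 0.
    """
    is_first_call = past_seen_tokens == 0
    # Only the combination "embeddings given, ids missing, first call" is rejected.
    if inputs_embeds is not None and input_ids is None and is_first_call:
        raise ValueError(
            "When first calling the model, if inputs_embeds are passed, "
            "input_ids should not be None."
        )


# Embeddings only, on the first call: rejected.
try:
    check_first_call_inputs(input_ids=None, inputs_embeds=[[0.1, 0.2]])
except ValueError as err:
    print("raised:", err)

# Embeddings only, on a later call (some tokens already cached): accepted.
check_first_call_inputs(input_ids=None, inputs_embeds=[[0.1, 0.2]], past_seen_tokens=4)
```

Under this sketch, passing inputs_embeds together with input_ids on the first call is accepted, which matches the behavior I observe in Idefics v2 and v3.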
However, in Idefics v1, this exception does not appear.
I initially thought this might come from the difference between cross-attention-based and adapter-based multimodal architectures. However, other adapter-based multimodal models (e.g., LLaVA) do not raise this exception either.
So, why was this exception introduced in Idefics v2 and v3? If you could provide an explanation, I would greatly appreciate it!