LlavaForConditionalGeneration._merge_input_ids_with_image_features throws error #35169

NicolasDrapier · 2024-12-09T13:40:25Z

System Info

transformers version: 4.43.1
Platform: Linux-6.8.5-1-default-x86_64-with-glibc2.39
Python version: 3.11.9
Huggingface_hub version: 0.23.5
Safetensors version: 0.4.3
Accelerate version: 0.29.3
Accelerate config: - compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- dynamo_config: {'dynamo_backend': 'INDUCTOR'}
PyTorch version (GPU?): 2.4.0+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?: True
Using GPU in script?: True
GPU type: NVIDIA L40S

Who can help?

No response

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Description

I am trying to use the AutoAWQ library to quantize a Pixtral model (mistral-community/Pixtral-Large-Instruct-2411). However, I am encountering the following error:

File "/quantization/quant/lib64/python3.11/site-packages/transformers/models/llava/modeling_llava.py", line 303, in _merge_input_ids_with_image_features
    num_images, num_image_patches, embed_dim = image_features.shape
                                               ^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'shape'

Code

Here is the code I am using:

import os
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = r'/data/models/mistral/pixtral-large-instruct-2411' # from https://huggingface.co/mistral-community/Pixtral-Large-Instruct-2411
quant_path = r'/data/models/mistral/pixtral-large-instruct-2411-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
os.makedirs(quant_path, exist_ok=True)

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f'Model is quantized and saved at "{quant_path}"')

Analysis

The model I am using is Pixtral-Large-Instruct-2411, but its configuration declares the LlavaForConditionalGeneration architecture. The problem is in the Transformers source code: image_features stays None whenever pixel_values is None. As a result, the first line of _merge_input_ids_with_image_features, num_images, num_image_patches, embed_dim = image_features.shape, tries to read the shape attribute of None and raises an AttributeError.

image_features = None
if pixel_values is not None:
    image_features = self.get_image_features(
        pixel_values=pixel_values,
        vision_feature_layer=vision_feature_layer,
        vision_feature_select_strategy=vision_feature_select_strategy,
    )

if legacy_processing:
    logger.warning_once(
        "Expanding inputs for image tokens in LLaVa should be done in processing. "
        "Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly "
        "with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. "
        "Using processors without these attributes in the config is deprecated and will throw an error in v4.50."
    )
    # prefill stage vs decoding stage (legacy behavior copied)
    if input_ids.shape[1] != 1:
        inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
            image_features, inputs_embeds, input_ids, attention_mask, labels # <-- image_features is still None here
        )
        cache_position = torch.arange(attention_mask.shape[1], device=attention_mask.device)
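
To make the failure mode concrete, here is a minimal, library-free sketch. The names merge_stub and guarded_merge are hypothetical and are not part of Transformers; merge_stub only mirrors the first line of _merge_input_ids_with_image_features, and the guard is one possible way to skip the merge for text-only inputs, not the fix proposed upstream.

```python
from types import SimpleNamespace


def merge_stub(image_features, inputs_embeds):
    # Hypothetical stand-in: reproduces only the shape unpacking that
    # fails in the traceback when image_features is None.
    num_images, num_image_patches, embed_dim = image_features.shape
    return inputs_embeds, (num_images, num_image_patches, embed_dim)


def guarded_merge(image_features, inputs_embeds):
    # Assumed guard: with no image features (text-only input), return the
    # text embeddings unchanged instead of attempting the merge.
    if image_features is None:
        return inputs_embeds, None
    return merge_stub(image_features, inputs_embeds)


# Text-only input: the unguarded call fails the same way as the traceback.
try:
    merge_stub(None, "text_embeds")
    unguarded_error = None
except AttributeError as exc:
    unguarded_error = str(exc)

# The guarded call passes the embeddings through untouched.
text_only_result = guarded_merge(None, "text_embeds")

# With image features present, both paths behave identically.
fake_features = SimpleNamespace(shape=(1, 4, 8))  # placeholder for a tensor
image_result = guarded_merge(fake_features, "text_embeds")
```

In this sketch the string "text_embeds" and the SimpleNamespace object are placeholders for real tensors; only the None handling is being illustrated.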

Steps to Reproduce

1. Ensure the Pixtral-Large-Instruct-2411 model is available at the specified path.
2. Run the provided code snippet.

Actual Behavior

An AttributeError is raised due to image_features being None.

Expected behavior

The model should be loaded, quantized, and saved without any errors.


zucchini-nlp · 2024-12-09T13:58:23Z

@NicolasDrapier Indeed, LLaVA currently cannot work with text-only inputs and always expects an image as a complementary input, which is why it breaks when quantizing. The issue is known and should be fixed by #34502; we should no longer support _merge_input_ids_with_image_features.

NicolasDrapier added the bug label Dec 9, 2024

zucchini-nlp added the Multimodal and WIP labels Dec 9, 2024

zucchini-nlp self-assigned this Dec 9, 2024

zucchini-nlp linked a pull request Dec 10, 2024 that will close this issue

VLMs: major clean up 🧼 #34502

Open
