
PaliGemma2 Processor returns wrong labels array when <image> token is present in text #35200

Closed
@probicheaux

Description

System Info

  • transformers version: 4.47.0
  • Platform: Linux-5.10.0-33-cloud-amd64-x86_64-with-glibc2.31
  • Python version: 3.9.1
  • Huggingface_hub version: 0.26.3
  • Safetensors version: 0.4.3
  • Accelerate version: 0.30.1
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: NO
    - mixed_precision: fp16
    - use_cpu: False
    - debug: False
    - num_processes: 1
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: 0
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: False
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: yes
  • GPU type: Tesla T4

Who can help?

@ArthurZucker @molbap we chatted about the last paligemma release :)

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Here is a script that shows the problem:

from transformers import PaliGemmaProcessor
from PIL import Image
import numpy as np

hf_token = "..."
processor = PaliGemmaProcessor.from_pretrained(
    "google/paligemma2-3b-pt-224", token=hf_token
)

suffix = ["4"]
image = [Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))]

# Case 1: text without the <image> token -- the labels array ends with EOS as expected.
text = ["How many shapes are green?"]
print(
    processor(
        images=image, text=text, suffix=suffix, return_tensors="pt", padding="longest"
    ).labels
)

# Case 2: same text with the <image> token prepended -- the labels array is missing the EOS token.
text = ["<image>How many shapes are green?"]
print(
    processor(
        images=image, text=text, suffix=suffix, return_tensors="pt", padding="longest"
    ).labels
)

Expected behavior

As the output shows, the second labels array (the one produced when the <image> token is present in the text) is missing the EOS token, which leads to bad fine-tunes. Yet the processor class itself warns me when the <image> token is not present in the text, so it steers users toward exactly the path that produces the broken labels. I would expect both calls to produce labels that end with the EOS token.
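For reference, here is a minimal sketch of the check I would expect to pass in both cases. It reuses processor, image, and suffix from the reproduction script above, and assumes the processor exposes its tokenizer as processor.tokenizer and masks non-suffix label positions with -100 (the usual ignore index); the helper name ends_with_eos is just illustrative.

import torch

def ends_with_eos(labels: torch.Tensor, eos_id: int) -> bool:
    # Keep only the positions that are not masked with -100 (ignore index),
    # then check that the final real label is the EOS id.
    real = labels[0][labels[0] != -100]
    return real.numel() > 0 and real[-1].item() == eos_id

eos_id = processor.tokenizer.eos_token_id
for prompt in ["How many shapes are green?", "<image>How many shapes are green?"]:
    out = processor(
        images=image, text=[prompt], suffix=suffix,
        return_tensors="pt", padding="longest",
    )
    # Expected: True for both prompts; currently the <image> variant returns False.
    print(repr(prompt), "labels end with EOS:", ends_with_eos(out.labels, eos_id))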

