
PaliGemma2 Processor returns wrong labels array when <image> token is present in text #35200

Closed
@probicheaux

Description

System Info

  • transformers version: 4.47.0
  • Platform: Linux-5.10.0-33-cloud-amd64-x86_64-with-glibc2.31
  • Python version: 3.9.1
  • Huggingface_hub version: 0.26.3
  • Safetensors version: 0.4.3
  • Accelerate version: 0.30.1
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: NO
    - mixed_precision: fp16
    - use_cpu: False
    - debug: False
    - num_processes: 1
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: 0
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: False
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: yes
  • GPU type: Tesla T4

Who can help?

@ArthurZucker @molbap we chatted about the last paligemma release :)

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Here is a script that shows the problem:

from transformers import PaliGemmaProcessor
from PIL import Image
import numpy as np

hf_token = "..."
processor = PaliGemmaProcessor.from_pretrained(
    "google/paligemma2-3b-pt-224", token=hf_token
)

suffix = ["4"]
image = [Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))]

# Case 1: text without the <image> token -- the labels array ends with EOS as expected.
text = ["How many shapes are green?"]
print(
    processor(
        images=image, text=text, suffix=suffix, return_tensors="pt", padding="longest"
    ).labels
)

# Case 2: same text with the <image> token prepended -- the labels array is missing the EOS token.
text = ["<image>How many shapes are green?"]
print(
    processor(
        images=image, text=text, suffix=suffix, return_tensors="pt", padding="longest"
    ).labels
)

Expected behavior

As the output shows, the second labels array (the one produced when the <image> token is present in the text) is missing the EOS token, which leads to bad fine-tunes. Yet the processor class itself warns me when the <image> token is not present in the text, so it steers users toward exactly the path that produces the broken labels. I would expect both calls to produce labels that end with the EOS token.
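For reference, here is a minimal sketch of the check I would expect to pass in both cases. It reuses processor, image, and suffix from the reproduction script above, and assumes the processor exposes its tokenizer as processor.tokenizer and masks non-suffix label positions with -100 (the usual ignore index); the helper name ends_with_eos is just illustrative.

import torch

def ends_with_eos(labels: torch.Tensor, eos_id: int) -> bool:
    # Keep only the positions that are not masked with -100 (ignore index),
    # then check that the final real label is the EOS id.
    real = labels[0][labels[0] != -100]
    return real.numel() > 0 and real[-1].item() == eos_id

eos_id = processor.tokenizer.eos_token_id
for prompt in ["How many shapes are green?", "<image>How many shapes are green?"]:
    out = processor(
        images=image, text=[prompt], suffix=suffix,
        return_tensors="pt", padding="longest",
    )
    # Expected: True for both prompts; currently the <image> variant returns False.
    print(repr(prompt), "labels end with EOS:", ends_with_eos(out.labels, eos_id))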

