
WhisperTokenizer decode is offsetting timestamps incorrectly #34472

@wallrothm

System Info

  • transformers version: 4.47.0.dev0
  • Platform: Linux-5.15.0-1073-azure-x86_64-with-glibc2.35
  • Python version: 3.11.0rc1
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.2
  • Accelerate version: 0.31.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.1+cu121 (True)
  • Tensorflow version (GPU?): 2.16.1 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: yes
  • GPU type: Tesla T4

Who can help?

@ylacombe
@eustlb
@sanchit-gandhi

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Decoding the output from Whisper using the WhisperTokenizer seemingly offsets the timestamps incorrectly in consecutive chunks, which for long audio leads to timestamp accuracy degrading significantly over time.

I have not found any open bug report on this matter. Issue #31942 and the PR intended to fix it, #32131, are related, and hence I have added @sanchit-gandhi to this issue as well.

From my understanding, the above-mentioned PR solves it under the assumption that the predicted timestamps always span the entire previous chunk, so that incrementing the timestamps in consecutive chunks based on cur_max_timestamp would be correct. However, cur_max_timestamp does not offset the timestamps correctly in general. The example described in #32131 (comment) does generate the correct output, but unfortunately slightly altering the silence leads to incorrect timestamps.
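To make the failure mode concrete, here is a minimal sketch of the offsetting logic as I understand it (the name cur_max_timestamp follows #32131; everything else is a simplified illustration, not the actual tokenizer code):

# Simplified sketch of the suspected decode logic, NOT the real
# WhisperTokenizer implementation: whenever a timestamp token is lower
# than the previous one, a new chunk is assumed and an offset is added.

def decode_offsets(relative_timestamps):
    offset = 0.0
    cur_max_timestamp = 0.0
    absolute = []
    for t in relative_timestamps:
        if t < cur_max_timestamp:
            # New chunk detected: offset by the previous chunk's max
            # timestamp (15.00 here) instead of the real stride (30.00).
            offset += cur_max_timestamp
        cur_max_timestamp = t
        absolute.append(offset + t)
    return absolute

# The first chunk's speech ends at 15.00 and is followed by silence, but
# generation still advances the audio window by the full 30 s, so the
# second chunk's relative 6.76 should map to 36.76:
print(decode_offsets([6.38, 11.32, 15.00, 6.76]))
# prints [6.38, 11.32, 15.0, 21.76] -- the last value should be 36.76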

The following snippet should reproduce the issue (simply increasing the silence in that example from 15 s to 16 s):

from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, AutoProcessor
import numpy as np

# load model + processor
processor = AutoProcessor.from_pretrained("openai/whisper-small.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small.en")

# load dataset
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]["array"]
sampling_rate = dataset[0]["audio"]["sampling_rate"]

# insert 16 s of silence after the first 15 s of speech
sample = np.concatenate([sample[:15 * sampling_rate], np.zeros(16 * sampling_rate), sample[15 * sampling_rate:]])

# pre-process
inputs = processor(
    sample,
    sampling_rate=16_000,
    padding="longest",
    truncation=False,
    return_attention_mask=True,
    return_tensors="pt",
)

# inference
output = model.generate(**inputs, return_timestamps=True, return_segments=True)

# pass token ids to processor's decode method
result = processor.batch_decode(output["sequences"], skip_special_tokens=True, output_offsets=True)

# format output offsets for readability
print("\n".join([f"{chunk['timestamp'][0]:.2f} -> {chunk['timestamp'][1]:.2f} : {chunk['text']}" for chunk in result[0]["offsets"]]))

which results in the following output:

0.00 -> 6.38 :  Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.
6.38 -> 11.32 :  Nor is Mr. Quilter's manner less interesting than his matter.
11.32 -> 15.00 :  He tells us that at this festive season of the year,
15.00 -> 21.76 :  With Christmas and roast beef looming before us, similes drawn from eating and its results
21.76 -> 24.80 :  occur most readily to the mind.
24.80 -> 30.38 :  He has grave doubts whether Sir Frederick Layton's work is really Greek after all and
30.38 -> 34.00 :  can discover in it but little of rocky Ithaca.
34.00 -> 41.28 :  Lenell's pictures are a sort of up-guards-and-atom paintings, and Mason's exquisite ittles
41.28 -> 49.12 :  are as national as a jingo poem. Mr. Burkett fosters landscape's smile at one much in
49.12 -> 55.76 :  the same way that Mr. Karker used to flash his teeth. And Mr. John Collier gives his
55.76 -> 62.16 :  sitter a cheerful slap on the back before he says, like a shampoo or in a Turkish bath,
62.16 -> 63.16 :  Next Man

while inspecting output["segments"] gives the following segment timestamps:

0.00 -> 6.38
6.38 -> 11.32
11.32 -> 15.00
30.00 -> 36.76
36.76 -> 39.80
39.80 -> 45.38
45.38 -> 49.00
49.00 -> 56.28
56.28 -> 64.12
64.12 -> 70.76
70.76 -> 77.16
77.16 -> 78.16

which in turn are close to the output that https://github.com/openai/whisper generates, and explains how the fourth segment becomes 15.00 -> 21.76 instead of the expected 30.00 -> 36.76: the decoded offsets shift it by the previous chunk's last timestamp (15.00) rather than the actual chunk start (30.00).
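For completeness, the misalignment is a constant shift from the fourth segment onwards, matching the 30.00 - 15.00 = 15.00 s gap between the real chunk start and the previous chunk's last timestamp (both lists copied from the outputs above):

# segment starts from batch_decode(..., output_offsets=True) vs. the
# starts reported in output["segments"]
decoded = [0.00, 6.38, 11.32, 15.00, 21.76, 24.80, 30.38, 34.00, 41.28, 49.12, 55.76, 62.16]
segments = [0.00, 6.38, 11.32, 30.00, 36.76, 39.80, 45.38, 49.00, 56.28, 64.12, 70.76, 77.16]

print([round(s - d, 2) for d, s in zip(decoded, segments)])
# prints [0.0, 0.0, 0.0, 15.0, 15.0, 15.0, 15.0, 15.0, 15.0, 15.0, 15.0, 15.0]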

Expected behavior

I would expect the WhisperTokenizer to handle timestamp offsets correctly when decoding, so that timestamps do not become misaligned with their corresponding chunks.
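As a workaround sketch in the meantime (assuming the segment dicts returned by generate with return_segments=True carry "start", "end" and "tokens" entries, which is how they appear above but may differ across versions), the correct timestamps can be read from the segments directly and each segment's tokens decoded separately:

# hypothetical workaround, relying on the assumed segment dict layout
for segment in output["segments"][0]:
    text = processor.decode(segment["tokens"], skip_special_tokens=True)
    print(f"{float(segment['start']):.2f} -> {float(segment['end']):.2f} : {text}")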
