
Conversation

@Deep-unlearning (Contributor) commented on Dec 11, 2024

What does this PR do?

This PR introduces a caching mechanism for the added_tokens_encoder property in tokenizers to improve performance. Previously, the added_tokens_encoder mapping was recomputed every time the property was accessed, leading to redundant computation during tasks that frequently access it, such as tokenization or decoding.

Motivation and Context
The motivation for this change is to optimize tokenizer performance, especially in workflows that repeatedly access the added_tokens_encoder property. By caching the result, this PR reduces overhead and improves runtime efficiency without altering the existing behavior of the library.

Key changes:

  • The added_tokens_encoder mapping is now cached on first access and reused on subsequent calls; the cache is invalidated when the underlying added-token mapping changes (see the sketch below).
  • The caching mechanism is backward-compatible and avoids unnecessary recomputation.
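
As a rough illustration of the pattern described above (not the code from this PR itself): the class, the `_add_token` helper, and the explicit cache invalidation below are hypothetical simplifications; only the attribute names mirror the tokenizer internals discussed in this thread.

```python
# Minimal sketch of the caching idea, assuming explicit invalidation whenever the
# added-token mapping changes. The real tokenizer has many more responsibilities;
# `_add_token` is a hypothetical helper used only to show where invalidation happens.
from dataclasses import dataclass


@dataclass(frozen=True)
class AddedToken:
    content: str


class CachedEncoderSketch:
    def __init__(self):
        self._added_tokens_decoder = {}          # index -> AddedToken
        self._added_tokens_encoder_cache = None  # str -> index, built lazily

    def _add_token(self, index: int, token: AddedToken) -> None:
        self._added_tokens_decoder[index] = token
        self._added_tokens_encoder_cache = None  # decoder changed: drop the cache

    @property
    def added_tokens_encoder(self) -> dict:
        # Rebuild only when the cache was invalidated (or never built).
        if self._added_tokens_encoder_cache is None:
            self._added_tokens_encoder_cache = {
                token.content: index
                for index, token in sorted(
                    self._added_tokens_decoder.items(), key=lambda item: item[0]
                )
            }
        return self._added_tokens_encoder_cache


tok = CachedEncoderSketch()
tok._add_token(50257, AddedToken("<|startoftranscript|>"))
assert tok.added_tokens_encoder == {"<|startoftranscript|>": 50257}
assert tok.added_tokens_encoder is tok.added_tokens_encoder  # same cached dict between accesses
```

Repeated reads of the property then hit the cached dict until a new token is registered.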

Some benchmarks

Composite Results

| Model | Composite WER (%) | RTFx (with cache) | RTFx (without) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 7.92 | 278.32 | 202.95 | 36% |
| distil/whisper-distil-large-v3 | 7.52 | 282.46 | 214.42 | 32% |
| distil/whisper-distil-medium.en | 8.76 | 406.96 | 279.73 | 45% |
| openai/whisper-large | 7.94 | 167.43 | 143.76 | 16% |
| openai/whisper-large-v2 | 7.83 | 167.95 | 144.45 | 16% |
| openai/whisper-large-v3 | 7.44 | 169.26 | 145.51 | 16% |
| openai/whisper-large-v3-turbo | 7.83 | 268.72 | 197.98 | 36% |
| openai/whisper-medium.en | 8.09 | 222.49 | 182.13 | 22% |
| openai/whisper-small.en | 8.59 | 359.18 | 268.91 | 34% |
| openai/whisper-base.en | 10.32 | 483.69 | 320.67 | 50% |
| openai/whisper-tiny.en | 12.81 | 532.03 | 348.12 | 53% |
Details

AMI Results

| Model | AMI WER (%) | RTFx (with cache) | RTFx (without) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 14.67 | 120.15 | 103.50 | 16% |
| distil/whisper-distil-large-v3 | 15.16 | 119.29 | 104.33 | 14% |
| distil/whisper-distil-medium.en | 16.12 | 189.32 | 152.03 | 25% |
| openai/whisper-large | 16.73 | 82.81 | 76.15 | 9% |
| openai/whisper-large-v2 | 16.74 | 85.65 | 79.49 | 7% |
| openai/whisper-large-v3 | 15.95 | 84.31 | 77.97 | 8% |
| openai/whisper-large-v3-turbo | 16.13 | 116.17 | 98.83 | 18% |
| openai/whisper-medium.en | 16.68 | 78.47 | 76.86 | 2% |
| openai/whisper-small.en | 17.93 | 197.70 | 168.88 | 17% |
| openai/whisper-base.en | 21.13 | 224.91 | 181.10 | 24% |
| openai/whisper-tiny.en | 24.24 | 271.98 | 228.77 | 19% |

Earnings22 Results

| Model | Earnings22 WER (%) | RTFx (with cache) | RTFx (without) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 12.19 | 279.17 | 212.11 | 32% |
| distil/whisper-distil-large-v3 | 11.79 | 281.64 | 219.27 | 28% |
| distil/whisper-distil-medium.en | 12.99 | 408.40 | 291.33 | 40% |
| openai/whisper-large | 12.91 | 156.36 | 138.56 | 13% |
| openai/whisper-large-v2 | 12.05 | 173.81 | 151.92 | 14% |
| openai/whisper-large-v3 | 11.29 | 171.74 | 149.66 | 15% |
| openai/whisper-large-v3-turbo | 11.63 | 274.35 | 202.67 | 35% |
| openai/whisper-medium.en | 12.63 | 251.39 | 204.49 | 23% |
| openai/whisper-small.en | 12.97 | 390.44 | 303.05 | 29% |
| openai/whisper-base.en | 15.09 | 554.06 | 370.98 | 49% |
| openai/whisper-tiny.en | 19.12 | 439.19 | 323.27 | 36% |

Gigaspeech Results

| Model | GigaSpeech WER (%) | RTFx (with cache) | RTFx (without) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 10.32 | 242.64 | 178.28 | 26% |
| distil/whisper-distil-large-v3 | 10.08 | 245.04 | 185.02 | 32% |
| distil/whisper-distil-medium.en | 11.30 | 351.03 | 242.87 | 45% |
| openai/whisper-large | 10.76 | 137.20 | 118.69 | 16% |
| openai/whisper-large-v2 | 10.67 | 139.24 | 120.05 | 15% |
| openai/whisper-large-v3 | 10.02 | 141.93 | 122.97 | 16% |
| openai/whisper-large-v3-turbo | 10.14 | 229.71 | 168.52 | 36% |
| openai/whisper-medium.en | 11.03 | 177.60 | 151.70 | 17% |
| openai/whisper-small.en | 11.35 | 271.56 | 213.19 | 27% |
| openai/whisper-base.en | 12.83 | 357.94 | 253.20 | 41% |
| openai/whisper-tiny.en | 14.08 | 421.61 | 284.52 | 48% |

LibriSpeech Clean Results

| Model | LibriSpeech Clean WER (%) | RTFx (with cache) | RTFx (without) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 2.94 | 286.00 | 205.44 | 39% |
| distil/whisper-distil-large-v3 | 2.54 | 288.02 | 217.52 | 32% |
| distil/whisper-distil-medium.en | 3.69 | 415.82 | 280.95 | 48% |
| openai/whisper-large | 2.73 | 181.37 | 150.35 | 21% |
| openai/whisper-large-v2 | 2.83 | 159.01 | 135.81 | 17% |
| openai/whisper-large-v3 | 2.01 | 179.93 | 151.42 | 19% |
| openai/whisper-large-v3-turbo | 2.10 | 278.29 | 201.89 | 38% |
| openai/whisper-medium.en | 3.02 | 244.38 | 196.85 | 24% |
| openai/whisper-small.en | 3.05 | 408.91 | 280.23 | 46% |
| openai/whisper-base.en | 4.25 | 583.91 | 353.97 | 65% |
| openai/whisper-tiny.en | 5.66 | 639.70 | 376.14 | 70% |

LibriSpeech Other Results

| Model | LibriSpeech Other WER (%) | RTFx (with cache) | RTFx (without) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 6.84 | 248.08 | 177.63 | 40% |
| distil/whisper-distil-large-v3 | 5.19 | 259.09 | 199.72 | 30% |
| distil/whisper-distil-medium.en | 8.35 | 349.71 | 236.81 | 48% |
| openai/whisper-large | 5.54 | 164.39 | 138.73 | 18% |
| openai/whisper-large-v2 | 5.14 | 162.81 | 139.05 | 17% |
| openai/whisper-large-v3 | 3.91 | 163.21 | 140.22 | 16% |
| openai/whisper-large-v3-turbo | 4.24 | 257.22 | 188.87 | 36% |
| openai/whisper-medium.en | 5.85 | 222.76 | 181.65 | 23% |
| openai/whisper-small.en | 7.25 | 367.64 | 262.68 | 40% |
| openai/whisper-base.en | 10.35 | 445.31 | 293.26 | 52% |
| openai/whisper-tiny.en | 15.45 | 420.61 | 298.15 | 41% |

SPGISpeech Results

| Model | SPGISpeech WER (%) | RTFx (with cache) | RTFx (without) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 3.30 | 331.26 | 232.50 | 42% |
| distil/whisper-distil-large-v3 | 3.27 | 337.55 | 249.00 | 36% |
| distil/whisper-distil-medium.en | 3.83 | 478.64 | 318.96 | 50% |
| openai/whisper-large | 3.20 | 198.02 | 167.48 | 18% |
| openai/whisper-large-v2 | 3.87 | 196.77 | 166.89 | 18% |
| openai/whisper-large-v3 | 2.94 | 197.37 | 166.92 | 18% |
| openai/whisper-large-v3-turbo | 2.97 | 320.11 | 229.57 | 39% |
| openai/whisper-medium.en | 3.33 | 285.07 | 218.35 | 31% |
| openai/whisper-small.en | 3.60 | 427.56 | 307.90 | 39% |
| openai/whisper-base.en | 4.26 | 601.14 | 372.83 | 61% |
| openai/whisper-tiny.en | 5.93 | 648.97 | 398.03 | 63% |

TEDLIUM Results

| Model | TEDLIUM WER (%) | RTFx (with cache) | RTFx (without) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 4.87 | 274.60 | 197.85 | 39% |
| distil/whisper-distil-large-v3 | 3.86 | 294.14 | 217.54 | 35% |
| distil/whisper-distil-medium.en | 4.84 | 425.02 | 282.89 | 50% |
| openai/whisper-large | 3.91 | 166.87 | 143.34 | 16% |
| openai/whisper-large-v2 | 3.90 | 166.91 | 143.77 | 16% |
| openai/whisper-large-v3 | 3.86 | 166.75 | 142.18 | 17% |
| openai/whisper-large-v3-turbo | 3.57 | 288.34 | 199.61 | 44% |
| openai/whisper-medium.en | 4.11 | 237.28 | 185.40 | 28% |
| openai/whisper-small.en | 4.07 | 352.07 | 263.51 | 34% |
| openai/whisper-base.en | 4.87 | 507.93 | 336.00 | 51% |
| openai/whisper-tiny.en | 5.97 | 571.50 | 352.79 | 62% |

Voxpopuli Results

| Model | VoxPopuli WER (%) | RTFx (with cache) | RTFx (without) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 8.24 | 348.26 | 249.25 | 40% |
| distil/whisper-distil-large-v3 | 8.25 | 359.48 | 262.70 | 37% |
| distil/whisper-distil-medium.en | 9.00 | 525.00 | 345.95 | 52% |
| openai/whisper-large | 7.76 | 218.21 | 182.69 | 19% |
| openai/whisper-large-v2 | 7.48 | 219.32 | 182.27 | 20% |
| openai/whisper-large-v3 | 9.54 | 213.33 | 180.51 | 18% |
| openai/whisper-large-v3-turbo | 11.87 | 339.76 | 247.99 | 37% |
| openai/whisper-medium.en | 8.06 | 309.17 | 239.06 | 29% |
| openai/whisper-small.en | 8.50 | 478.84 | 336.49 | 42% |
| openai/whisper-base.en | 9.76 | 681.44 | 418.28 | 63% |
| openai/whisper-tiny.en | 12.00 | 647.46 | 405.49 | 60% |

Benchmark scripts are available here: https://github.com/huggingface/open_asr_leaderboard/tree/main/transformers

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker
@Vaibhavs10
This change was suggested by @pzelasko.

@Vaibhavs10 (Contributor) left a comment

🔥 🔥 🔥

@Vaibhavs10 requested a review from eustlb on December 11, 2024.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@pzelasko

Yup, a large part of the overhead is gone now. I ran a quick check and you save roughly 1 second per batch on the Open ASR Leaderboard for whisper-turbo. I think there may still be some overhead from the tokenizer, but I'm not sure how much exactly. You should recompute the RTFx on the full test set.

@ArthurZucker (Collaborator) left a comment

Hey! We already keep state in _added_tokens_encoder, which should already be filling this role!
The main issue is just that when performance matters, _added_tokens_encoder should be used directly instead of added_tokens_encoder!
But this is very, very welcome! Let's reduce the overhead!
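
(As a rough, hedged illustration of the distinction above, and not code from the PR or the transformers API: `_added_tokens_encoder` is the already-built private mapping, while the public `added_tokens_encoder` property rebuilt the mapping on every access before this PR. The helper below is hypothetical.)

```python
# Illustrative sketch: in a hot loop, reading the already-populated private dict avoids
# rebuilding the mapping on every access; the public property (pre-PR) recomputed it each time.
# `ids_for_added_tokens` is a hypothetical helper, not part of the transformers API.
def ids_for_added_tokens(tokenizer, tokens):
    encoder = tokenizer._added_tokens_encoder  # plain dict lookup, no recomputation
    return [encoder[token] for token in tokens if token in encoder]
```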

```python
        Returns the sorted mapping from string to index. The cache is dynamically invalidated if `_added_tokens_decoder`
        has changed since the last computation.
        """
        return {k.content: v for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])}
```
A Collaborator left an inline comment with a suggested change:

```diff
- return {k.content: v for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])}
+ return self._added_tokens_encoder
```

This would not work as-is: you probably need the content to be sorted, which is why we have the non-sorted _added_tokens_encoder. We could actually make it sorted (define it as an OrderedDict) since we deprecated Python <= 3.9!
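
For context, a hedged sketch of what a pre-sorted mapping could look like (the helper name is illustrative, not the library's API). Since Python 3.7, plain dicts also preserve insertion order, so building the mapping in index order once would let the property simply return it:

```python
# Illustrative sketch only: build the string -> index mapping once, in index order,
# so the public property can return it directly instead of re-sorting on every access.
# `build_sorted_added_tokens_encoder` is a hypothetical helper, not transformers code.
from collections import OrderedDict


def build_sorted_added_tokens_encoder(added_tokens_decoder):
    """added_tokens_decoder maps index -> AddedToken; return content -> index, sorted by index."""
    return OrderedDict(
        (token.content, index)
        for index, token in sorted(added_tokens_decoder.items(), key=lambda item: item[0])
    )
```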

@ArthurZucker added the "Core: Tokenization" label (Internals of the library; Tokenization) on Dec 23, 2024.
@ArthurZucker (Collaborator) commented

Also, you could run the benchmarks with both the slow and fast tokenizers; I don't know which one you are using!
