
Conversation

@Deep-unlearning (Contributor) commented on Dec 11, 2024

What does this PR do?

This PR introduces a caching mechanism for the added_tokens_encoder property in tokenizers to improve performance. Previously, the added_tokens_encoder mapping was recomputed every time the property was accessed, leading to redundant computation during tasks that frequently access it, such as tokenization or decoding.

Motivation and Context
The motivation for this change is to optimize tokenizer performance, especially in workflows that repeatedly access the added_tokens_encoder property. By caching the result, this PR reduces overhead and improves runtime efficiency without altering the existing behavior of the library.

Key changes:

  • The added_tokens_encoder mapping is now cached on first access and reused on subsequent calls; the cache is invalidated when the underlying added-token mapping changes (see the sketch below).
  • The caching mechanism is backward-compatible and avoids unnecessary recomputation.
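
As a rough illustration of the pattern described above (not the code from this PR itself): the class, the `_add_token` helper, and the explicit cache invalidation below are hypothetical simplifications; only the attribute names mirror the tokenizer internals discussed in this thread.

```python
# Minimal sketch of the caching idea, assuming explicit invalidation whenever the
# added-token mapping changes. The real tokenizer has many more responsibilities;
# `_add_token` is a hypothetical helper used only to show where invalidation happens.
from dataclasses import dataclass


@dataclass(frozen=True)
class AddedToken:
    content: str


class CachedEncoderSketch:
    def __init__(self):
        self._added_tokens_decoder = {}          # index -> AddedToken
        self._added_tokens_encoder_cache = None  # str -> index, built lazily

    def _add_token(self, index: int, token: AddedToken) -> None:
        self._added_tokens_decoder[index] = token
        self._added_tokens_encoder_cache = None  # decoder changed: drop the cache

    @property
    def added_tokens_encoder(self) -> dict:
        # Rebuild only when the cache was invalidated (or never built).
        if self._added_tokens_encoder_cache is None:
            self._added_tokens_encoder_cache = {
                token.content: index
                for index, token in sorted(
                    self._added_tokens_decoder.items(), key=lambda item: item[0]
                )
            }
        return self._added_tokens_encoder_cache


tok = CachedEncoderSketch()
tok._add_token(50257, AddedToken("<|startoftranscript|>"))
assert tok.added_tokens_encoder == {"<|startoftranscript|>": 50257}
assert tok.added_tokens_encoder is tok.added_tokens_encoder  # same cached dict between accesses
```

Repeated reads of the property then hit the cached dict until a new token is registered.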

Some benchmarks

Composite Results

| Model | Composite WER (%) | RTFx (with cache) | RTFx (without) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 7.92 | 278.32 | 202.95 | 36% |
| distil/whisper-distil-large-v3 | 7.52 | 282.46 | 214.42 | 32% |
| distil/whisper-distil-medium.en | 8.76 | 406.96 | 279.73 | 45% |
| openai/whisper-large | 7.94 | 167.43 | 143.76 | 16% |
| openai/whisper-large-v2 | 7.83 | 167.95 | 144.45 | 16% |
| openai/whisper-large-v3 | 7.44 | 169.26 | 145.51 | 16% |
| openai/whisper-large-v3-turbo | 7.83 | 268.72 | 197.98 | 36% |
| openai/whisper-medium.en | 8.09 | 222.49 | 182.13 | 22% |
| openai/whisper-small.en | 8.59 | 359.18 | 268.91 | 34% |
| openai/whisper-base.en | 10.32 | 483.69 | 320.67 | 50% |
| openai/whisper-tiny.en | 12.81 | 532.03 | 348.12 | 53% |
Details

AMI Results

| Model | AMI WER (%) | RTFx (with cache) | RTFx (without) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 14.67 | 120.15 | 103.50 | 16% |
| distil/whisper-distil-large-v3 | 15.16 | 119.29 | 104.33 | 14% |
| distil/whisper-distil-medium.en | 16.12 | 189.32 | 152.03 | 25% |
| openai/whisper-large | 16.73 | 82.81 | 76.15 | 9% |
| openai/whisper-large-v2 | 16.74 | 85.65 | 79.49 | 7% |
| openai/whisper-large-v3 | 15.95 | 84.31 | 77.97 | 8% |
| openai/whisper-large-v3-turbo | 16.13 | 116.17 | 98.83 | 18% |
| openai/whisper-medium.en | 16.68 | 78.47 | 76.86 | 2% |
| openai/whisper-small.en | 17.93 | 197.70 | 168.88 | 17% |
| openai/whisper-base.en | 21.13 | 224.91 | 181.10 | 24% |
| openai/whisper-tiny.en | 24.24 | 271.98 | 228.77 | 19% |

Earnings22 Results

| Model | Earnings22 WER (%) | RTFx (with cache) | RTFx (without) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 12.19 | 279.17 | 212.11 | 32% |
| distil/whisper-distil-large-v3 | 11.79 | 281.64 | 219.27 | 28% |
| distil/whisper-distil-medium.en | 12.99 | 408.40 | 291.33 | 40% |
| openai/whisper-large | 12.91 | 156.36 | 138.56 | 13% |
| openai/whisper-large-v2 | 12.05 | 173.81 | 151.92 | 14% |
| openai/whisper-large-v3 | 11.29 | 171.74 | 149.66 | 15% |
| openai/whisper-large-v3-turbo | 11.63 | 274.35 | 202.67 | 35% |
| openai/whisper-medium.en | 12.63 | 251.39 | 204.49 | 23% |
| openai/whisper-small.en | 12.97 | 390.44 | 303.05 | 29% |
| openai/whisper-base.en | 15.09 | 554.06 | 370.98 | 49% |
| openai/whisper-tiny.en | 19.12 | 439.19 | 323.27 | 36% |

Gigaspeech Results

| Model | GigaSpeech WER (%) | RTFx (with cache) | RTFx (without) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 10.32 | 242.64 | 178.28 | 26% |
| distil/whisper-distil-large-v3 | 10.08 | 245.04 | 185.02 | 32% |
| distil/whisper-distil-medium.en | 11.30 | 351.03 | 242.87 | 45% |
| openai/whisper-large | 10.76 | 137.20 | 118.69 | 16% |
| openai/whisper-large-v2 | 10.67 | 139.24 | 120.05 | 15% |
| openai/whisper-large-v3 | 10.02 | 141.93 | 122.97 | 16% |
| openai/whisper-large-v3-turbo | 10.14 | 229.71 | 168.52 | 36% |
| openai/whisper-medium.en | 11.03 | 177.60 | 151.70 | 17% |
| openai/whisper-small.en | 11.35 | 271.56 | 213.19 | 27% |
| openai/whisper-base.en | 12.83 | 357.94 | 253.20 | 41% |
| openai/whisper-tiny.en | 14.08 | 421.61 | 284.52 | 48% |

LibriSpeech Clean Results

| Model | LibriSpeech Clean WER (%) | RTFx (with cache) | RTFx (without) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 2.94 | 286.00 | 205.44 | 39% |
| distil/whisper-distil-large-v3 | 2.54 | 288.02 | 217.52 | 32% |
| distil/whisper-distil-medium.en | 3.69 | 415.82 | 280.95 | 48% |
| openai/whisper-large | 2.73 | 181.37 | 150.35 | 21% |
| openai/whisper-large-v2 | 2.83 | 159.01 | 135.81 | 17% |
| openai/whisper-large-v3 | 2.01 | 179.93 | 151.42 | 19% |
| openai/whisper-large-v3-turbo | 2.10 | 278.29 | 201.89 | 38% |
| openai/whisper-medium.en | 3.02 | 244.38 | 196.85 | 24% |
| openai/whisper-small.en | 3.05 | 408.91 | 280.23 | 46% |
| openai/whisper-base.en | 4.25 | 583.91 | 353.97 | 65% |
| openai/whisper-tiny.en | 5.66 | 639.70 | 376.14 | 70% |

LibriSpeech Other Results

| Model | LibriSpeech Other WER (%) | RTFx (with cache) | RTFx (without) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 6.84 | 248.08 | 177.63 | 40% |
| distil/whisper-distil-large-v3 | 5.19 | 259.09 | 199.72 | 30% |
| distil/whisper-distil-medium.en | 8.35 | 349.71 | 236.81 | 48% |
| openai/whisper-large | 5.54 | 164.39 | 138.73 | 18% |
| openai/whisper-large-v2 | 5.14 | 162.81 | 139.05 | 17% |
| openai/whisper-large-v3 | 3.91 | 163.21 | 140.22 | 16% |
| openai/whisper-large-v3-turbo | 4.24 | 257.22 | 188.87 | 36% |
| openai/whisper-medium.en | 5.85 | 222.76 | 181.65 | 23% |
| openai/whisper-small.en | 7.25 | 367.64 | 262.68 | 40% |
| openai/whisper-base.en | 10.35 | 445.31 | 293.26 | 52% |
| openai/whisper-tiny.en | 15.45 | 420.61 | 298.15 | 41% |

SPGISpeech Results

| Model | SPGISpeech WER (%) | RTFx (with cache) | RTFx (without) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 3.30 | 331.26 | 232.50 | 42% |
| distil/whisper-distil-large-v3 | 3.27 | 337.55 | 249.00 | 36% |
| distil/whisper-distil-medium.en | 3.83 | 478.64 | 318.96 | 50% |
| openai/whisper-large | 3.20 | 198.02 | 167.48 | 18% |
| openai/whisper-large-v2 | 3.87 | 196.77 | 166.89 | 18% |
| openai/whisper-large-v3 | 2.94 | 197.37 | 166.92 | 18% |
| openai/whisper-large-v3-turbo | 2.97 | 320.11 | 229.57 | 39% |
| openai/whisper-medium.en | 3.33 | 285.07 | 218.35 | 31% |
| openai/whisper-small.en | 3.60 | 427.56 | 307.90 | 39% |
| openai/whisper-base.en | 4.26 | 601.14 | 372.83 | 61% |
| openai/whisper-tiny.en | 5.93 | 648.97 | 398.03 | 63% |

TEDLIUM Results

| Model | TEDLIUM WER (%) | RTFx (with cache) | RTFx (without) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 4.87 | 274.60 | 197.85 | 39% |
| distil/whisper-distil-large-v3 | 3.86 | 294.14 | 217.54 | 35% |
| distil/whisper-distil-medium.en | 4.84 | 425.02 | 282.89 | 50% |
| openai/whisper-large | 3.91 | 166.87 | 143.34 | 16% |
| openai/whisper-large-v2 | 3.90 | 166.91 | 143.77 | 16% |
| openai/whisper-large-v3 | 3.86 | 166.75 | 142.18 | 17% |
| openai/whisper-large-v3-turbo | 3.57 | 288.34 | 199.61 | 44% |
| openai/whisper-medium.en | 4.11 | 237.28 | 185.40 | 28% |
| openai/whisper-small.en | 4.07 | 352.07 | 263.51 | 34% |
| openai/whisper-base.en | 4.87 | 507.93 | 336.00 | 51% |
| openai/whisper-tiny.en | 5.97 | 571.50 | 352.79 | 62% |

Voxpopuli Results

| Model | VoxPopuli WER (%) | RTFx (with cache) | RTFx (without) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 8.24 | 348.26 | 249.25 | 40% |
| distil/whisper-distil-large-v3 | 8.25 | 359.48 | 262.70 | 37% |
| distil/whisper-distil-medium.en | 9.00 | 525.00 | 345.95 | 52% |
| openai/whisper-large | 7.76 | 218.21 | 182.69 | 19% |
| openai/whisper-large-v2 | 7.48 | 219.32 | 182.27 | 20% |
| openai/whisper-large-v3 | 9.54 | 213.33 | 180.51 | 18% |
| openai/whisper-large-v3-turbo | 11.87 | 339.76 | 247.99 | 37% |
| openai/whisper-medium.en | 8.06 | 309.17 | 239.06 | 29% |
| openai/whisper-small.en | 8.50 | 478.84 | 336.49 | 42% |
| openai/whisper-base.en | 9.76 | 681.44 | 418.28 | 63% |
| openai/whisper-tiny.en | 12.00 | 647.46 | 405.49 | 60% |

Benchmark scripts are available here: https://github.com/huggingface/open_asr_leaderboard/tree/main/transformers

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker
@Vaibhavs10
This change was suggested by @pzelasko.

@Vaibhavs10 (Contributor) left a comment

🔥 🔥 🔥

@Vaibhavs10 requested a review from eustlb on December 11, 2024.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@pzelasko

Yup, a large part of the overhead is gone now. I ran a quick check and you save roughly 1 second per batch on the Open ASR Leaderboard for whisper-turbo. I think there may still be some overhead from the tokenizer, but I'm not sure how much exactly. You should recompute the RTFx on the full test set.

@ArthurZucker (Collaborator) left a comment

Hey! We already keep state in _added_tokens_encoder, which should already be filling this role!
The main issue is just that when performance matters, _added_tokens_encoder should be used directly instead of added_tokens_encoder!
But this is very, very welcome! Let's reduce the overhead!
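
(As a rough, hedged illustration of the distinction above, and not code from the PR or the transformers API: `_added_tokens_encoder` is the already-built private mapping, while the public `added_tokens_encoder` property rebuilt the mapping on every access before this PR. The helper below is hypothetical.)

```python
# Illustrative sketch: in a hot loop, reading the already-populated private dict avoids
# rebuilding the mapping on every access; the public property (pre-PR) recomputed it each time.
# `ids_for_added_tokens` is a hypothetical helper, not part of the transformers API.
def ids_for_added_tokens(tokenizer, tokens):
    encoder = tokenizer._added_tokens_encoder  # plain dict lookup, no recomputation
    return [encoder[token] for token in tokens if token in encoder]
```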

```python
        Returns the sorted mapping from string to index. The cache is dynamically invalidated if `_added_tokens_decoder`
        has changed since the last computation.
        """
        return {k.content: v for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])}
```
A Collaborator left an inline comment with a suggested change:

```diff
- return {k.content: v for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])}
+ return self._added_tokens_encoder
```

This would not work as-is: you probably need the content to be sorted, which is why we have the non-sorted _added_tokens_encoder. We could actually make it sorted (define it as an OrderedDict) since we deprecated Python <= 3.9!
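
For context, a hedged sketch of what a pre-sorted mapping could look like (the helper name is illustrative, not the library's API). Since Python 3.7, plain dicts also preserve insertion order, so building the mapping in index order once would let the property simply return it:

```python
# Illustrative sketch only: build the string -> index mapping once, in index order,
# so the public property can return it directly instead of re-sorting on every access.
# `build_sorted_added_tokens_encoder` is a hypothetical helper, not transformers code.
from collections import OrderedDict


def build_sorted_added_tokens_encoder(added_tokens_decoder):
    """added_tokens_decoder maps index -> AddedToken; return content -> index, sorted by index."""
    return OrderedDict(
        (token.content, index)
        for index, token in sorted(added_tokens_decoder.items(), key=lambda item: item[0])
    )
```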

@ArthurZucker added the "Core: Tokenization" label (Internals of the library; Tokenization) on Dec 23, 2024.
@ArthurZucker (Collaborator) commented

Also, you could run the benchmarks with both the slow and fast tokenizers; I don't know which one you are using!
