
Conversation

@zucchini-nlp
Member

@zucchini-nlp zucchini-nlp commented Dec 11, 2024

What does this PR do?

This is quite a breaking PR, because from now on some processors will have a separate self.video_processor, and save/load priority is given to the new file names image/video_preprocessor_config.json. Save/load preserves BC and can load from any of the file names.

Reasons for PR:

  • Currently the main goal is to let video processors operate separately from image processors, with their own set of init attributes, as mentioned in the linked issue
  • Easier handling of multimodal models and more freedom in processing when images and videos have their own logic (e.g. llava-next-video)
Description - more verbose

Fixes #33504 by adding a separate base class for video processors, covering only the ones that come from VLMs, because video-only models like ViViT don't need both an image and a video processor. We can add it to other video-only models later, if needed.

The idea is to keep video processors as children of the image processor mixin, but the processor applies each transform to a whole video as a 4D array instead of looping over every frame. The only transform that still works on a per-frame basis is resize; it could maybe be rewritten in torch, but it's not worth the effort. In terms of speed, the new slow video processor performs the same as the old image processor and shows some speedup at larger batch sizes. The advantage here is mostly readability, cleaner code, and a separate API for videos.
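To make the 4D idea concrete, here is a minimal numpy sketch (my own illustration, not the PR's code; the rescale transform and the shapes are assumptions):

```python
import numpy as np

# A video as a single 4D array: (num_frames, height, width, channels).
video = np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8)

# Per-frame style: loop over frames and transform each one as an image.
rescaled_per_frame = np.stack([frame * (1 / 255.0) for frame in video])

# Whole-video style: apply the same transform once over the 4D array.
# Elementwise numpy/torch ops broadcast over the leading frame dimension for free.
rescaled_whole = video * (1 / 255.0)

assert np.allclose(rescaled_per_frame, rescaled_whole)
```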

For the fast video processors, it is basically the same as for images, except for the extra dimension we have for timeframes. Plus, I added convert_rgb in np and torch. Also, we use VideosKwargs instead of having to add a new TypedDict. See my comments below, in the files changed 👇🏻
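For a rough idea of what an array-based convert_rgb can look like, here is a numpy sketch (my own, not the PR's implementation; it only covers the grayscale and RGBA cases):

```python
import numpy as np

def convert_to_rgb(video: np.ndarray) -> np.ndarray:
    """Convert a (frames, height, width, channels) uint8 video to 3-channel RGB."""
    channels = video.shape[-1]
    if channels == 3:  # already RGB
        return video
    if channels == 1:  # grayscale: repeat the single channel
        return np.repeat(video, 3, axis=-1)
    if channels == 4:  # RGBA: composite over a white background
        rgb = video[..., :3].astype(np.float32)
        alpha = video[..., 3:].astype(np.float32) / 255.0
        return (rgb * alpha + 255.0 * (1.0 - alpha)).astype(video.dtype)
    raise ValueError(f"Unsupported number of channels: {channels}")
```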

What was done?

  • Separate VideoProcessorBase and VideoProcessorBaseFast are added; I recommend starting the review from there
  • Video processors have an autoclass mapping to load with AutoVideoProcessor
  • The image processor config is now called image_preprocessor_config.json. We keep BC by first trying to load from the new image_preprocessor_config.json if it exists, otherwise falling back to preprocessor_config.json with a deprecation warning (see the sketch after this list)
  • Most models already had a separate video processor, so nothing changes for them except for the parent class
  • Models where one preprocessor accepted both images and videos will continue preprocessing both modalities, but raise a deprecation warning (Qwen2VL and VideoLlava)
  • Tests: many tests changed and added 🛠️
  • The image/video/general processing tests were run on all models touched by this PR, including one slow generation test per model to make sure official checkpoints are loadable
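The BC fallback for config file names could look roughly like this (a hypothetical standalone helper for illustration; the real resolution logic lives inside from_pretrained):

```python
import os
import warnings

def resolve_image_processor_config(model_dir: str) -> str:
    """Prefer the new config file name; fall back to the legacy one with a warning."""
    new_path = os.path.join(model_dir, "image_preprocessor_config.json")
    legacy_path = os.path.join(model_dir, "preprocessor_config.json")
    if os.path.isfile(new_path):
        return new_path
    if os.path.isfile(legacy_path):
        warnings.warn(
            "Loading from legacy `preprocessor_config.json`; re-save the "
            "processor to write `image_preprocessor_config.json`.",
            FutureWarning,
        )
        return legacy_path
    raise FileNotFoundError(f"No image processor config found in {model_dir}")
```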
Usage examples

How it works when loading (the same goes for loading via the model-specific classes):

AutoProcessor.from_pretrained(model_id)
AutoImageProcessor.from_pretrained(model_id, use_fast=True/False)
AutoVideoProcessor.from_pretrained(model_id, use_fast=True/False, device="cpu"/"cuda")
  1. If the model was not updated and has only the old preprocessor_config.json: the processor grabs the old preprocessor_config and uses it for both the video and the image processing classes, with a shared set of args. A warning is emitted!

  2. If the model was updated and has image_preprocessor_config.json and video_preprocessor_config.json: the processor grabs those files and loads each preprocessor with its own set of args

  3. If the model has all types of config files and didn't delete the old preprocessor_config.json: priority is given to (image/video)_preprocessor_config.json, with a different set of args for each

When saving the model back, we save to the image_preprocessor_config.json and video_preprocessor_config.json files.
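For example, a save/load roundtrip following the file names described above (the checkpoint id is just an example):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")

# Saving writes one config file per modality, per the description above:
#   ./my-model/image_preprocessor_config.json
#   ./my-model/video_preprocessor_config.json
processor.save_pretrained("./my-model")

# Loading back picks up the new files, each preprocessor with its own args.
reloaded = AutoProcessor.from_pretrained("./my-model")
```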

Benchmark results on LLaVA-OneVision

Benchmarks over different batch sizes and numbers of frames sampled per video, also across devices and input types: numpy arrays, torch tensors, or lists of PIL frames.
(benchmark results attached as an image)
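A minimal timing harness along those lines (my own sketch; the checkpoint id and the exact call signature are assumptions):

```python
import time
import numpy as np
from transformers import AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")

for batch_size in (1, 4, 8):
    for num_frames in (8, 16, 32):
        # Random uint8 frames stand in for decoded video.
        videos = [
            np.random.randint(0, 256, (num_frames, 360, 640, 3), dtype=np.uint8)
            for _ in range(batch_size)
        ]
        start = time.perf_counter()
        processor(videos=videos, return_tensors="pt")
        print(f"batch={batch_size} frames={num_frames}: {time.perf_counter() - start:.3f}s")
```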

Next steps:

  • Next step is to add tests for video inputs in test_processing_common.py (there was a PR somewhere, but it probably went stale)
  • We might also need to rename "feature extractor name" to audio_preprocessor_config, just for consistency and easier navigation in hub configs
  • The frame sampling function we just added for video LLMs can now become a method of the video processor, so each video model has its own sampling pre-defined, which imo eases things for instruct inference
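A sketch of what such a per-model sampling method could look like (hypothetical; sample_frames is illustrative and not part of this PR):

```python
import numpy as np

class MyVideoProcessor:
    """Hypothetical processor with a model-specific sampling policy."""

    num_frames: int = 8  # pre-defined per model/checkpoint

    def sample_frames(self, video: np.ndarray) -> np.ndarray:
        """Uniformly sample `self.num_frames` frames from a (T, H, W, C) video."""
        total_frames = video.shape[0]
        indices = np.linspace(0, total_frames - 1, self.num_frames).round().astype(int)
        return video[indices]

video = np.random.randint(0, 256, (64, 360, 640, 3), dtype=np.uint8)
sampled = MyVideoProcessor().sample_frames(video)  # (8, 360, 640, 3)
```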

@zucchini-nlp zucchini-nlp changed the title Video processors as a separate class 🔴 Video processors as a separate class Dec 11, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp
Member Author

zucchini-nlp commented Jan 27, 2025

This one is ready for review; I guess the first review will be from @qubvel. The failing tests are unrelated and very flaky.

@zucchini-nlp zucchini-nlp requested a review from qubvel January 27, 2025 10:57
Contributor

@qubvel qubvel left a comment

Great work, thanks for working on this feature! I did a first, not very thorough, pass; see the comments below

@ArthurZucker ArthurZucker self-requested a review May 8, 2025 09:30
Collaborator

@ArthurZucker ArthurZucker left a comment

Very nice! I am just wondering if, for now and for the models where it makes sense, we can't just inherit from the image processor, and from the base video processor when we can call the same preprocessing functions!

elif is_remote_url(pretrained_model_name_or_path):
    video_processor_file = pretrained_model_name_or_path
    resolved_video_processor_file = download_url(pretrained_model_name_or_path)
else:
Collaborator

A whole bunch of this is re-used from other processors; I really don't mind some form of inheritance, or just using the same function if we can!

@ArthurZucker
Collaborator

TL;DR from internal comms:

long term we want to:

  • inherit from ImageProcessor and BaseVideoProcessor,
  • prepare the videos to call vmap and use the inherited ImageProcessor-defined functions
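A rough sketch of that direction (illustrative only: the class and method names are assumptions, and torch.vmap maps an inherited per-image transform over the frame dimension):

```python
import torch

class BaseImageProcessorFast:
    image_mean = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)
    image_std = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)

    def rescale_and_normalize(self, image: torch.Tensor) -> torch.Tensor:
        # image: (C, H, W) uint8 -> rescaled and normalized float
        image = image.float() / 255.0
        return (image - self.image_mean) / self.image_std

class BaseVideoProcessor(BaseImageProcessorFast):
    def process_video(self, video: torch.Tensor) -> torch.Tensor:
        # video: (T, C, H, W); vmap maps the inherited per-image
        # function over the frame dimension without a Python loop.
        return torch.vmap(self.rescale_and_normalize)(video)

video = torch.randint(0, 256, (8, 3, 224, 224), dtype=torch.uint8)
out = BaseVideoProcessor().process_video(video)  # (8, 3, 224, 224) float32
```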

@zucchini-nlp zucchini-nlp enabled auto-merge (squash) May 8, 2025 15:31
@zucchini-nlp zucchini-nlp disabled auto-merge May 8, 2025 15:31
@zucchini-nlp zucchini-nlp merged commit a31fa21 into huggingface:main May 12, 2025
20 checks passed
zucchini-nlp added a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
* initial design

* update all video processors

* add tests

* need to add qwen2-vl (not tested yet)

* add qwen2-vl in auto map

* fix copies

* isort

* resolve confilicts kinda

* nit:

* qwen2-vl is happy now

* qwen2-5 happy

* other models are happy

* fix copies

* fix tests

* add docs

* CI green now?

* add more tests

* even more changes + tests

* doc builder fail

* nit

* Update src/transformers/models/auto/processing_auto.py

Co-authored-by: Pavel Iakubovskii <[email protected]>

* small update

* imports correctly

* dump, otherwise this is getting unmanagebale T-T

* dump

* update

* another update

* update

* tests

* move

* modular

* docs

* test

* another update

* init

* remove flakiness in tests

* fixup

* clean up and remove commented lines

* docs

* skip this one!

* last fix after rebasing

* run fixup

* delete slow files

* remove unnecessary tests + clean up a bit

* small fixes

* fix tests

* more updates

* docs

* fix tests

* update

* style

* fix qwen2-5-vl

* fixup

* fixup

* unflatten batch when preparing

* dump, come back soon

* add docs and fix some tests

* how to guard this with new dummies?

* chat templates in qwen

* address some comments

* remove `Fast` suffix

* fixup

* oops should be imported from transforms

* typo in requires dummies

* new model added with video support

* fixup once more

* last fixup I hope

* revert image processor name + comments

* oh, this is why fetch test is failing

* fix tests

* fix more tests

* fixup

* add new models: internvl, smolvlm

* update docs

* imprt once

* fix failing tests

* do we need to guard it here again, why?

* new model was added, update it

* remove testcase from tester

* fix tests

* make style

* not related CI fail, lets' just fix here

* mark flaky for now, filas 15 out of 100

* style

* maybe we can do this way?

* don't download images in setup class

---------

Co-authored-by: Pavel Iakubovskii <[email protected]>
@yeliudev yeliudev mentioned this pull request May 26, 2025