
Conversation

@zucchini-nlp
Member

@zucchini-nlp zucchini-nlp commented Dec 11, 2024

What does this PR do?

This is quite a breaking PR, because from now on some processors will have a separate self.video_processor, and save/load priority is given to the new file names image/video_preprocessor_config.json. Save/load preserves BC and can load from any of the file names.

Reasons for PR:

  • Currently the main goal is to let video processors operate separately from image processors, with their own set of init attributes, as mentioned in the linked issue
  • Easier handling of multimodal models and more freedom in processing when images and videos have their own logic (e.g. llava-next-video)
Description - more verbose

Fixes #33504 by adding a separate base class for video processors, covering only the ones that come from VLMs, because video-only models like ViViT don't need both an image and a video processor. We can add it to other video-only models later, if needed.

The idea is to keep video processors as children of the image processor mixin, but the processor applies each transform to a whole video as a 4D array instead of looping over every frame. The only transform that still works on a per-frame basis is resize; it could maybe be rewritten in torch, but it's not worth the effort. In terms of speed, the new slow video processor performs the same as the old image processor and shows some speedup at larger batch sizes. The advantage here is mostly readability, cleaner code, and a separate API for videos.
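To make the 4D idea concrete, here is a minimal numpy sketch (my own illustration, not the PR's code; the rescale transform and the shapes are assumptions):

```python
import numpy as np

# A video as a single 4D array: (num_frames, height, width, channels).
video = np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8)

# Per-frame style: loop over frames and transform each one as an image.
rescaled_per_frame = np.stack([frame * (1 / 255.0) for frame in video])

# Whole-video style: apply the same transform once over the 4D array.
# Elementwise numpy/torch ops broadcast over the leading frame dimension for free.
rescaled_whole = video * (1 / 255.0)

assert np.allclose(rescaled_per_frame, rescaled_whole)
```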

For the fast video processors, it is basically the same as for images, except for the extra dimension we have for timeframes. Plus, I added convert_rgb in np and torch. Also, we use VideosKwargs instead of having to add a new TypedDict. See my comments below, in the files changed 👇🏻
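For a rough idea of what an array-based convert_rgb can look like, here is a numpy sketch (my own, not the PR's implementation; it only covers the grayscale and RGBA cases):

```python
import numpy as np

def convert_to_rgb(video: np.ndarray) -> np.ndarray:
    """Convert a (frames, height, width, channels) uint8 video to 3-channel RGB."""
    channels = video.shape[-1]
    if channels == 3:  # already RGB
        return video
    if channels == 1:  # grayscale: repeat the single channel
        return np.repeat(video, 3, axis=-1)
    if channels == 4:  # RGBA: composite over a white background
        rgb = video[..., :3].astype(np.float32)
        alpha = video[..., 3:].astype(np.float32) / 255.0
        return (rgb * alpha + 255.0 * (1.0 - alpha)).astype(video.dtype)
    raise ValueError(f"Unsupported number of channels: {channels}")
```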

What was done?

  • Separate VideoProcessorBase and VideoProcessorBaseFast are added; I recommend starting the review from there
  • Video processors have an autoclass mapping to load with AutoVideoProcessor
  • The image processor config is now called image_preprocessor_config.json. We keep BC by first trying to load from the new image_preprocessor_config.json if it exists, otherwise falling back to preprocessor_config.json with a deprecation warning (see the sketch after this list)
  • Most models already had a separate video processor, so nothing changes for them except for the parent class
  • Models where one preprocessor accepted both images and videos will continue preprocessing both modalities, but raise a deprecation warning (Qwen2VL and VideoLlava)
  • Tests: many tests changed and added 🛠️
  • The image/video/general processing tests were run on all models touched by this PR, including one slow generation test per model to make sure official checkpoints are loadable
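The BC fallback for config file names could look roughly like this (a hypothetical standalone helper for illustration; the real resolution logic lives inside from_pretrained):

```python
import os
import warnings

def resolve_image_processor_config(model_dir: str) -> str:
    """Prefer the new config file name; fall back to the legacy one with a warning."""
    new_path = os.path.join(model_dir, "image_preprocessor_config.json")
    legacy_path = os.path.join(model_dir, "preprocessor_config.json")
    if os.path.isfile(new_path):
        return new_path
    if os.path.isfile(legacy_path):
        warnings.warn(
            "Loading from legacy `preprocessor_config.json`; re-save the "
            "processor to write `image_preprocessor_config.json`.",
            FutureWarning,
        )
        return legacy_path
    raise FileNotFoundError(f"No image processor config found in {model_dir}")
```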
Usage examples

How it works when loading (the same goes for loading via the model-specific classes):

AutoProcessor.from_pretrained(model_id)
AutoImageProcessor.from_pretrained(model_id, use_fast=True/False)
AutoVideoProcessor.from_pretrained(model_id, use_fast=True/False, device="cpu"/"cuda")
  1. If the model was not updated and has only the old preprocessor_config.json: the processor grabs the old preprocessor_config and uses it for both the video and the image processing classes, with a shared set of args. A warning is emitted!

  2. If the model was updated and has image_preprocessor_config.json and video_preprocessor_config.json: the processor grabs those files and loads each preprocessor with its own set of args

  3. If the model has all types of config files and didn't delete the old preprocessor_config.json: priority is given to (image/video)_preprocessor_config.json, with a different set of args for each

When saving the model back, we save to the image_preprocessor_config.json and video_preprocessor_config.json files.
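For example, a save/load roundtrip following the file names described above (the checkpoint id is just an example):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")

# Saving writes one config file per modality, per the description above:
#   ./my-model/image_preprocessor_config.json
#   ./my-model/video_preprocessor_config.json
processor.save_pretrained("./my-model")

# Loading back picks up the new files, each preprocessor with its own args.
reloaded = AutoProcessor.from_pretrained("./my-model")
```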

Benchmark results on LLaVA-OneVision

Benchmarks over different batch sizes and numbers of frames sampled per video, also across devices and input types: numpy arrays, torch tensors, or lists of PIL frames.
(benchmark results attached as an image)
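A minimal timing harness along those lines (my own sketch; the checkpoint id and the exact call signature are assumptions):

```python
import time
import numpy as np
from transformers import AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")

for batch_size in (1, 4, 8):
    for num_frames in (8, 16, 32):
        # Random uint8 frames stand in for decoded video.
        videos = [
            np.random.randint(0, 256, (num_frames, 360, 640, 3), dtype=np.uint8)
            for _ in range(batch_size)
        ]
        start = time.perf_counter()
        processor(videos=videos, return_tensors="pt")
        print(f"batch={batch_size} frames={num_frames}: {time.perf_counter() - start:.3f}s")
```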

Next steps:

  • Next step is to add tests for video inputs in test_processing_common.py (there was a PR somewhere, but it probably went stale)
  • We might also need to rename "feature extractor name" to audio_preprocessor_config, just for consistency and easier navigation in hub configs
  • The frame sampling function we just added for video LLMs can now become a method of the video processor, so each video model has its own sampling pre-defined, which imo eases things for instruct inference
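A sketch of what such a per-model sampling method could look like (hypothetical; sample_frames is illustrative and not part of this PR):

```python
import numpy as np

class MyVideoProcessor:
    """Hypothetical processor with a model-specific sampling policy."""

    num_frames: int = 8  # pre-defined per model/checkpoint

    def sample_frames(self, video: np.ndarray) -> np.ndarray:
        """Uniformly sample `self.num_frames` frames from a (T, H, W, C) video."""
        total_frames = video.shape[0]
        indices = np.linspace(0, total_frames - 1, self.num_frames).round().astype(int)
        return video[indices]

video = np.random.randint(0, 256, (64, 360, 640, 3), dtype=np.uint8)
sampled = MyVideoProcessor().sample_frames(video)  # (8, 360, 640, 3)
```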

@zucchini-nlp zucchini-nlp changed the title Video processors as a separate class 🔴 Video processors as a separate class Dec 11, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp
Member Author

zucchini-nlp commented Jan 27, 2025

This one is ready for review; I guess the first review will be from @qubvel. The failing tests are unrelated and very flaky.

@zucchini-nlp zucchini-nlp requested a review from qubvel January 27, 2025 10:57
Contributor

@qubvel qubvel left a comment

Great work, thanks for working on this feature! I did a first, not very thorough, pass; see the comments below

@ArthurZucker ArthurZucker self-requested a review May 8, 2025 09:30
Collaborator

@ArthurZucker ArthurZucker left a comment

Very nice! I am just wondering if, for now and for the models where it makes sense, we can't just inherit from the image processor, and from the base video processor when we can call the same preprocessing functions!

elif is_remote_url(pretrained_model_name_or_path):
    video_processor_file = pretrained_model_name_or_path
    resolved_video_processor_file = download_url(pretrained_model_name_or_path)
else:
Collaborator

A whole bunch of this is re-used from other processors; I really don't mind some form of inheritance, or just using the same function if we can!

@ArthurZucker
Collaborator

TL;DR from internal comms:

long term we want to:

  • inherit from ImageProcessor and BaseVideoProcessor,
  • prepare the videos to call vmap and use the inherited ImageProcessor-defined functions
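A rough sketch of that direction (illustrative only: the class and method names are assumptions, and torch.vmap maps an inherited per-image transform over the frame dimension):

```python
import torch

class BaseImageProcessorFast:
    image_mean = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)
    image_std = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)

    def rescale_and_normalize(self, image: torch.Tensor) -> torch.Tensor:
        # image: (C, H, W) uint8 -> rescaled and normalized float
        image = image.float() / 255.0
        return (image - self.image_mean) / self.image_std

class BaseVideoProcessor(BaseImageProcessorFast):
    def process_video(self, video: torch.Tensor) -> torch.Tensor:
        # video: (T, C, H, W); vmap maps the inherited per-image
        # function over the frame dimension without a Python loop.
        return torch.vmap(self.rescale_and_normalize)(video)

video = torch.randint(0, 256, (8, 3, 224, 224), dtype=torch.uint8)
out = BaseVideoProcessor().process_video(video)  # (8, 3, 224, 224) float32
```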

@zucchini-nlp zucchini-nlp enabled auto-merge (squash) May 8, 2025 15:31
@zucchini-nlp zucchini-nlp disabled auto-merge May 8, 2025 15:31
@zucchini-nlp zucchini-nlp merged commit a31fa21 into huggingface:main May 12, 2025
20 checks passed
zucchini-nlp added a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
* initial design

* update all video processors

* add tests

* need to add qwen2-vl (not tested yet)

* add qwen2-vl in auto map

* fix copies

* isort

* resolve confilicts kinda

* nit:

* qwen2-vl is happy now

* qwen2-5 happy

* other models are happy

* fix copies

* fix tests

* add docs

* CI green now?

* add more tests

* even more changes + tests

* doc builder fail

* nit

* Update src/transformers/models/auto/processing_auto.py

Co-authored-by: Pavel Iakubovskii <[email protected]>

* small update

* imports correctly

* dump, otherwise this is getting unmanagebale T-T

* dump

* update

* another update

* update

* tests

* move

* modular

* docs

* test

* another update

* init

* remove flakiness in tests

* fixup

* clean up and remove commented lines

* docs

* skip this one!

* last fix after rebasing

* run fixup

* delete slow files

* remove unnecessary tests + clean up a bit

* small fixes

* fix tests

* more updates

* docs

* fix tests

* update

* style

* fix qwen2-5-vl

* fixup

* fixup

* unflatten batch when preparing

* dump, come back soon

* add docs and fix some tests

* how to guard this with new dummies?

* chat templates in qwen

* address some comments

* remove `Fast` suffix

* fixup

* oops should be imported from transforms

* typo in requires dummies

* new model added with video support

* fixup once more

* last fixup I hope

* revert image processor name + comments

* oh, this is why fetch test is failing

* fix tests

* fix more tests

* fixup

* add new models: internvl, smolvlm

* update docs

* imprt once

* fix failing tests

* do we need to guard it here again, why?

* new model was added, update it

* remove testcase from tester

* fix tests

* make style

* not related CI fail, lets' just fix here

* mark flaky for now, filas 15 out of 100

* style

* maybe we can do this way?

* don't download images in setup class

---------

Co-authored-by: Pavel Iakubovskii <[email protected]>
@yeliudev yeliudev mentioned this pull request May 26, 2025