Feature request
Since we have more and more VLMs that support both images and video, and videos are not always processed the same way as images, I want to add a `VideoProcessor` class that inherits from `ImageProcessingMixin`. That way we can have two separate classes for processing visuals, each with its own set of attributes and methods. We can also save different configs for both, to avoid issues like #33484. The `VideoProcessor` will mainly reuse the same transform methods as the slow image processors, iterating over each frame and stacking the results. Some additional helper functions can be added, like `load_video` and `make_list_of_videos`. The main input name will be `videos` and the output variable name is `pixel_values_videos`.
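A rough sketch of what this could look like (the `_preprocess_frame` hook and the exact method names are assumptions for illustration, not a final API):

```python
import numpy as np

from transformers import BatchFeature, ImageProcessingMixin


class VideoProcessor(ImageProcessingMixin):
    """Sketch: process videos by reusing the per-frame transforms of slow image processors."""

    model_input_names = ["pixel_values_videos"]

    def __call__(self, videos, **kwargs):
        return self.preprocess(videos, **kwargs)

    def preprocess(self, videos, return_tensors=None, **kwargs):
        # `videos` is a list of videos, each video being a list of frames (np.ndarray).
        processed = []
        for video in videos:
            # Apply the usual per-frame transforms, then stack back into (num_frames, C, H, W).
            frames = [self._preprocess_frame(frame, **kwargs) for frame in video]
            processed.append(np.stack(frames))
        return BatchFeature(data={"pixel_values_videos": processed}, tensor_type=return_tensors)

    def _preprocess_frame(self, frame, **kwargs):
        # Hypothetical hook: resize / rescale / normalize, exactly as in the slow image processors.
        raise NotImplementedError
```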
For `load_video` we can probably rely on `av`, but I find it super slow compared to other video decoders. I'll try to put together a small comparison benchmark for that; unfortunately `decord` can't be used as it had problems with models on CUDA.
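For reference, a minimal `load_video` built on `av` could look roughly like this (uniform frame sampling and the `num_frames` argument are illustrative assumptions):

```python
import av
import numpy as np


def load_video(path: str, num_frames: int = 8) -> np.ndarray:
    """Decode a video with PyAV and uniformly sample `num_frames` RGB frames."""
    container = av.open(path)
    stream = container.streams.video[0]
    # Note: `stream.frames` can be 0 for some containers; a real implementation
    # would fall back to counting frames or using duration * fps.
    total_frames = stream.frames

    # Indices of the frames to keep, spread uniformly over the whole clip.
    keep = set(np.linspace(0, total_frames - 1, num_frames, dtype=int).tolist())

    frames = []
    for i, frame in enumerate(container.decode(video=0)):
        if i in keep:
            frames.append(frame.to_ndarray(format="rgb24"))
    container.close()
    return np.stack(frames)  # (num_frames, height, width, 3)
```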
In the long term we might consider adding video transforms where each video is transformed in one call, instead of one frame at a time, similar to the fast image processing with torchvision.
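As an illustration of that direction, torchvision's `transforms.v2` API already handles tensors with leading batch dimensions, so a whole clip can be transformed in one call (the shapes and transform parameters below are only an example):

```python
import torch
from torchvision.transforms import v2

# A clip as a single (num_frames, channels, height, width) tensor.
clip = torch.randint(0, 256, (16, 3, 360, 640), dtype=torch.uint8)

transforms = v2.Compose([
    v2.Resize((336, 336), antialias=True),
    v2.ToDtype(torch.float32, scale=True),   # uint8 [0, 255] -> float [0, 1]
    v2.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
                 std=[0.26862954, 0.26130258, 0.27577711]),
])

# One call transforms every frame at once instead of looping in Python.
pixel_values_videos = transforms(clip)   # (16, 3, 336, 336)
```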
To Do:
- Add the `VideoProcessor` class and integrate it with llava-next-video, which is one of the models with different processing for images and videos (see the usage sketch after this list).
- After the changes are approved and merged, the following models will be easy to modify:
  - Video-LLaVa
  - Qwen2-VL
  - LLaVA-OneVision
- Instructblip-Video might need deprecation, as it currently accepts `images` as the main argument and returns `pixel_values`. TBH, it is a video-only model, so we can disregard changing it, the same way we won't touch VIVIT and other video-only models.
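For illustration, once the class is wired into llava-next-video, usage could look roughly like this (the checkpoint name, prompt format, and output shape are assumptions for the sketch, not a committed API):

```python
from transformers import LlavaNextVideoProcessor

# Hypothetical: the processor delegates video inputs to the new VideoProcessor
# and exposes its output under `pixel_values_videos`.
processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")

video = load_video("my_clip.mp4", num_frames=8)  # the helper sketched above
inputs = processor(
    text="USER: <video> What happens in this video? ASSISTANT:",
    videos=video,
    return_tensors="pt",
)

print(inputs["pixel_values_videos"].shape)  # e.g. (1, 8, 3, 336, 336)
```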
Motivation
Easier integration of multimodal LLMs
Your contribution
@amyeroberts WDYT about this suggestion? Would love to hear your opinion 🤗
I think it would be good to have a 'multi-image' option for video too, e.g. when streaming a video there's no need to save the frames into a video file before using them for inference.
@gerrylwk I'm not sure any of the supported video LLMs currently support streaming video input, but the idea is cool. If you have any model release in mind with such a feature, feel free to open a feature request issue.