Video Processor as a separate class #33504

Open · 6 tasks
zucchini-nlp opened this issue Sep 16, 2024 · 3 comments · May be fixed by #35206

@zucchini-nlp (Member)

Feature request

Since we now have more and more VLMs that support both images and videos, and videos are not always processed the same way as images, I want to add a VideoProcessor class that inherits from ImageProcessingMixin. That way we can have two separate classes for processing visuals, each with its own set of attributes and methods. We can also save a different config for each, to avoid issues such as #33484. The VideoProcessor will mainly reuse the same transform methods as the slow image processors, iterating over each frame and stacking the results. Some additional helper functions can be added, such as load_video and make_list_of_videos. The main input name will be videos and the output variable name will be pixel_values_videos.
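A minimal sketch of the shape I have in mind (the class body, the frame-level helper, and the list-handling logic below are all illustrative, not a final API):

```python
import numpy as np
from transformers import BatchFeature
from transformers.image_processing_utils import ImageProcessingMixin


class VideoProcessor(ImageProcessingMixin):
    # Proposed output key; the main input argument is `videos`.
    model_input_names = ["pixel_values_videos"]

    def __call__(self, videos, **kwargs):
        # Stand-in for a `make_list_of_videos` helper: accept a single
        # video (a sequence of HxWxC frames) or a list of videos.
        if isinstance(videos, np.ndarray) and videos.ndim == 4:
            videos = [videos]
        processed = []
        for video in videos:
            # Reuse the slow per-frame image transforms, then stack the
            # frames back into one (num_frames, H, W, C) array.
            frames = [self._preprocess_frame(np.asarray(f), **kwargs) for f in video]
            processed.append(np.stack(frames))
        return BatchFeature(data={"pixel_values_videos": processed})

    def _preprocess_frame(self, frame, **kwargs):
        # Placeholder for the resize/rescale/normalize pipeline shared
        # with the matching image processor.
        return frame
```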

For load_video we can probably rely on av, but I find it very slow compared to other video decoders. I'll try to put together a small benchmark comparing them; unfortunately, decord can't be used because it had problems with models on CUDA.
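For reference, a minimal PyAV-based sketch of such a load_video helper, assuming uniform sampling of num_frames frames (the sampling policy is just one possible choice):

```python
import av
import numpy as np


def load_video(path, num_frames=8):
    """Decode a video with PyAV and return `num_frames` evenly spaced RGB frames."""
    container = av.open(path)
    # Decode every frame of the first video stream into HxWxC uint8 arrays.
    frames = [
        frame.to_ndarray(format="rgb24")
        for frame in container.decode(container.streams.video[0])
    ]
    container.close()
    indices = np.linspace(0, len(frames) - 1, num=num_frames).astype(int)
    return np.stack([frames[i] for i in indices])  # (num_frames, H, W, 3)
```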

In the long term we might consider adding video transforms where each video is transformed in one call instead of frame by frame, similar to the fast image processing with torchvision.
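As a rough illustration, torchvision's v2 transforms already accept a whole (num_frames, C, H, W) tensor in one call; the target size and normalization constants below are example values only:

```python
import torch
from torchvision.transforms import v2

# One transform call for the whole clip -- no Python loop over frames.
transform = v2.Compose([
    v2.Resize((336, 336), antialias=True),
    v2.ToDtype(torch.float32, scale=True),  # uint8 [0, 255] -> float [0, 1]
    v2.Normalize(mean=[0.481, 0.458, 0.408], std=[0.269, 0.261, 0.276]),
])

video = torch.randint(0, 256, (16, 3, 720, 1280), dtype=torch.uint8)  # dummy clip
pixel_values_videos = transform(video)  # (16, 3, 336, 336)
```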

To Do:

  • Add the VideoProcessor class and integrate it with llava-next-video, which is one of the models with different processing for images and videos.

  • After the changes are approved and merged, the following models will be easy to modify:

    • Video-LLaVa
    • Qwen2-VL
    • LLaVA-OneVision
  • Instructblip-Video might need deprecation, as it currently accepts images as the main arg and returns pixel_values. TBH, it is a video-only model, so we can disregard changing it, the same way we won't touch ViViT and other video-only models.

Motivation

Easier integration of multimodal LLMs

Your contribution

@amyeroberts WDYT about this suggestion? Would love to hear your opinion 🤗

@amyeroberts (Collaborator)

Yes - this sounds like a great idea!

Big +1 to the separate class, and to using different video decoders if possible.

@gerrylwk

I think it would be good to have a 'multi-image' option for video too. For example, when streaming a video, there should be no need to save the frames into a video file before using them for inference.
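For example, something along these lines, where OpenCV is just one way to grab frames and the final video_processor call assumes the API proposed in this issue:

```python
import cv2
import numpy as np

# Grab frames from a live stream and keep them in memory -- no
# intermediate video file on disk.
cap = cv2.VideoCapture(0)  # e.g. a webcam or RTSP stream
frames = []
for _ in range(16):
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
cap.release()

video = np.stack(frames)  # (num_frames, H, W, 3)
# inputs = video_processor(videos=video, return_tensors="pt")  # proposed API
```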

@zucchini-nlp (Member, Author)

@gerrylwk I'm not sure any of the supported video LLMs currently support streaming video input, but the idea is cool. If you have any model release in mind with such a feature, feel free to open a feature request issue.

@zucchini-nlp linked a pull request on Dec 11, 2024 that will close this issue