Feature request
Since we have more and more VLMs that support both images and video, and videos are not always processed the same way as images, I want to add a `VideoProcessor` class that inherits from `ImageProcessingMixin`. That way we can have two separate classes for processing visuals, each with its own set of attributes and methods. We can also save different configs for both, to avoid issues like #33484. The `VideoProcessor` will mainly reuse the same transform methods as the slow image processors, iterating over each frame and stacking the results. Some additional helper functions can be added, like `load_video` and `make_list_of_videos`. The main input name will be `videos` and the output variable name is `pixel_values_videos`.
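A rough sketch of what this could look like (the `_preprocess_frame` hook and the exact method names are assumptions for illustration, not a final API):

```python
import numpy as np

from transformers import BatchFeature, ImageProcessingMixin


class VideoProcessor(ImageProcessingMixin):
    """Sketch: process videos by reusing the per-frame transforms of slow image processors."""

    model_input_names = ["pixel_values_videos"]

    def __call__(self, videos, **kwargs):
        return self.preprocess(videos, **kwargs)

    def preprocess(self, videos, return_tensors=None, **kwargs):
        # `videos` is a list of videos, each video being a list of frames (np.ndarray).
        processed = []
        for video in videos:
            # Apply the usual per-frame transforms, then stack back into (num_frames, C, H, W).
            frames = [self._preprocess_frame(frame, **kwargs) for frame in video]
            processed.append(np.stack(frames))
        return BatchFeature(data={"pixel_values_videos": processed}, tensor_type=return_tensors)

    def _preprocess_frame(self, frame, **kwargs):
        # Hypothetical hook: resize / rescale / normalize, exactly as in the slow image processors.
        raise NotImplementedError
```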
For `load_video` we can probably rely on `av`, but I find it super slow compared to other video decoders. I'll try to put together a small comparison benchmark for that; unfortunately `decord` can't be used as it had problems with models on CUDA.
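For reference, a minimal `load_video` built on `av` could look roughly like this (uniform frame sampling and the `num_frames` argument are illustrative assumptions):

```python
import av
import numpy as np


def load_video(path: str, num_frames: int = 8) -> np.ndarray:
    """Decode a video with PyAV and uniformly sample `num_frames` RGB frames."""
    container = av.open(path)
    stream = container.streams.video[0]
    # Note: `stream.frames` can be 0 for some containers; a real implementation
    # would fall back to counting frames or using duration * fps.
    total_frames = stream.frames

    # Indices of the frames to keep, spread uniformly over the whole clip.
    keep = set(np.linspace(0, total_frames - 1, num_frames, dtype=int).tolist())

    frames = []
    for i, frame in enumerate(container.decode(video=0)):
        if i in keep:
            frames.append(frame.to_ndarray(format="rgb24"))
    container.close()
    return np.stack(frames)  # (num_frames, height, width, 3)
```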
In the long term we might consider adding video transforms where each video is transformed in one call, instead of one frame at a time, similar to the fast image processing with torchvision.
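As an illustration of that direction, torchvision's `transforms.v2` API already handles tensors with leading batch dimensions, so a whole clip can be transformed in one call (the shapes and transform parameters below are only an example):

```python
import torch
from torchvision.transforms import v2

# A clip as a single (num_frames, channels, height, width) tensor.
clip = torch.randint(0, 256, (16, 3, 360, 640), dtype=torch.uint8)

transforms = v2.Compose([
    v2.Resize((336, 336), antialias=True),
    v2.ToDtype(torch.float32, scale=True),   # uint8 [0, 255] -> float [0, 1]
    v2.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
                 std=[0.26862954, 0.26130258, 0.27577711]),
])

# One call transforms every frame at once instead of looping in Python.
pixel_values_videos = transforms(clip)   # (16, 3, 336, 336)
```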
To Do:
- Add the `VideoProcessor` class and integrate it with llava-next-video, which is one of the models with different processing for images and videos (see the usage sketch after this list).
- After the changes are approved and merged, the following models will be easy to modify:
  - Video-LLaVa
  - Qwen2-VL
  - LLaVA-OneVision
- Instructblip-Video might need deprecation, as it currently accepts `images` as the main argument and returns `pixel_values`. TBH, it is a video-only model, so we can disregard changing it, the same way we won't touch VIVIT and other video-only models.
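For illustration, once the class is wired into llava-next-video, usage could look roughly like this (the checkpoint name, prompt format, and output shape are assumptions for the sketch, not a committed API):

```python
from transformers import LlavaNextVideoProcessor

# Hypothetical: the processor delegates video inputs to the new VideoProcessor
# and exposes its output under `pixel_values_videos`.
processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")

video = load_video("my_clip.mp4", num_frames=8)  # the helper sketched above
inputs = processor(
    text="USER: <video> What happens in this video? ASSISTANT:",
    videos=video,
    return_tensors="pt",
)

print(inputs["pixel_values_videos"].shape)  # e.g. (1, 8, 3, 336, 336)
```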
Motivation
Easier integration of multimodal LLMs
Your contribution
@amyeroberts WDYT about this suggestion? Would love to hear your opinion 🤗
I think it would be good to have a 'multi-image' option for video too, e.g. when streaming a video there's no need to save the frames into a video file before using them for inference.
@gerrylwk I'm not sure any of the supported video LLMs currently support streaming video input, but the idea is cool. If you have any model release in mind with such a feature, feel free to open a feature request issue.