The Video-Question-Answering-Resources repository is a curated guide for beginners and researchers interested in the Video Question Answering (VQA) field. It provides an organized collection of the most relevant papers, models, datasets, and additional resources to help users understand and contribute to this evolving area. The repository focuses on the intersection of computer vision and natural language processing, particularly how video data can be used to answer complex questions, offering a range of materials from introductory guides to advanced research. (Last Update on 09/22/2025)
Video question answering (VideoQA), LLMs, long video understanding, spatial reasoning, temporal reasoning, multi-choice QA, open-ended QA
Bharatesh Chakravarthi, Ph.D
Joseph Raj Vishal
- Beginners Guide to Video-Question-Answering
- Publications
- Survey/Review Papers
- Conference/Journal Papers
- Datasets
- Models
- Additional Resources
- Answering Questions from YouTube Videos with OpenAI Whisper and GPT-4 (Medium article)
- Try a quick example of how to use LLMs for Video Question Answering here (check Additional Resources for an API key); a minimal sketch of the pipeline is shown below
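As a quick illustration of the transcribe-then-ask workflow described in the Medium article above, here is a minimal sketch. It assumes the OpenAI Python SDK (`pip install openai`), an `OPENAI_API_KEY` environment variable, and an audio track already extracted from the video (for example with `ffmpeg -i clip.mp4 clip.mp3`); the file name and question are placeholders.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Transcribe the video's audio track with Whisper.
with open("clip.mp3", "rb") as audio_file:  # placeholder path
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# 2) Ask a question against the transcript with a GPT-4-class chat model.
question = "What are the main points the speaker makes?"
response = client.chat.completions.create(
    model="gpt-4o",  # any GPT-4-class chat model can be substituted here
    messages=[
        {"role": "system", "content": "Answer questions using only the provided video transcript."},
        {"role": "user", "content": f"Transcript:\n{transcript.text}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```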
- Video Understanding with Large Language Models: A Survey (2025) [Paper]
- VideoQA in the Era of LLMs: An Empirical Study (2025) [Paper]
- A Survey on Generative AI and LLM for Video Generation, Understanding, and Streaming (2024) [Paper]
- Video Question-Answering Techniques, Benchmark Datasets and Evaluation Metrics Leveraging Video Captioning: A Comprehensive Survey (2021) [Paper]
- Video Question Answering: a Survey of Models and Datasets (2021) [Paper]
- A survey on VQA: Datasets and approaches (2020, ITCA) [Paper]
- RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding from Social Video Narratives [Paper]
- MovieChat+: Question-aware Sparse Memory for Long Video Question Answering [Paper]
- Object-centric Video Question Answering with Visual Grounding and Referring [Paper]
- VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering [Paper]
- LeAdQA: LLM-Driven Context-Aware Temporal Grounding for Video Question Answering [Paper]
- Enhancing Long Video Question Answering with Scene-Localized Frame Grouping [Paper]
- VisiQuest: Video-Question Answering with Advanced Vision-Language AI [Paper]
- Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning (CVPR) [Paper]
- Question-Answering Dense Video Events (ACM) [Paper]
- CogStream: Context-guided Streaming Video Question Answering [Paper]
- VRAG: Retrieval-Augmented Video Question Answering for Long-Form Videos [Paper]
- Advancing Egocentric Video Question Answering with Multimodal Large Language Models [Paper]
- Neuro Symbolic Knowledge Reasoning for Procedural Video Question Answering [Paper]
- MUPA: Towards Multi-Path Agentic Reasoning for Grounded Video Question Answering [Paper]
- MELA: Multi-Event Localization Answering Framework for Video Question Answering [Paper]
- HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding [Paper]
- BIMBA: Selective-Scan Compression for Long-Range Video Question Answering [Paper]
- Semantic Distance-Aware Cross-Modal Attention Mechanism for Video Question Answering [Paper]
- Empowering LLMs with pseudo-untrimmed videos for audio-visual temporal understanding [Paper]
- Leveraging LLMs with Iterative Loop Structure for Enhanced Social Intelligence in Video Question Answering [Paper]
- Agentic Keyframe Search for Video Question Answering [Paper]
- Admitting Ignorance Helps the Video Question Answering Models to Answer [Paper]
- VQALS: A Video Question Answering Method in Low-Light Scenes Based on Illumination Correction and Feature Enhancement [Paper]
- Keyframe-oriented vision token pruning: Enhancing efficiency of large vision language models on long-form video processing [Paper]
- TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes [Paper]
- EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering [Paper]
- ReasVQA: Advancing VideoQA with Imperfect Reasoning Process [Paper]
- Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs [Paper]
- Unhackable temporal rewarding for scalable video MLLMs [Paper]
- Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos [Paper]
- VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding [Paper]
- VideoQA-SC: Adaptive Semantic Communication for Video Question Answering [Paper]
- Granularity-Adaptive Spatial Evidence Tokenization for Video Question Answering [Paper]
- Towards Fine-Grained Video Question Answering [Paper]
- Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models [Paper]
- REVEAL: Relation-based Video Representation Learning for Video-Question-Answering [Paper]
- VideoQA-TA: Temporal-Aware Multi-Modal Video Question Answering [Paper]
- VideoMultiAgents: A Multi-Agent Framework for Video Question Answering [Paper]
- Open-Ended and Knowledge-Intensive Video Question Answering [Paper]
- A CLIP-based Video Question Answering framework with Explainable AI [Paper]
- Dam: Dynamic Adapter Merging for Continual Video QA Learning [Paper]
- TimeLogic: A Temporal Logic Benchmark for Video QA [Paper]
- Streaming Video Question-Answering with In-context Video KV-Cache Retrieval [Paper]
- SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding [Paper]
- LongViTU: Instruction Tuning for Long-Form Video Understanding [Paper]
- Cross-modal Causal Relation Alignment for Video Question Grounding [Paper]
- (Our Paper) Eyes on the Road: State-of-the-art Video Question Answering Models Assessment for Traffic Monitoring Tasks [Paper]
- AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering [Paper]
- An Image Grid Can Be Worth a Video: Zero-Shot Video Question Answering Using a VLM [Paper]
- Language-aware Visual Semantic Distillation for Video Question Answering (CVPR) [Paper]
- Pre-trained Bidirectional Dynamic Memory Network For Long Video Question Answering (CVPR) [Paper]
- Enhancing machine vision: the impact of a novel innovative technology on video question-answering [Paper]
- LONGVIDEOBENCH: A Benchmark for Long-context Interleaved Video-Language Understanding [Paper]
- TimeCraft: Navigate Weakly-Supervised Temporal Grounded Video Question Answering via Bi-directional Reasoning [Paper]
- MoReVQA: Exploring Modular Reasoning Models for Video Question Answering (CVPR) [Paper]
- Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question Answering (CVPR) [Paper]
- Can I Trust Your Answer? Visually Grounded Video Question Answering (CVPR) [Paper]
- MVBench: A Comprehensive Multi-modal Video Understanding Benchmark (CVPR) [Paper]
- Event Graph Guided Compositional Spatial-Temporal Reasoning for Video Question Answering [Paper]
- LVBench: An Extreme Long Video Understanding Benchmark [Paper]
- Short Film Dataset (SFD): A Benchmark for Story-Level Video Understanding [Paper]
- Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input [Paper]
- CinePile: A Long Video Question Answering Dataset and Benchmark [Paper]
- Video-Language Alignment via Spatio-Temporal Graph Transformer [Paper]
- Neural-Symbolic VideoQA: Learning Compositional Spatio-Temporal Reasoning for Real-world Video Question Answering [Paper]
- VideoChat: Chat-Centric Video Understanding [Paper]
- LITA: Language Instructed Temporal-Localization Assistant [Paper]
- Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports [Paper]
- Videoagent: Long-form Video Understanding with Large Language Model as Agent [Paper]
- AMEGO: Active Memory from Long EGOcentric Videos [Paper]
- Video Instruction Tuning With Synthetic Data [Paper]
- Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving [Paper]
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis [Paper]
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites [Paper]
- Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution [Paper]
- VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs [Paper]
- TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations [Paper]
- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context [Paper]
- VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models [Paper]
- ViLA: Efficient video-language alignment for video question answering(ECCV24) [Paper]
- STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering [Paper]
- STAR: A Benchmark for Situated Reasoning in Real-World Videos [Paper]
- LongVLM: Efficient Long Video Understanding via Large Language Models [Paper]
- FunQA: Towards Surprising Video Comprehension [Paper]
- Locate Before Answering: Answer Guided Question Localization for Video Question Answering [Paper]
- Large Language Models are Temporal and Causal Reasoners for Video Question Answering [Paper]
- Glance and Focus: Memory Prompting for Multi-Event Video Question Answering (NeurIPS) [Paper]
- Traffic-Domain Video Question Answering with Automatic Captioning [Paper]
- Zero-Shot Video Question Answering with Procedural Programs [Paper]
- Learning Situation Hyper-Graphs for Video Question Answering (CVPR) [Paper]
- Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering (CVPR) [Paper]
- Open-Vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models (CVPR) [Paper]
- ANetQA: A Large-Scale Benchmark for Fine-Grained Compositional Reasoning Over Untrimmed Videos (CVPR) [Paper]
- Mist: Multi-modal iterative spatial-temporal transformer for long-form video question answering (CVPR) [Paper]
- Discovering Spatio-Temporal Rationales for Video Question Answering (ICCV) [Paper]
- Egoschema: A Diagnostic Benchmark for Very Long-Form Video Language Understanding (NeurIPS) [Paper]
- Visual Instruction Tuning (NeurIPS) [Paper]
- A Simple LLM Framework for Long-Range Video Question-Answering (Preprint) [Paper]
- A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [Paper]
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [Paper]
- Video Question Answering Using CLIP-Guided Visual-Text Attention [Paper]
- Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data [Paper]
- Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization [Paper]
- Multi-Modal Correlated Network with Emotional Reasoning Knowledge for Social Intelligence Question-Answering [Paper]
- A Dataset for Medical Instructional Video Classification and Question Answering [Paper]
- Invariant Grounding for Video Question Answering (CVPR) [Paper]
- Video Question Answering With Prior Knowledge and Object-Sensitive Learning [Paper]
- Video Question Answering with Iterative Video-Text Co-tokenization [Paper]
- (2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering (AAAI) [Paper]
- ERM: Energy-Based Refined-Attention Mechanism for Video Question Answering [Paper]
- CRIPP-VQA: Counterfactual Reasoning about Implicit Physical Properties via Video Question Answering [Paper]
- Measuring Compositional Consistency for Video Question Answering (CVPR) [Paper]
- From Representation to Reasoning: Towards Both Evidence and Commonsense Reasoning for Video Question Answering (CVPR) [Paper]
- Zero-Shot Video Question Answering via Frozen Bidirectional Language Models (NeurIPS) [Paper]
- Dynamic Spatio-Temporal Modular Network for Video Question Answering [Paper]
- Ego4D: Around the World in 3,000 Hours of Egocentric Video [Paper]
- Flamingo: A Visual Language Model for Few-Shot Learning [Paper]
- Saying the Unseen: Video Descriptions via Dialog Agents [Paper]
- Learning to Answer Visual Questions from Web Videos [Paper]
- In-the-Wild Video Question Answering [Paper]
- FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Framework [Paper]
- VQuAD: Video Question Answering Diagnostic Dataset [Paper]
- NEWSKVQA: Knowledge-Aware News Video Question Answering [Paper]
- Learning to Answer Questions in Dynamic Audio-Visual Scenarios (CVPR) [Paper]
- DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering [Paper]
- Bridge To Answer: Structure-Aware Graph Interaction Network for Video Question Answering (CVPR) [Paper]
- NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR) [Paper]
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling (CVPR) [Paper]
- AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning (CVPR) [Paper]
- On the Hidden Treasure of Dialog in Video Question Answering (ICCV) [Paper]
- Self-Supervised Pre-training and Contrastive Representation Learning for Multiple-choice Video QA (AAAI) [Paper]
- TruMan: Trope Understanding in Movies and Animations [Paper]
- Perceiver IO: A General Architecture for Structured Inputs & Outputs [Paper]
- VideoGPT: Video Generation using VQ-VAE and Transformers [Paper]
- CLIP4Clip: An Empirical Study of CLIP for End-to-End Video Clip Retrieval and Captioning (ACM) [Paper]
- VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding [Paper]
- Just Ask: Learning to Answer Questions from Millions of Narrated Videos [Paper]
- AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant [Paper]
- SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events. (CVPR) [Paper]
- Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments (CVPR) [Paper]
- Progressive Graph Attention Network for Video Question Answering [Paper]
- Transferring Domain-Agnostic Knowledge in Video Question Answering [Paper]
- Video Question Answering with Phrases via Semantic Roles [Paper]
- BERT Representations for Video Question Answering (WACV) [Paper]
- Hierarchical Conditional Relation Networks for Video Question Answering (CVPR) [Paper]
- Location-Aware Graph Convolutional Networks for Video Question Answering (AAAI) [Paper]
- Action-Centric Relation Transformer Network for Video Question Answering [Paper]
- Long video question answering: A Matching-guided Attention Model [Paper]
- KnowIT VQA: Answering Knowledge-Based Questions about Videos (AAAI) [Paper]
- Divide and Conquer: Question-Guided Spatio-Temporal Contextual Attention for Video Question Answering (AAAI) [Paper]
- Video Question Answering for Surveillance (TechRxiv - Not Peer Reviewed) [Paper]
- The MSR-Video to Text Dataset with Clean Annotations [Paper]
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training (ACL) [Paper]
- CLEVRER: CoLlision Events for Video REpresentation and Reasoning [Paper]
- LifeQA: A Real-Life Dataset for Video Question Answering [Paper]
- TutorialVQA: Question Answering Dataset for Tutorial Videos [Paper]
- Video Question Answering on Screencast Tutorials (ACM) [Paper]
- Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning [Paper]
- DramaQA: Character-Centered Video Story Understanding with Hierarchical QA [Paper]
- TVQA+: Spatio-Temporal Grounding for Video Question Answering [Paper]
- Frame Augmented Alternating Attention Network for Video Question Answering [Paper]
- Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering [Paper]
- EgoVQA: An Egocentric Video Question Answering Benchmark Dataset (CVPR) [Paper]
- Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering (AAAI) [Paper]
- Compositional Attention Networks with Two-Stream Fusion for Video Question Answering [Paper]
- Learning to Reason with Relational Video Representation for Question Answering [Paper]
- Video Question Answering with Spatio-Temporal Reasoning [Paper]
- Spatio-Temporal Relation Reasoning for Video Question Answering [Paper]
- Moments in Time Dataset: one million videos for event understanding [Paper]
- Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence [Paper]
- Motion-Appearance Co-Memory Networks for Video Question Answering (CVPR) [Paper]
- Multimodal Dual Attention Memory for Video Story Question Answering (CVPR) [Paper]
- TVQA: Localized, Compositional Video Question Answering [Paper]
- Explore Multi-Step Reasoning in Video Question Answering [Paper]
- Towards Automatic Learning of Procedures From Web Instructional Videos [Paper]
- Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction [Paper]
- On the effectiveness of task granularity for transfer learning [Paper]
- Unifying the Video and Question Attentions for Open-Ended Video Question Answering [Paper]
- Video Question Answering Using a Forget Memory Network [Paper]
- A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-Blank Question Answering (CVPR) [Paper]
- MarioQA: Answering Questions by Watching Gameplay (CVPR) [Paper]
- Leveraging Video Description to Learn Video Question Answering (AAAI) [Paper]
- Video Question Answering via Gradually Refined Attention over Appearance and Motion [Paper]
- DeepStory: Video Story QA by Deep Embedded Memory Networks [Paper]
- Video Question Answering via Hierarchical Spatio-Temporal Attention Networks [Paper]
- The "something something" video database for learning and evaluating visual common sense [Paper]
- Video Question Answering via Attribute-Augmented Attention Network Learning(ACM) [Paper]
- MovieQA: Understanding Stories in Movies through Question-Answering (CVPR) [Paper]
- MSR-VTT: A Large Video Description Dataset for Bridging Video and Language (CVPR) [Paper]
- TGIF: A New Dataset and Benchmark on Animated GIF Description (CVPR) [Paper]
- Uncovering Temporal Context for Video Question Answering [Paper]
| Year | Name | Key Features |
|---|---|---|
| 2025 | RoadSocial | RoadSocial is a large-scale, diverse VideoQA resource for road events, derived from social media videos spanning 14M frames and 414K social comments, resulting in a dataset with 13.2K videos, 674 unique tags, and 260K high-quality QA pairs. |
| 2025 | CrossVideoQA | CrossVideoQA is a person-centric cross-video QA benchmark combining EOSD (surveillance) and HACS (web actions). EOSD: 20 videos across 3 indoor locations over 12 dates (~450K frames), suited for multi-day behavior analysis. HACS: 50K web videos with 1.55M action clips, offering high visual and semantic diversity. |
| 2025 | LVSQA | LVSQA is a long-video, scene-level QA dataset with 100 ≥30-minute videos (from LVBench) and 500 human-refined QA pairs designed from a purely visual perspective (minimal subtitle reliance). It targets detailed understanding—scene localization and fine-grained visual reasoning in long videos—using an MLLM-assisted, expert-edited creation pipeline. |
| 2025 | DeVE-QA | DeVE-QA is a dataset featuring 78K questions about 26K events in 10.6K long videos. |
| 2025 | CogStream | CogStream features a collection of 6,361 videos from six public sources: MovieChat (40.2%), MECD (16.8%), QVhighlights (9.8%), VideoMME (6.5%), COIN (18.0%), and YouCook2 (8.6%). Scale: The final dataset comprises 1,088 high-quality videos and 59,032 QA pairs, formally split into a training set (852 videos) and a testing set (236 videos). |
| 2024 | NExT-GQA | The NExT-GQA dataset augments the NExT-QA dataset with temporal labels for Causal (“why/how”), Temporal (“before/when/after”) type questions. The annotations are done in a weakly supervised setup by labeling validation and test sets. 8,911 QA pairs from 1,557 videos are annotated with 10,531 valid temporal segments. |
| 2024 | MVBench | The MVBench dataset focuses on evaluating multi-modal video understanding by covering 20 complex video tasks that emphasize temporal reasoning, from perception to cognition. It includes over 566,747 video clips from diverse sources such as COCO, WebVid, YouCook2, and more, and covers a wide variety of task types, such as question-answering, captioning, and conversation tasks, with more than 200 multiple-choice questions generated for each temporal understanding task. |
| 2024 | LVBench | The LVBench dataset consists of 103 videos, each with a minimum duration of 30 minutes. There are 1,549 question-answer pairs associated with these videos, averaging 24 questions per hour of video content. |
| 2024 | FunQA | FunQA is a video question-answering dataset featuring 4.3K counter-intuitive and humorous video clips with 312K free-text QA pairs, an average answer length of 34.2 words, and subsets like HumorQA, CreativeQA, and MagicQA highlighting humor, creativity, and magic-themed reasoning |
| 2024 | MedVidQA | The MedVidQA dataset comprises 3,010 human-annotated instructional questions and visual answers from 900 health-related videos. It forms part of a challenge with two tasks: medical instructional question generation and Video Corpus Visual Answer Localization (VCVAL). |
| 2024 | Video-MME | Video-MME is a comprehensive benchmark designed to evaluate Multi-Modal Large Language Models (MLLMs) in video analysis. It covers short (<2 min), medium (4-15 min), and long (30-60 min) videos to test MLLMs' ability to process varying time frames, spans 6 primary domains (such as Knowledge, Film and TV, Sports, Life Records, and Multilingualism) with 30 subfields to ensure broad generalizability, and integrates video frames, subtitles, and audio. |
| 2024 | CinePile | The CinePile dataset consists of 9,396 movie clips sourced from the Movieclips YouTube channel, divided into training and testing splits of 9,248 and 148 videos, respectively. Through a question-answer generation and filtering pipeline, the dataset produced 298,888 training points and 4,940 test-set points, averaging 32 questions per video scene. |
| 2024 | LongVideoBench | LongVideoBench is a long-context video–language QA benchmark with interleaved inputs up to 1 hour, comprising 3,763 web-collected videos (with subtitles) and 6,678 human-annotated multiple-choice questions across 17 categories; it introduces referring reasoning, which requires retrieving and reasoning over detailed, temporally grounded contexts from lengthy inputs. |
| 2023 | TextVR | The TextVR dataset is a large-scale cross-modal video retrieval dataset, containing 42,200 sentence queries for 10,500 videos across eight scenario domains, including Street View, Game, Sports, Driving, Activity, TV Show, and Cooking. |
| 2023 | Social-IQ-2.0 | This dataset is from the Social-IQ 2.0 challenge and consists of 1,000 videos, 6,000 questions, and 24,000 answers. The challenge was co-hosted with the Artificial Social Intelligence Workshop at ICCV'23. |
| 2023 | VideoChat | VideoChat provides video-centric multimodal instruction data built on WebVid-10M. The project features a 100K video-instruction dataset created using human-assisted and semi-automatic annotation techniques. |
| 2022 | Ego4D | Ego4D is a comprehensive egocentric video dataset comprising 3,670 hours of daily-life activities recorded by 931 camera wearers across 74 locations in 9 countries, covering various scenarios like household, outdoor, and workplace settings. |
| 2022 | NEWSKVQA | NEWSKVQA is a new dataset of 12K news videos spanning across 156 hours with 1M multiple-choice question-answer pairs covering 8263 unique entities. |
| 2022 | MedVidQACL | This dataset consists of medical instructional videos and questions about them: 899 videos, each about 4 minutes long, with 3K manually annotated questions. |
| 2022 | FIBER | The FIBER dataset consists of 28K videos (each about 10 seconds long) with 28K questions and descriptions; it covers MCQ-type questions as well as video captioning data. |
| 2022 | Causal-VidQA | Causal-VidQA consists of 26K videos with 107K manually annotated questions. |
| 2022 | MUSIC-AVQA | MUSIC-AVQA consists of 9.3K music videos, each 60 seconds long, with 45K manually annotated QA pairs. |
| 2022 | VQuAD | VQuAD consists of 7K synthetic videos with 1.3 million questions covering spatial and temporal properties. |
| 2022 | CRIPP-VQA | CRIPP-VQA is a VideoQA dataset for counterfactual reasoning about implicit physical properties, containing 4,000 training, 500 validation, and 500 test videos, plus ≈2,000 videos for out-of-distribution evaluation. The training split includes 41,761 descriptive questions, 41,761 counterfactual questions, and 10,440 planning-based questions. |
| 2022 | STAR | STAR is a dataset for Situated Reasoning, which provides challenging question-answering tasks, symbolic situation descriptions, and logic-grounded diagnosis via real-world video situations. It consists of 4 question types, 60K situated questions, 23K situation video clips, and 140K situation hypergraphs. |
| 2022 | In-the-Wild | This dataset consists of videos recorded outdoors (survival, agriculture, natural disaster, and military). It contains 369 videos with 916 questions, each video about a minute and 10 seconds long. |
| 2022 | AGQA 2.0 | AGQA 2.0 is the successor to AGQA. It provides a benchmark of 96.85M question-answer pairs and a balanced subset of 2.27M question-answer pairs. |
| 2022 | WebVidVQA3M | WebVidVQA3M consists of 2M web videos with 3M automatically annotated questions; each video is about 4 minutes long. |
| 2021 | HowToVQA69M | HowToVQA69M consists of 69M video clips with 69M questions, each clip about 2 minutes long; the question-answer pairs are generated automatically from narrated videos. |
| 2021 | iVQA | iVQA consists of 10K videos with 10K questions, each video about 8 minutes long. |
| 2021 | PanoAVQA | PanoAVQA consists of 360-degree panoramic videos: 5.4K videos in total, with 20K spatial and 31.7K audio-visual QA pairs. |
| 2021 | AGQA | Action Genome Question Answering (AGQA) is a benchmark for compositional spatio-temporal reasoning. It contains 192M unbalanced question-answer pairs for 9.6K videos, along with a balanced subset of 3.9M question-answer pairs. |
| 2021 | Video-QAP | Video-QAP consists of 35K web videos with 162K questions, each video about 36.2 seconds long. |
| 2021 | KnowIT-X-VQA | An extension of the KnowIT VQA dataset, consisting of 12.1K TV video clips and 21.4K questions. |
| 2021 | Charades-SRL-QA | Charades-SRL-QA consists of 9.5K home-made Charades videos with 71K questions, each video about 29 seconds long. |
| 2021 | NExTQA | The NExT-QA dataset comprises 5,440 videos, split into 3,870 for training, 570 for validation, and 1,000 for testing. It features around 52,044 question-answer pairs, with approximately 47,692 for multiple-choice QA and 52,044 for open-ended QA. The questions are divided into three main types: causal questions (48% of the dataset), temporal questions (29%), and descriptive questions (23%). |
| 2021 | LSMDC-QA (Requires request access) | LSMDC-QA (Large Scale Movie Description Challenge) contains 118,081 short video clips extracted from 202 movies. The validation set consists of 7,408 clips, and evaluation is performed on a test set of 1,000 videos from movies disjoint from the training and validation sets. |
| 2021 | Env-QA | Env-QA consists of 23.3K videos collected in the AI2-THOR simulator and 85.1K questions. |
| 2021 | SUTD-TrafficQA | SUTD-TrafficQA is a VideoQA dataset consisting of 10,080 in-the-wild videos with 62,535 annotated QA pairs covering complex traffic scenarios. |
| 2020 | CLEVRER | CLEVRER focuses on temporal reasoning and inference over synthetic videos. It consists of 10K videos (each 5 seconds long) and 305K questions. |
| 2020 | LifeQA | LifeQA consists of videos of day-to-day activities, with 275 video clips and over 2.3K multiple-choice questions. |
| 2020 | How2R-and-How2QA | The How2R and How2QA datasets contain 9,371 and 9,035 episodes, with 24,328 and 21,509 clips averaging around 17 seconds each, divided into training, validation, and testing sets. |
| 2021 | TGIF-QA-R | This dataset consists of 71K GIFs (each about 3 seconds long) and 165K questions; it is an extended version of the TGIF-QA dataset. |
| 2020 | DramaQA | DramaQA dataset is built upon the TV drama "Another Miss Oh" and it contains 17,983 QA pairs from 23,928 various-length video clips, with each QA pair belonging to one of four difficulty levels. |
| 2020 | KnowITVQA | KnowITVQA is a video dataset with 24,282 human-generated question-answer pairs about The Big Bang Theory, spanning 207 videos of about 20 minutes each. |
| 2020 | V2C-QA | The Video2Commonsense (V2C) QA dataset is built from web videos for captioning and VideoQA; it consists of 1.5K videos and 37K questions. |
| 2020 | PsTuts-VQA | The PsTuts dataset includes 76 videos (5.6 hours in total), 17,768 question-answer pairs, and a domain knowledge base with 1,236 entities and 2,196 options. It focuses on video tutorials. |
| 2019 | Social-IQ | This Kaggle repository hosts the Social-IQ dataset, which contains 1,250 natural in-the-wild social situations, 7,500 questions, and 52,500 correct and incorrect answers. |
| 2019 | TutorialVQAD | TutorialVQAD consists of tutorials for image-editing software, with 76 videos and 6,195 questions in total. |
| 2019 | AVSD (Audio-Visual Scene-Aware Dialog) | AVSD is a dialog dataset grounded in the Charades human-activity videos, comprising dialogs about 11,816 short indoor videos (avg. length ~30 s; at least 2 actions per video). Each dialog discusses the video’s events and objects across multiple turns. |
| 2019 | Moments in Time Dataset | The Moments in Time dataset consists of one million videos, each 3 seconds long, with 339 different classes. |
| 2018 | TVQA | TVQA is a large-scale video question-answering dataset built from six popular TV shows, including Friends, The Big Bang Theory, and How I Met Your Mother. It contains 152.5K QA pairs sourced from 21.8K video clips, covering over 460 hours of content. |
| 2018 | SVQA | The SVQA dataset consists of attribute-comparison, counting, integer-comparison, existence, and query-type questions over synthetic videos; it contains almost 12K videos and 118K questions. |
| 2018 | YouCook2 | YouCook2 is one of the largest instructional video datasets focused on task-oriented cooking, featuring 2,000 untrimmed videos from 89 recipes, with an average of 22 videos per recipe. Each video, averaging 5.26 minutes and totalling 176 hours, includes annotated procedure steps with their corresponding temporal boundaries. |
| 2018 | TVQA+ | TVQA+ includes 29.4K multiple-choice questions grounded in both temporal and spatial domains. A set of visual concept words—objects and people—are identified to collect spatial groundings, and corresponding object regions in individual frames are annotated with bounding boxes. |
| 2017 | TGIF-QA | TGIF-QA, a large-scale dataset, contains 165K question-answer pairs based on animated GIFs, testing video-based Visual Question Answering (VQA) across four question types: Repetition Count, Repeating Action, State Transition, and Frame QA. |
| 2017 | MarioQA | MarioQA is a dataset specifically designed for video-based question-answering in the context of Super Mario Bros. gameplay, containing over 70,000 question-answer pairs linked to gameplay footage. |
| 2017 | VideoQA | The VideoQA dataset consists of 18,100 automatically crawled user-generated videos with titles and 174K questions; each video is about 90 seconds long. |
| 2017 | Something-Something v1 & v2 | Something-Something is a collection of 220,847 labelled video clips of humans performing predefined basic actions with everyday objects. The dataset comprises 220,847 videos divided into a training set of 168,913, a validation set of 24,777, and a test set of 27,157 (without labels), totalling 174 unique labels. |
| 2016 | MSVD-QA | The MSVD-QA dataset is a Video Question Answering (VideoQA) dataset derived from the Microsoft Research Video Description (MSVD) dataset, which includes around 120K sentences describing over 2,000 video snippets. The dataset includes 1,970 video clips and approximately 50.5K QA pairs. |
| 2016 | MSRVTT-QA | MSRVTT-QA consists of 10K web video clips with a total duration of 41.2 hours. It spans 200k clip-sentence pairs. Each video clip is annotated with about 20 natural sentences. |
| 2016 | MovieQA | The MovieQA dataset is designed for movie question answering, aimed at evaluating automatic story comprehension through both video and text. It contains nearly 15,000 multiple-choice questions derived from over 400 movies. |
| 2016 | PororoQA | The Pororo dataset based on children's cartoons features a simple story structure with episodes averaging 7.2 minutes, where similar events are frequently repeated. The dataset comprises 8,834 QA pairs, with an average of 51.66 questions per episode, excluding ambiguous or unrelated questions. |
| 2015 | VideoQA (FIB) | This fill-in-the-blank VideoQA dataset draws from multiple sources, with 109K video clips spanning over 1,000 hours and 390,744 questions. |
| 2014 | Activity Net | ActivityNet is a large-scale video benchmark for human activity understanding. ActivityNet aims to cover a wide range of complex human activities. ActivityNet provides samples from 203 activity classes with an average of 137 untrimmed videos per class and 1.41 activity instances per video, for a total of 849 video hours. |
| 2013 | YouTube2Text-QA | The YouTube2Text data consists of 1,987 videos with 122,708 short descriptions. |
| Model Name | Links |
|---|---|
| InternVL | Hugging Face, GitHub |
| LLaVA | Hugging Face, GitHub |
| LITA | GitHub |
| End2End ChatBot | Hugging Face, GitHub |
| VideoLLaMA2 | Hugging Face, GitHub |
| FrozenBiLM | GitHub |
| PerceiverIO | Hugging Face, GitHub |
| InstructBlipVideo | Hugging Face, GitHub |
| VideoGPT | Hugging Face, GitHub |
| Qwen2-VL | Hugging Face, GitHub |
| ViLA | GitHub |
| LongVLM | GitHub |
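To try one of the open models listed above locally, the sketch below follows the usage pattern documented on the Qwen2-VL model card on Hugging Face. It assumes a recent `transformers` release that ships `Qwen2VLForConditionalGeneration`, the companion `qwen-vl-utils` package, a GPU able to host the 7B checkpoint, and a local `clip.mp4` (placeholder path); treat it as a starting point rather than a reference implementation.

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # helper distributed with the Qwen2-VL model card

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Build a chat message containing the video and the question.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/clip.mp4", "fps": 1.0},  # placeholder path
        {"type": "text", "text": "What is the person in the video doing?"},
    ],
}]

# Render the chat template and extract sampled video frames for the processor.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens (the answer).
output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```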
| Model Name | API Link |
|---|---|
| ChatGPT | Here |
| Gemini | Here |
| Llama 3.2 | Here |
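If a local GPU is not available, a purely API-based baseline is to uniformly sample a handful of frames and send them to a vision-capable chat model using one of the API keys listed above. The sketch below assumes the OpenAI Python SDK and OpenCV; `clip.mp4`, the number of frames, and the question are placeholders, and the same frame-sampling pattern can be adapted to the Gemini or Llama APIs.

```python
import base64
import cv2
from openai import OpenAI

def sample_frames(path, num_frames=8):
    """Uniformly sample frames from a video and return them as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // num_frames, 1)
    frames = []
    for idx in range(0, total, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        ok, buf = cv2.imencode(".jpg", frame)
        if ok:
            frames.append(base64.b64encode(buf.tobytes()).decode("utf-8"))
    cap.release()
    return frames[:num_frames]

client = OpenAI()  # expects OPENAI_API_KEY in the environment
question = "How many vehicles pass through the intersection?"  # placeholder question
content = [{"type": "text", "text": question}] + [
    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
    for b64 in sample_frames("clip.mp4")  # placeholder path
]
response = client.chat.completions.create(
    model="gpt-4o", messages=[{"role": "user", "content": content}]
)
print(response.choices[0].message.content)
```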