UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics

Paper accepted to CVPR 2026 Findings


Joseph Raj Vishal, Nagasiri Poluri, Katha Naik, Rutuja Patil, Kashyap Hegde Kota, Krishna Vinod, Prithvi Jai Ramesh, Mohammad Farhadi, Yezhou Yang, Bharatesh Chakravarthi
Arizona State University

UD-VideoQA is a curated, publicly available traffic monitoring dataset gathered using ARGOS Cameras and mobile devices under diverse weather and lighting conditions. It comprises 8 hours of real-world footage from multiple intersections, segmented into 10-second clips, and features over 25,000 question-answer pairs covering spatiotemporal dynamics, vehicle interactions, and incident detection. This dataset enables the benchmarking and enhancement of VideoQA models for intelligent transportation systems. It includes five QA types: (1) attribution, (2) counting, (3) event reasoning, (4) reverse reasoning, and (5) counterfactual inference.

Abstract

Traffic monitoring is crucial for urban mobility, road safety, and intelligent transportation systems (ITS). Deep learning has advanced video-based traffic monitoring through video question answering (VideoQA) models, enabling structured insight extraction from traffic videos. However, existing VideoQA models struggle with the complexity of real-world traffic scenes, where multiple concurrent events unfold across spatiotemporal dimensions. To address these challenges, this paper introduces UD-VideoQA, a curated dataset designed to benchmark and enhance VideoQA models for traffic monitoring tasks. The UD-VideoQA dataset comprises 8 hours of real-world traffic footage collected from diverse intersections, segmented into 10-second video clips, with over 25,000 question-answer (QA) pairs covering spatiotemporal dynamics, vehicle interactions, incident detection, and other critical traffic attributes. State-of-the-art VideoQA models are evaluated on UD-VideoQA, exposing challenges in reasoning over fine-grained spatiotemporal dependencies within complex traffic scenarios. Additionally, fine-tuning these models on UD-VideoQA yields notable performance improvements, demonstrating the necessity of domain-specific datasets for VideoQA. UD-VideoQA is publicly available as a benchmark dataset to facilitate future research in real-world-deployable VideoQA models for intelligent transportation systems.

Dataset Overview


UD-VideoQA - Data Collection Setup and Diversity


Overview of the UD-VideoQA data collection framework, integrating traffic video recording and processing with a hybrid approach combining manual labeling and LLM-based automation. The pipeline segments eight hours of footage into 10-second clips, extracts key metadata (e.g., vehicle attributes, movement patterns, pedestrian data), and generates structured question-answer pairs covering attribution, counting, reverse reasoning, event reasoning, and counterfactual inference.
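The segmentation step described above is simple windowing arithmetic: eight hours of footage cut into consecutive 10-second clips yields 2,880 clips. A minimal sketch (the function name and the choice to drop any short trailing remainder are illustrative, not taken from the released pipeline):

```python
def segment_clips(total_seconds, clip_len=10):
    """Split a recording into consecutive fixed-length (start, end) windows,
    dropping any remainder shorter than clip_len."""
    return [(s, s + clip_len) for s in range(0, total_seconds - clip_len + 1, clip_len)]

clips = segment_clips(8 * 3600)  # 8 hours of footage
print(len(clips))       # 2880 ten-second clips
print(clips[0], clips[-1])
```

In practice the actual cutting would be done by a tool such as ffmpeg against these window boundaries; the sketch only shows how the clip count follows from the stated durations.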


UD-VideoQA - Video Question Generation Leaderboard


Each cell lists the five per-category scores BU / Atr / ER / RR / CI.

| Model Name | Relevance | Answerability | Diversity | Pedestrian Centric | Vehicle Centric | Specific (Grounded) | Generic |
|---|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | 92/84/82/86/80 | 90/80/80/84/82 | 64/62/66/64/60 | 35/30/45/40/50 | 55/60/50/55/45 | 70/80/65/60/55 | 30/20/35/40/45 |
| Qwen3 Max | 82/84/84/85/85 | 92/88/88/88/88 | 50/50/48/50/50 | 33/41/49/46/43 | 67/59/51/54/57 | 38/45/56/57/49 | 62/55/44/43/51 |
| Gemini Flash | 78/84/82/84/80 | 77/81/80/83/82 | 64/43/50/47/41 | 36/43/50/47/41 | 64/57/50/53/59 | 39/45/58/56/50 | 61/55/42/44/51 |
| Qwen3-VL-235B | 70/76/79/81/74 | 76/80/80/82/77 | 56/59/62/63/58 | 31/44/49/46/42 | 69/48/55/58/50 | 42/48/55/58/50 | 58/52/45/42/50 |
| Gemini 2.5 Flash | 74/70/72/68/72 | 86/85/84/83/85 | 34/34/34/33/33 | 38/44/49/46/42 | 62/56/51/54/58 | 41/47/55/57/46 | 59/53/45/43/54 |
| Qwen3-VL-30B | 72/78/77/79/74 | 76/80/79/81/78 | 47/48/50/50/50 | 34/43/51/48/45 | 66/57/49/52/55 | 39/46/56/58/50 | 61/54/44/42/50 |
| GPT-5 | 71/73/75/77/72 | 77/79/80/82/78 | 53/55/57/59/55 | 34/42/47/46/43 | 66/58/53/54/57 | 38/44/56/58/48 | 62/58/44/42/52 |
| GPT-4o | 39/41/38/34/35 | 52/58/53/60/51 | 7/10/8/11/9 | 19/13/45/38/50 | 51/85/53/61/40 | 0/1/17/56/77 | 100/98/82/43/23 |

The VideoQGen Benchmark. Evaluating multimodal models across reasoning types, including basic understanding (BU), attribution (Atr), event reasoning (ER), reverse reasoning (RR), and counterfactual inference (CI). Gemini 2.5 Pro leads overall across relevance, answerability, and diversity dimensions, showing strong consistency across all reasoning categories.


Overview of the UD-VideoQA dataset, which comprises 28,800 question-answer pairs across various reasoning categories. A higher concentration appears in counting, attribute recognition, and event reasoning, followed by counterfactual inference and reverse reasoning (3a). Figures 3(b)-(d) illustrate the dataset's emphasis on vehicular-related questions, the dominance of attribution and event reasoning categories, and the distribution of question types ("what," "where," and "how"). This structured approach supports the analysis of complex, multi-event traffic scenarios, requiring robust spatio-temporal reasoning. A rigorous human and GPT-assisted validation process ensures the consistency, accuracy, and reliability of all annotations.


The VideoQA Benchmark


Each cell lists accuracy (%) as BU / Atr / ER / RR / CI / Overall.

| Model Type | Model Name | Morning | Afternoon | Evening |
|---|---|---|---|---|
| Proprietary | Gemini 2.5 Pro | 90.00/22.22/88.89/86.10/91.70/75.78 | 78.27/65.44/75.42/73.81/72.56/73.10 | 87.39/50.00/62.09/76.37/79.80/71.13 |
| Proprietary | Gemini 2.5 Flash | 80.00/22.22/77.78/77.78/89.88/69.53 | 82.87/75.61/74.91/75.65/65.85/74.98 | 50.00/50.00/68.68/72.53/61.60/60.56 |
| Proprietary | GPT-5 | 80.00/25.00/50.00/63.89/82.22/60.22 | 74.83/69.96/70.68/70.13/68.90/70.90 | 58.79/50.00/47.80/54.94/58.50/54.01 |
| Proprietary | GPT-4o | 75.00/25.00/69.44/47.22/85.88/60.51 | 58.74/19.68/35.32/22.56/45.61/36.38 | 50.00/8.24/56.59/58.79/76.38/50.00 |
| Open Source | Qwen 2.5 32B | 75.66/36.11/66.67/25.00/77.78/56.24 | 64.37/48.59/53.66/44.17/51.83/52.52 | 72.07/40.78/45.96/78.03/53.16/58.00 |
| Open Source | Qwen 2.5 7B (Fine-Tuned) | 74.80/16.67/75.56/50.00/77.78/58.96 | 63.22/53.11/54.27/39.88/53.66/52.83 | 66.07/71.30/56.59/58.79/54.17/61.38 |
| Open Source | VideoLLaMA3 | 65.60/22.22/58.33/58.33/69.34/54.76 | 59.77/36.72/49.39/28.83/48.78/44.70 | 66.21/50.00/53.84/58.79/49.87/55.74 |
| Open Source | NVILA 8B | 55.32/22.22/72.22/47.22/53.33/50.06 | 64.37/33.90/54.88/30.06/52.44/47.13 | 52.31/49.82/49.19/45.47/50.00/49.44 |
| Open Source | InternVL3 38B | 70.00/36.11/47.22/66.67/67.22/57.44 | 61.49/36.72/46.95/26.99/59.15/46.26 | 38.33/47.22/53.33/65.00/40.02/48.78 |
| Open Source | LLaVA-NeXT-Video 7B | 44.80/2.78/63.89/33.33/22.22/33.40 | 45.40/16.95/23.78/16.56/25.61/25.66 | 30.10/1.10/54.40/23.63/5.49/22.94 |

The VideoQA Benchmark. Comparative performance of proprietary and open-source video reasoning models evaluated under varying illumination conditions: high intensity (afternoon), medium intensity (morning), and low intensity (evening). Overall, Gemini 2.5 Pro achieved the highest accuracy in the Morning (75.78%) and Evening (71.13%) conditions, while Gemini 2.5 Flash scored the highest in the Afternoon (74.98%). LLaVA-NeXT-Video 7B exhibited the weakest overall performance.
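The Overall figures in the benchmark appear consistent with an unweighted mean of the five per-category accuracies; this is an inference from the published numbers, not a formula stated on this page. A quick check against two reported rows:

```python
def overall(scores):
    """Unweighted mean of per-category accuracies, rounded to 2 decimals
    (assumed aggregation rule, inferred from the leaderboard numbers)."""
    return round(sum(scores) / len(scores), 2)

gemini_pro_morning = [90.00, 22.22, 88.89, 86.10, 91.70]    # BU, Atr, ER, RR, CI
gemini_flash_morning = [80.00, 22.22, 77.78, 77.78, 89.88]

print(overall(gemini_pro_morning))    # 75.78, matching the reported Overall
print(overall(gemini_flash_morning))  # 69.53, matching the reported Overall
```

Note that under this aggregation every category carries equal weight, so a single weak category (e.g. attribution for most models) pulls Overall down sharply regardless of question counts per category.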


Sample Data Recordings


BibTeX

@misc{vishal2026udvideoqatrafficvideoquestion,
      title={UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics}, 
      author={Joseph Raj Vishal and Nagasiri Poluri and Katha Naik and Rutuja Patil and Kashyap Hegde Kota and Krishna Vinod and Prithvi Jai Ramesh and Mohammad Farhadi and Yezhou Yang and Bharatesh Chakravarthi},
      year={2026},
      eprint={2602.21137},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.21137}, 
}