UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics

Paper accepted to CVPR 2026 Findings


Joseph Raj Vishal, Nagasiri Poluri, Katha Naik, Rutuja Patil, Kashyap Hegde Kota, Krishna Vinod, Prithvi Jai Ramesh, Mohammad Farhadi, Yezhou Yang, Bharatesh Chakravarthi
Arizona State University

UD-VideoQA is a curated, publicly available traffic monitoring dataset gathered using ARGOS Cameras and mobile devices under diverse weather and lighting conditions. It comprises 8 hours of real-world footage from multiple intersections, segmented into 10-second clips, and features over 25,000 question-answer pairs covering spatiotemporal dynamics, vehicle interactions, and incident detection. This dataset enables the benchmarking and enhancement of VideoQA models for intelligent transportation systems. It includes five QA types: (1) attribution, (2) counting, (3) event reasoning, (4) reverse reasoning, and (5) counterfactual inference.

Abstract

Traffic monitoring is crucial for urban mobility, road safety, and intelligent transportation systems (ITS). Deep learning has advanced video-based traffic monitoring through video question answering (VideoQA) models, enabling structured insight extraction from traffic videos. However, existing VideoQA models struggle with the complexity of real-world traffic scenes, where multiple concurrent events unfold across spatiotemporal dimensions. To address these challenges, this paper introduces UD-VideoQA, a curated dataset designed to benchmark and enhance VideoQA models for traffic monitoring tasks. The UD-VideoQA dataset comprises 8 hours of real-world traffic footage collected from diverse intersections, segmented into 10-second video clips, with over 25,000 question-answer (QA) pairs covering spatiotemporal dynamics, vehicle interactions, incident detection, and other critical traffic attributes. State-of-the-art VideoQA models are evaluated on UD-VideoQA, exposing challenges in reasoning over fine-grained spatiotemporal dependencies within complex traffic scenarios. Additionally, fine-tuning these models on UD-VideoQA yields notable performance improvements, demonstrating the necessity of domain-specific datasets for VideoQA. UD-VideoQA is publicly available as a benchmark dataset to facilitate future research in real-world-deployable VideoQA models for intelligent transportation systems.

Dataset Overview


UD-VideoQA - Data Collection Setup and Diversity


Overview of the UD-VideoQA data collection framework, integrating traffic video recording and processing with a hybrid approach combining manual labeling and LLM-based automation. The pipeline segments eight hours of footage into 10-second clips, extracts key metadata (e.g., vehicle attributes, movement patterns, pedestrian data), and generates structured question-answer pairs covering attribution, counting, reverse reasoning, event reasoning, and counterfactual inference.
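The segmentation step described above is simple windowing arithmetic: eight hours of footage cut into consecutive 10-second clips yields 2,880 clips. A minimal sketch (the function name and the choice to drop any short trailing remainder are illustrative, not taken from the released pipeline):

```python
def segment_clips(total_seconds, clip_len=10):
    """Split a recording into consecutive fixed-length (start, end) windows,
    dropping any remainder shorter than clip_len."""
    return [(s, s + clip_len) for s in range(0, total_seconds - clip_len + 1, clip_len)]

clips = segment_clips(8 * 3600)  # 8 hours of footage
print(len(clips))       # 2880 ten-second clips
print(clips[0], clips[-1])
```

In practice the actual cutting would be done by a tool such as ffmpeg against these window boundaries; the sketch only shows how the clip count follows from the stated durations.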


UD-VideoQA - Video Question Generation Leaderboard


Each cell lists the five per-category scores BU / Atr / ER / RR / CI.

| Model Name | Relevance | Answerability | Diversity | Pedestrian Centric | Vehicle Centric | Specific (Grounded) | Generic |
|---|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | 92/84/82/86/80 | 90/80/80/84/82 | 64/62/66/64/60 | 35/30/45/40/50 | 55/60/50/55/45 | 70/80/65/60/55 | 30/20/35/40/45 |
| Qwen3 Max | 82/84/84/85/85 | 92/88/88/88/88 | 50/50/48/50/50 | 33/41/49/46/43 | 67/59/51/54/57 | 38/45/56/57/49 | 62/55/44/43/51 |
| Gemini Flash | 78/84/82/84/80 | 77/81/80/83/82 | 64/43/50/47/41 | 36/43/50/47/41 | 64/57/50/53/59 | 39/45/58/56/50 | 61/55/42/44/51 |
| Qwen3-VL-235B | 70/76/79/81/74 | 76/80/80/82/77 | 56/59/62/63/58 | 31/44/49/46/42 | 69/48/55/58/50 | 42/48/55/58/50 | 58/52/45/42/50 |
| Gemini 2.5 Flash | 74/70/72/68/72 | 86/85/84/83/85 | 34/34/34/33/33 | 38/44/49/46/42 | 62/56/51/54/58 | 41/47/55/57/46 | 59/53/45/43/54 |
| Qwen3-VL-30B | 72/78/77/79/74 | 76/80/79/81/78 | 47/48/50/50/50 | 34/43/51/48/45 | 66/57/49/52/55 | 39/46/56/58/50 | 61/54/44/42/50 |
| GPT-5 | 71/73/75/77/72 | 77/79/80/82/78 | 53/55/57/59/55 | 34/42/47/46/43 | 66/58/53/54/57 | 38/44/56/58/48 | 62/58/44/42/52 |
| GPT-4o | 39/41/38/34/35 | 52/58/53/60/51 | 7/10/8/11/9 | 19/13/45/38/50 | 51/85/53/61/40 | 0/1/17/56/77 | 100/98/82/43/23 |

The VideoQGen Benchmark. Evaluating multimodal models across reasoning types, including basic understanding (BU), attribution (Atr), event reasoning (ER), reverse reasoning (RR), and counterfactual inference (CI). Gemini 2.5 Pro leads overall across relevance, answerability, and diversity dimensions, showing strong consistency across all reasoning categories.


Overview of the UD-VideoQA dataset, which comprises 28,800 question-answer pairs across various reasoning categories. A higher concentration appears in counting, attribute recognition, and event reasoning, followed by counterfactual inference and reverse reasoning (3a). Figures 3(b)-(d) illustrate the dataset's emphasis on vehicular-related questions, the dominance of attribution and event reasoning categories, and the distribution of question types ("what," "where," and "how"). This structured approach supports the analysis of complex, multi-event traffic scenarios, requiring robust spatio-temporal reasoning. A rigorous human and GPT-assisted validation process ensures the consistency, accuracy, and reliability of all annotations.


The VideoQA Benchmark


Each cell lists accuracy (%) as BU / Atr / ER / RR / CI / Overall.

| Model Type | Model Name | Morning | Afternoon | Evening |
|---|---|---|---|---|
| Proprietary | Gemini 2.5 Pro | 90.00/22.22/88.89/86.10/91.70/75.78 | 78.27/65.44/75.42/73.81/72.56/73.10 | 87.39/50.00/62.09/76.37/79.80/71.13 |
| Proprietary | Gemini 2.5 Flash | 80.00/22.22/77.78/77.78/89.88/69.53 | 82.87/75.61/74.91/75.65/65.85/74.98 | 50.00/50.00/68.68/72.53/61.60/60.56 |
| Proprietary | GPT-5 | 80.00/25.00/50.00/63.89/82.22/60.22 | 74.83/69.96/70.68/70.13/68.90/70.90 | 58.79/50.00/47.80/54.94/58.50/54.01 |
| Proprietary | GPT-4o | 75.00/25.00/69.44/47.22/85.88/60.51 | 58.74/19.68/35.32/22.56/45.61/36.38 | 50.00/8.24/56.59/58.79/76.38/50.00 |
| Open Source | Qwen 2.5 32B | 75.66/36.11/66.67/25.00/77.78/56.24 | 64.37/48.59/53.66/44.17/51.83/52.52 | 72.07/40.78/45.96/78.03/53.16/58.00 |
| Open Source | Qwen 2.5 7B (Fine-Tuned) | 74.80/16.67/75.56/50.00/77.78/58.96 | 63.22/53.11/54.27/39.88/53.66/52.83 | 66.07/71.30/56.59/58.79/54.17/61.38 |
| Open Source | VideoLLaMA3 | 65.60/22.22/58.33/58.33/69.34/54.76 | 59.77/36.72/49.39/28.83/48.78/44.70 | 66.21/50.00/53.84/58.79/49.87/55.74 |
| Open Source | NVILA 8B | 55.32/22.22/72.22/47.22/53.33/50.06 | 64.37/33.90/54.88/30.06/52.44/47.13 | 52.31/49.82/49.19/45.47/50.00/49.44 |
| Open Source | InternVL3 38B | 70.00/36.11/47.22/66.67/67.22/57.44 | 61.49/36.72/46.95/26.99/59.15/46.26 | 38.33/47.22/53.33/65.00/40.02/48.78 |
| Open Source | LLaVA-NeXT-Video 7B | 44.80/2.78/63.89/33.33/22.22/33.40 | 45.40/16.95/23.78/16.56/25.61/25.66 | 30.10/1.10/54.40/23.63/5.49/22.94 |

The VideoQA Benchmark. Comparative performance of proprietary and open-source video reasoning models evaluated under varying illumination conditions: high intensity (afternoon), medium intensity (morning), and low intensity (evening). Overall, Gemini 2.5 Pro achieved the highest accuracy in the Morning (75.78%) and Evening (71.13%) conditions, while Gemini 2.5 Flash scored the highest in the Afternoon (74.98%). LLaVA-NeXT-Video 7B exhibited the weakest overall performance.
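The Overall figures in the benchmark appear consistent with an unweighted mean of the five per-category accuracies; this is an inference from the published numbers, not a formula stated on this page. A quick check against two reported rows:

```python
def overall(scores):
    """Unweighted mean of per-category accuracies, rounded to 2 decimals
    (assumed aggregation rule, inferred from the leaderboard numbers)."""
    return round(sum(scores) / len(scores), 2)

gemini_pro_morning = [90.00, 22.22, 88.89, 86.10, 91.70]    # BU, Atr, ER, RR, CI
gemini_flash_morning = [80.00, 22.22, 77.78, 77.78, 89.88]

print(overall(gemini_pro_morning))    # 75.78, matching the reported Overall
print(overall(gemini_flash_morning))  # 69.53, matching the reported Overall
```

Note that under this aggregation every category carries equal weight, so a single weak category (e.g. attribution for most models) pulls Overall down sharply regardless of question counts per category.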


Sample Data Recordings


BibTeX

@misc{vishal2026udvideoqatrafficvideoquestion,
      title={UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics}, 
      author={Joseph Raj Vishal and Nagasiri Poluri and Katha Naik and Rutuja Patil and Kashyap Hegde Kota and Krishna Vinod and Prithvi Jai Ramesh and Mohammad Farhadi and Yezhou Yang and Bharatesh Chakravarthi},
      year={2026},
      eprint={2602.21137},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.21137}, 
}