Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

1PKU     2ByteDance    3CASIA    4WHU    5NUS   

TL;DR: Open-o3 Video integrates explicit spatio-temporal evidence (key timestamps and bounding boxes) into video reasoning through curated STGR datasets and a two-stage SFT-RL training strategy, achieving state-of-the-art results on V-STAR and delivering verifiable, reliable reasoning for video understanding.

Abstract

Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging, as it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning, and we carefully collect training data and design training strategies to address these challenges. The model highlights key timestamps, objects, and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. To enable this functionality, we first curate two high-quality datasets, STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed temporal and spatial annotations, since most existing datasets offer either temporal spans for videos or spatial boxes on images and lack unified spatio-temporal supervision and reasoning traces. We then adopt a cold-start reinforcement learning strategy with multiple specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On the V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline. Consistent improvements are also observed on a broad range of video understanding benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench. Beyond accuracy, the reasoning traces produced by Open-o3 Video also provide valuable signals for test-time scaling, enabling confidence-aware verification and improving answer reliability.

Demos

Each pair shows the input video (left) and the corresponding spatio-temporal grounded reasoning visualization (right). Our model not only provides textual reasoning but also highlights when (temporal evidence, i.e. timestamps) and where (spatial evidence, i.e. bounding boxes) the key events occur in the video, offering explicit, interpretable visual traces that ground the reasoning process. These examples illustrate how Open-o3 Video connects abstract reasoning with concrete, observable evidence.

[Six demo pairs (Demos 2-7): input video on the left, spatio-temporal grounded reasoning visualization (GIF) on the right.]

Model Training

Stage 1: Cold-start supervised fine-tuning (SFT) on STGR-CoT-30k equips the model with basic grounded reasoning ability.
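For concreteness, a grounded-reasoning training sample can be pictured as the minimal Python sketch below. The field names, the file path, and the evidence markup (timestamps with [x1, y1, x2, y2] boxes embedded in the trace) are illustrative assumptions, not the actual STGR-CoT-30k schema.

# Hypothetical layout of one spatio-temporal grounded reasoning sample.
# Field names and evidence markup are assumptions for illustration only.
sample = {
    "video": "videos/demo_0042.mp4",  # placeholder path
    "question": "What does the person pick up before opening the fridge?",
    "reasoning": (
        "<think>At 12.5s the person reaches toward the counter and grasps "
        "a red cup [215, 180, 340, 405]; at 15.0s the fridge door starts "
        "to open [60, 90, 250, 470].</think>"
    ),
    "answer": "A red cup.",
}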

Stage 2: Reinforcement learning with Group Sequence Policy Optimization (GSPO) stabilizes long-horizon optimization. We propose adaptive temporal proximity and temporal gating in the thinking reward design (a rough sketch follows below).
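As a rough illustration of how these two terms could interact, the sketch below assumes a Gaussian-style proximity score whose tolerance widens with video length, and a gate that only credits spatial IoU when the predicted timestamp lands near the ground truth. The exact formulation used in the paper may differ.

import math

def temporal_proximity(t_pred, t_gt, sigma):
    # Gaussian-style proximity: 1 at the ground-truth timestamp, decaying with distance.
    return math.exp(-((t_pred - t_gt) ** 2) / (2 * sigma ** 2))

def box_iou(a, b):
    # Standard IoU between two [x1, y1, x2, y2] boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def thinking_reward(t_pred, t_gt, box_pred, box_gt, video_len,
                    base_sigma=2.0, gate_threshold=1.0):
    # Adaptive proximity: widen the tolerance window for longer videos (assumption).
    sigma = base_sigma * max(1.0, video_len / 60.0)
    r_time = temporal_proximity(t_pred, t_gt, sigma)
    # Temporal gating: only count the spatial reward when the predicted
    # timestamp is close enough to the ground-truth moment.
    r_space = box_iou(box_pred, box_gt) if abs(t_pred - t_gt) <= gate_threshold else 0.0
    return r_time, r_space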

Inference Time Scaling


Figure: The explicit evidence traces support evidence-aware test-time scaling. At inference, we generate N responses, each containing spatio-temporal traces. We crop the referenced regions and recheck their relevance to the question, and the final prediction is obtained by confidence-weighted voting over the resulting feedback scores. This confidence-aware scaling reduces hallucination and improves robustness compared to simple majority voting.
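A minimal sketch of this confidence-weighted voting is shown below. Here generate_response and verify_evidence are hypothetical placeholders for the model's sampling step and the cropped-region relevance check; they are not APIs from the released code.

from collections import defaultdict

def confidence_weighted_vote(video, question, generate_response, verify_evidence, n_samples=8):
    # generate_response(video, question) -> (answer, evidence), where evidence is a
    # list of (timestamp, box) pairs; verify_evidence(video, question, evidence) ->
    # confidence score in [0, 1]. Both callables are hypothetical placeholders.
    scores = defaultdict(float)
    for _ in range(n_samples):
        answer, evidence = generate_response(video, question)
        confidence = verify_evidence(video, question, evidence)
        scores[answer] += confidence  # weight each vote by its evidence score
    # Pick the answer with the highest accumulated confidence, not the raw majority.
    return max(scores, key=scores.get)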

Experimental Results


Figure: Performance on the V-STAR benchmark, which evaluates spatio-temporal reasoning along three dimensions. Chain1 denotes the what-when-where order, while Chain2 corresponds to what-where-when. mAM is the mean Arithmetic Mean and mLGM the mean (modified) Logarithmic Geometric Mean, both combining temporal and spatial alignment. Open-o3 Video sets a new state of the art, improving mAM by +14.4% and mLGM by +24.2% and surpassing GPT-4o and Gemini-2-Flash. These results demonstrate that our approach brings significant advances in temporal and spatial grounding.


Figure: Performance across different video understanding and temporal grounding benchmarks. Open-o3 Video achieves comparable or even superior results to other video reasoning models, while providing more intuitive spatio-temporal evidence.

BibTeX

@article{meng2025open-o3,
  title={Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence},
  author={Jiahao Meng and Xiangtai Li and Haochen Wang and Yue Tan and Tao Zhang and Lingdong Kong and Yunhai Tong and Anran Wang and Zhiyang Teng and Yujing Wang and Zhuochen Wang},
  journal={arXiv preprint arXiv:2510.20579},
  year={2025}
}