Title: TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision

Jun 11, 2025

Ayush Gupta, Anirban Roy, Rama Chellappa, Nathaniel D. Bastian, Alvaro Velasquez, Susmit Jha

Abstract: We address the problem of video question answering (video QA) with temporal grounding in a weakly supervised setup, without any temporal annotations. Given a video and a question, we generate an open-ended answer grounded with the start and end time. For this task, we propose TOGA: a vision-language model for Temporally Grounded Open-Ended Video QA with Weak Supervision. We instruct-tune TOGA to jointly generate the answer and the temporal grounding. We operate in a weakly supervised setup where the temporal grounding annotations are not available. We generate pseudo labels for temporal grounding and ensure the validity of these labels by imposing a consistency constraint between the question of a grounding response and the response generated by a question referring to the same temporal segment. We notice that jointly generating the answers with the grounding improves performance on question answering as well as grounding. We evaluate TOGA on grounded QA and open-ended QA tasks. For grounded QA, we consider the NExT-GQA benchmark, which is designed to evaluate weakly supervised grounded question answering. For open-ended QA, we consider the MSVD-QA and ActivityNet-QA benchmarks. We achieve state-of-the-art performance for both tasks on these benchmarks.
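The abstract describes the consistency constraint on pseudo labels only at a high level. One possible reading, sketched below under loud assumptions, is a filter that keeps a pseudo-labeled segment only when a question referring to that same segment yields an answer that agrees with the grounded answer. Everything named here (`PseudoLabel`, `filter_consistent`, the `answer_in_segment` callback, and the exact-match agreement test) is hypothetical and chosen for illustration; it is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PseudoLabel:
    """Hypothetical container for one pseudo-labeled training example."""
    question: str
    answer: str
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def filter_consistent(
    labels: List[PseudoLabel],
    answer_in_segment: Callable[[str, float, float], str],
) -> List[PseudoLabel]:
    """Keep labels whose segment-referring answer matches the grounded answer.

    `answer_in_segment(question, start, end)` is an assumed stand-in for a
    model call that answers the question restricted to [start, end].
    Exact match after normalization is an assumption made for simplicity.
    """
    kept = []
    for label in labels:
        if label.end <= label.start:
            continue  # discard degenerate segments
        referred = answer_in_segment(label.question, label.start, label.end)
        if normalize(referred) == normalize(label.answer):
            kept.append(label)
    return kept
```

In practice, the agreement test would likely be softer than exact string match (for example, a similarity score), but the abstract does not specify it, so the sketch leaves that choice open.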
