Kristen Grauman

Progress-Aware Video Frame Captioning

Dec 03, 2024

Zihui Xue, Joungbin An, Xitong Yang, Kristen Grauman

Abstract:While image captioning provides isolated descriptions for individual images, and video captioning offers one single narrative for an entire video clip, our work explores an important middle ground: progress-aware video captioning at the frame level. This novel task aims to generate temporally fine-grained captions that not only accurately describe each frame but also capture the subtle progression of actions throughout a video sequence. Despite the strong capabilities of existing leading vision language models, they often struggle to discern the nuances of frame-wise differences. To address this, we propose ProgressCaptioner, a captioning model designed to capture the fine-grained temporal dynamics within an action sequence. Alongside, we develop the FrameCap dataset to support training and the FrameCapEval benchmark to assess caption quality. The results demonstrate that ProgressCaptioner significantly surpasses leading captioning models, producing precise captions that accurately capture action progression and set a new standard for temporal precision in video captioning. Finally, we showcase practical applications of our approach, specifically in aiding keyframe selection and advancing video understanding, highlighting its broad utility.

* Project website: https://vision.cs.utexas.edu/projects/ProgressCaptioner/

FIction: 4D Future Interaction Prediction from Video

Dec 01, 2024

Kumar Ashutosh, Georgios Pavlakos, Kristen Grauman

Abstract:Anticipating how a person will interact with objects in an environment is essential for activity understanding, but existing methods are limited to the 2D space of video frames, capturing physically ungrounded predictions of 'what' and ignoring the 'where' and 'how'. We introduce 4D future interaction prediction from videos. Given an input video of a human activity, the goal is to predict what objects at what 3D locations the person will interact with in the next time period (e.g., cabinet, fridge), and how they will execute that interaction (e.g., poses for bending, reaching, pulling). We propose a novel model, FIction, that fuses the past video observation of the person's actions and their environment to predict both the 'where' and 'how' of future interactions. Through comprehensive experiments on a variety of activities and real-world environments in Ego-Exo4D, we show that our proposed approach outperforms prior autoregressive and (lifted) 2D video models substantially, with more than 30% relative gains.

* Technical report

Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Videos

Nov 13, 2024

Sagnik Majumder, Tushar Nagarajan, Ziad Al-Halah, Reina Pradhan, Kristen Grauman

Abstract:Given a multi-view video, which viewpoint is most informative for a human observer? Existing methods rely on heuristics or expensive "best-view" supervision to answer this question, limiting their applicability. We propose a weakly supervised approach that leverages language accompanying an instructional multi-view video as a means to recover its most informative viewpoint(s). Our key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is. To put this into action, we propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels. Then, those pseudo-labels are used to train a view selector, together with an auxiliary camera pose predictor that enhances view-sensitivity. During inference, our model takes as input only a multi-view video -- no language or camera poses -- and returns the best viewpoint to watch at each timestep. On two challenging datasets comprising diverse multi-camera setups and how-to activities, our model consistently outperforms state-of-the-art baselines, both with quantitative metrics and human evaluation.
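
The pseudo-labeling idea in this abstract can be illustrated with a rough sketch. This is not the authors' implementation: the function names, the word-overlap score, and the tie handling are all assumptions chosen for illustration, and the paper's actual accuracy measure and captioner are not specified here. The sketch only shows the general pattern of scoring each view's predicted caption against a view-agnostic summary and treating the top-scoring view(s) as best-view pseudo-labels.

```python
from typing import Dict, List


def token_overlap(predicted: str, reference: str) -> float:
    """Hypothetical accuracy proxy: fraction of reference words present in the prediction."""
    ref_words = reference.lower().split()
    if not ref_words:
        return 0.0
    pred_words = set(predicted.lower().split())
    return sum(w in pred_words for w in ref_words) / len(ref_words)


def best_view_pseudo_labels(view_captions: Dict[str, str], summary: str) -> List[str]:
    """Return the view(s) whose caption best matches the view-agnostic summary.

    view_captions maps a (hypothetical) view name to the caption predicted from that view.
    Ties are kept, so more than one view can be labeled as "best".
    """
    if not view_captions:
        return []
    scores = {view: token_overlap(cap, summary) for view, cap in view_captions.items()}
    top = max(scores.values())
    return [view for view, score in scores.items() if score == top]


if __name__ == "__main__":
    captions = {
        "ego": "person cuts the onion",
        "exo1": "person stands in kitchen",
    }
    print(best_view_pseudo_labels(captions, "the person cuts an onion"))
```

In a real pipeline the overlap score would presumably be replaced by a learned or standard captioning metric; the labels produced this way would then supervise a separate view selector, as the abstract describes.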

Human Action Anticipation: A Survey

Oct 17, 2024

Bolin Lai, Sam Toyer, Tushar Nagarajan, Rohit Girdhar, Shengxin Zha, James M. Rehg, Kris Kitani, Kristen Grauman, Ruta Desai, Miao Liu

Abstract:Predicting future human behavior is an increasingly popular topic in computer vision, driven by the interest in applications such as autonomous vehicles, digital assistants and human-robot interactions. The literature on behavior prediction spans various tasks, including action anticipation, activity forecasting, intent prediction, goal prediction, and so on. Our survey aims to tie together this fragmented literature, covering recent technical innovations as well as the development of new large-scale datasets for model training and evaluation. We also summarize the widely-used metrics for different tasks and provide a comprehensive performance comparison of existing approaches on eleven action anticipation datasets. This survey serves as not only a reference for contemporary methodologies in action anticipation, but also a guideline for future research direction of this evolving landscape.

* 30 pages, 9 figures, 12 tables

ExpertAF: Expert Actionable Feedback from Video

Aug 01, 2024

Kumar Ashutosh, Tushar Nagarajan, Georgios Pavlakos, Kris Kitani, Kristen Grauman

Abstract:Feedback is essential for learning a new skill or improving one's current skill-level. However, current methods for skill-assessment from video only provide scores or compare demonstrations, leaving the burden of knowing what to do differently on the user. We introduce a novel method to generate actionable feedback from video of a person doing a physical activity, such as basketball or soccer. Our method takes a video demonstration and its accompanying 3D body pose and generates (1) free-form expert commentary describing what the person is doing well and what they could improve, and (2) a visual expert demonstration that incorporates the required corrections. We show how to leverage Ego-Exo4D's videos of skilled activity and expert commentary together with a strong language model to create a weakly-supervised training dataset for this task, and we devise a multimodal video-language model to infer coaching feedback. Our method is able to reason across multi-modal input combinations to output full-spectrum, actionable coaching -- expert commentary, expert video retrieval, and the first-of-its-kind expert pose generation -- outperforming strong vision-language models on both established metrics and human preference studies.

* Technical report

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Jun 13, 2024

Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman

Abstract:Generating realistic audio for human interactions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals -- resulting in uncontrolled ambient sounds or hallucinations at test time. We propose a novel ambient-aware audio generation model, AV-LDM. We devise a novel audio-conditioning mechanism to learn to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos. Given a novel silent video, our model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally. We train and evaluate our model on two in-the-wild egocentric video datasets Ego4D and EPIC-KITCHENS. Our model outperforms an array of existing methods, allows controllable generation of the ambient sound, and even shows promise for generalizing to computer graphics game clips. Overall, our work is the first to focus video-to-audio generation faithfully on the observed visual content despite training from uncurated clips with natural background sounds.

* Project page: https://vision.cs.utexas.edu/projects/action2sound

HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness

Jun 11, 2024

Zihui Xue, Mi Luo, Changan Chen, Kristen Grauman

Abstract:We study the problem of precisely swapping objects in videos, with a focus on those interacted with by hands, given one user-provided reference object image. Despite the great advancements that diffusion models have made in video editing recently, these models often fall short in handling the intricacies of hand-object interactions (HOI), failing to produce realistic edits -- especially when object swapping results in object shape or functionality changes. To bridge this gap, we present HOI-Swap, a novel diffusion-based video editing framework trained in a self-supervised manner. Designed in two stages, the first stage focuses on object swapping in a single frame with HOI awareness; the model learns to adjust the interaction patterns, such as the hand grasp, based on changes in the object's properties. The second stage extends the single-frame edit across the entire sequence; we achieve controllable motion alignment with the original video by: (1) warping a new sequence from the stage-I edited frame based on sampled motion points and (2) conditioning video generation on the warped sequence. Comprehensive qualitative and quantitative evaluations demonstrate that HOI-Swap significantly outperforms existing methods, delivering high-quality video edits with realistic HOIs.

* Project website: https://vision.cs.utexas.edu/projects/HOI-Swap/

Sim2Real Transfer for Audio-Visual Navigation with Frequency-Adaptive Acoustic Field Prediction

May 05, 2024

Changan Chen, Jordi Ramos, Anshul Tomar, Kristen Grauman

Abstract:Sim2real transfer has received increasing attention lately due to the success of learning robotic tasks in simulation end-to-end. While there has been a lot of progress in transferring vision-based navigation policies, the existing sim2real strategy for audio-visual navigation performs data augmentation empirically without measuring the acoustic gap. The sound differs from light in that it spans across much wider frequencies and thus requires a different solution for sim2real. We propose the first treatment of sim2real for audio-visual navigation by disentangling it into acoustic field prediction (AFP) and waypoint navigation. We first validate our design choice in the SoundSpaces simulator and show improvement on the Continuous AudioGoal navigation benchmark. We then collect real-world data to measure the spectral difference between the simulation and the real world by training AFP models that only take a specific frequency subband as input. We further propose a frequency-adaptive strategy that intelligently selects the best frequency band for prediction based on both the measured spectral difference and the energy distribution of the received audio, which improves the performance on the real data. Lastly, we build a real robot platform and show that the transferred policy can successfully navigate to sounding objects. This work demonstrates the potential of building intelligent agents that can see, hear, and act entirely from simulation, and transferring them to the real world.

ActiveRIR: Active Audio-Visual Exploration for Acoustic Environment Modeling

Apr 24, 2024

Arjun Somayazulu, Sagnik Majumder, Changan Chen, Kristen Grauman

Abstract:An environment acoustic model represents how sound is transformed by the physical characteristics of an indoor environment, for any given source/receiver location. Traditional methods for constructing acoustic models involve expensive and time-consuming collection of large quantities of acoustic data at dense spatial locations in the space, or rely on privileged knowledge of scene geometry to intelligently select acoustic data sampling locations. We propose active acoustic sampling, a new task for efficiently building an environment acoustic model of an unmapped environment in which a mobile agent equipped with visual and acoustic sensors jointly constructs the environment acoustic model and the occupancy map on-the-fly. We introduce ActiveRIR, a reinforcement learning (RL) policy that leverages information from audio-visual sensor streams to guide agent navigation and determine optimal acoustic data sampling positions, yielding a high-quality acoustic model of the environment from a minimal set of acoustic samples. We train our policy with a novel RL reward based on information gain in the environment acoustic model. Evaluating on diverse unseen indoor environments from a state-of-the-art acoustic simulation platform, ActiveRIR outperforms an array of methods -- both traditional navigation agents based on spatial novelty and visual exploration as well as existing state-of-the-art methods.

* Project page: https://vision.cs.utexas.edu/projects/active_rir/

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

Apr 08, 2024

Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman

Abstract:We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos. Whereas existing methods rely on curated data with known audio-visual correspondence, our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree, while diminishing those associations when any one pair does not. We show our approach can successfully discover how the long tail of human actions sound from egocentric video, outperforming an array of recent multimodal embedding techniques on two datasets (Ego4D and EPIC-Sounds) and multiple cross-modal tasks.

* Accepted at CVPR 2024. Project page: https://vision.cs.utexas.edu/projects/soundingactions
