Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Dima Damen

Rank2Reward: Learning Shaped Reward Functions from Passive Video

Apr 23, 2024

Daniel Yang, Davin Tjia, Jacob Berg, Dima Damen, Pulkit Agrawal, Abhishek Gupta

Figure 1 for Rank2Reward: Learning Shaped Reward Functions from Passive Video

Figure 2 for Rank2Reward: Learning Shaped Reward Functions from Passive Video

Figure 3 for Rank2Reward: Learning Shaped Reward Functions from Passive Video

Figure 4 for Rank2Reward: Learning Shaped Reward Functions from Passive Video

Abstract:Teaching robots novel skills with demonstrations via human-in-the-loop data collection techniques like kinesthetic teaching or teleoperation puts a heavy burden on human supervisors. In contrast to this paradigm, it is often significantly easier to provide raw, action-free visual data of tasks being performed. Moreover, this data can even be mined from video datasets or the web. Ideally, this data can serve to guide robot learning for new tasks in novel environments, informing both "what" to do and "how" to do it. A powerful way to encode both the "what" and the "how" is to infer a well-shaped reward function for reinforcement learning. The challenge is determining how to ground visual demonstration inputs into a well-shaped and informative reward function. We propose a technique Rank2Reward for learning behaviors from videos of tasks being performed without access to any low-level states and actions. We do so by leveraging the videos to learn a reward function that measures incremental "progress" through a task by learning how to temporally rank the video frames in a demonstration. By inferring an appropriate ranking, the reward function is able to guide reinforcement learning by indicating when task progress is being made. This ranking function can be integrated into an adversarial imitation learning scheme resulting in an algorithm that can learn behaviors without exploiting the learned reward function. We demonstrate the effectiveness of Rank2Reward at learning behaviors from raw video on a number of tabletop manipulation tasks in both simulations and on a real-world robotic arm. We also demonstrate how Rank2Reward can be easily extended to be applicable to web-scale video datasets.

* ICRA 2024

Via

Access Paper or Ask Questions

HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision

Apr 15, 2024

Siddhant Bansal, Michael Wray, Dima Damen

Figure 1 for HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision

Figure 2 for HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision

Figure 3 for HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision

Figure 4 for HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision

Abstract:Large Vision Language Models (VLMs) are now the de facto state-of-the-art for a number of tasks including visual question answering, recognising objects, and spatial referral. In this work, we propose the HOI-Ref task for egocentric images that aims to understand interactions between hands and objects using VLMs. To enable HOI-Ref, we curate the HOI-QA dataset that consists of 3.9M question-answer pairs for training and evaluating VLMs. HOI-QA includes questions relating to locating hands, objects, and critically their interactions (e.g. referring to the object being manipulated by the hand). We train the first VLM for HOI-Ref on this dataset and call it VLM4HOI. Our results demonstrate that VLMs trained for referral on third person images fail to recognise and refer hands and objects in egocentric images. When fine-tuned on our egocentric HOI-QA dataset, performance improves by 27.9% for referring hands and objects, and by 26.7% for referring interactions.

* Project Page: https://sid2697.github.io/hoi-ref/

Via

Access Paper or Ask Questions

TIM: A Time Interval Machine for Audio-Visual Action Recognition

Apr 09, 2024

Jacob Chalk, Jaesung Huh, Evangelos Kazakos, Andrew Zisserman, Dima Damen

Figure 1 for TIM: A Time Interval Machine for Audio-Visual Action Recognition

Figure 2 for TIM: A Time Interval Machine for Audio-Visual Action Recognition

Figure 3 for TIM: A Time Interval Machine for Audio-Visual Action Recognition

Figure 4 for TIM: A Time Interval Machine for Audio-Visual Action Recognition

Abstract:Diverse actions give rise to rich audio-visual signals in long videos. Recent works showcase that the two modalities of audio and video exhibit different temporal extents of events and distinct labels. We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events. We propose the Time Interval Machine (TIM) where a modality-specific time interval poses as a query to a transformer encoder that ingests a long video input. The encoder then attends to the specified interval, as well as the surrounding context in both modalities, in order to recognise the ongoing action. We test TIM on three long audio-visual video datasets: EPIC-KITCHENS, Perception Test, and AVE, reporting state-of-the-art (SOTA) for recognition. On EPIC-KITCHENS, we beat previous SOTA that utilises LLMs and significantly larger pre-training by 2.9% top-1 action recognition accuracy. Additionally, we show that TIM can be adapted for action detection, using dense multi-scale interval queries, outperforming SOTA on EPIC-KITCHENS-100 for most metrics, and showing strong performance on the Perception Test. Our ablations show the critical role of integrating the two modalities and modelling their time intervals in achieving this performance. Code and models at: https://github.com/JacobChalk/TIM

* Accepted to CVPR 2024. Project Webpage: https://jacobchalk.github.io/TIM-Project

Via

Access Paper or Ask Questions

Spatial Cognition from Egocentric Video: Out of Sight, Not Out of Mind

Apr 07, 2024

Chiara Plizzari, Shubham Goel, Toby Perrett, Jacob Chalk, Angjoo Kanazawa, Dima Damen

Figure 1 for Spatial Cognition from Egocentric Video: Out of Sight, Not Out of Mind

Figure 2 for Spatial Cognition from Egocentric Video: Out of Sight, Not Out of Mind

Figure 3 for Spatial Cognition from Egocentric Video: Out of Sight, Not Out of Mind

Figure 4 for Spatial Cognition from Egocentric Video: Out of Sight, Not Out of Mind

Abstract:As humans move around, performing their daily tasks, they are able to recall where they have positioned objects in their environment, even if these objects are currently out of sight. In this paper, we aim to mimic this spatial cognition ability. We thus formulate the task of Out of Sight, Not Out of Mind - 3D tracking active objects using observations captured through an egocentric camera. We introduce Lift, Match and Keep (LMK), a method which lifts partial 2D observations to 3D world coordinates, matches them over time using visual appearance, 3D location and interactions to form object tracks, and keeps these object tracks even when they go out-of-view of the camera - hence keeping in mind what is out of sight. We test LMK on 100 long videos from EPIC-KITCHENS. Our results demonstrate that spatial cognition is critical for correctly locating objects over short and long time scales. E.g., for one long egocentric video, we estimate the 3D location of 50 active objects. Of these, 60% can be correctly positioned in 3D after 2 minutes of leaving the camera view.

* 21 pages including references and appendix. Project Webpage: http://dimadamen.github.io/OSNOM/

Via

Access Paper or Ask Questions

Every Shot Counts: Using Exemplars for Repetition Counting in Videos

Mar 26, 2024

Saptarshi Sinha, Alexandros Stergiou, Dima Damen

Figure 1 for Every Shot Counts: Using Exemplars for Repetition Counting in Videos

Figure 2 for Every Shot Counts: Using Exemplars for Repetition Counting in Videos

Figure 3 for Every Shot Counts: Using Exemplars for Repetition Counting in Videos

Figure 4 for Every Shot Counts: Using Exemplars for Repetition Counting in Videos

Abstract:Video repetition counting infers the number of repetitions of recurring actions or motion within a video. We propose an exemplar-based approach that discovers visual correspondence of video exemplars across repetitions within target videos. Our proposed Every Shot Counts (ESCounts) model is an attention-based encoder-decoder that encodes videos of varying lengths alongside exemplars from the same and different videos. In training, ESCounts regresses locations of high correspondence to the exemplars within the video. In tandem, our method learns a latent that encodes representations of general repetitive motions, which we use for exemplar-free, zero-shot inference. Extensive experiments over commonly used datasets (RepCount, Countix, and UCFRep) showcase ESCounts obtaining state-of-the-art performance across all three datasets. On RepCount, ESCounts increases the off-by-one from 0.39 to 0.56 and decreases the mean absolute error from 0.38 to 0.21. Detailed ablations further demonstrate the effectiveness of our method.

* Project website: https://sinhasaptarshi.github.io/escounts

Via

Access Paper or Ask Questions

Video Editing for Video Retrieval

Feb 04, 2024

Bin Zhu, Kevin Flanagan, Adriano Fragomeni, Michael Wray, Dima Damen

Figure 1 for Video Editing for Video Retrieval

Figure 2 for Video Editing for Video Retrieval

Figure 3 for Video Editing for Video Retrieval

Figure 4 for Video Editing for Video Retrieval

Abstract:Though pre-training vision-language models have demonstrated significant benefits in boosting video-text retrieval performance from large-scale web videos, fine-tuning still plays a critical role with manually annotated clips with start and end times, which requires considerable human effort. To address this issue, we explore an alternative cheaper source of annotations, single timestamps, for video-text retrieval. We initialise clips from timestamps in a heuristic way to warm up a retrieval model. Then a video clip editing method is proposed to refine the initial rough boundaries to improve retrieval performance. A student-teacher network is introduced for video clip editing. The teacher model is employed to edit the clips in the training set whereas the student model trains on the edited clips. The teacher weights are updated from the student's after the student's performance increases. Our method is model agnostic and applicable to any retrieval models. We conduct experiments based on three state-of-the-art retrieval models, COOT, VideoCLIP and CLIP4Clip. Experiments conducted on three video retrieval datasets, YouCook2, DiDeMo and ActivityNet-Captions show that our edited clips consistently improve retrieval performance over initial clips across all the three retrieval models.

Via

Access Paper or Ask Questions

Get a Grip: Reconstructing Hand-Object Stable Grasps in Egocentric Videos

Dec 25, 2023

Zhifan Zhu, Dima Damen

Abstract:We address in-the-wild hand-object reconstruction for a known object category in egocentric videos, focusing on temporal periods of stable grasps. We propose the task of Hand-Object Stable Grasp Reconstruction (HO-SGR), the joint reconstruction of frames during which the hand is stably holding the object. We thus can constrain the object motion relative to the hand, effectively regularising the reconstruction and improving performance. By analysing the 3D ARCTIC dataset, we identify temporal periods where the contact area between the hand and object vertices remain stable. We showcase that objects within stable grasps move within a single degree of freedom (1~DoF). We thus propose a method for jointly optimising all frames within a stable grasp by minimising the object's rotation to that within a latent 1 DoF. We then extend this knowledge to in-the-wild egocentric videos by labelling 2.4K clips of stable grasps from the EPIC-KITCHENS dataset. Our proposed EPIC-Grasps dataset includes 390 object instances of 9 categories, featuring stable grasps from videos of daily interactions in 141 environments. Our method achieves significantly better HO-SGR, both qualitatively and by computing the stable grasp area and 2D projection labels of mask overlaps.

* webpage: https://zhifanzhu.github.io/getagrip

Via

Access Paper or Ask Questions

Perception Test 2023: A Summary of the First Challenge And Outcome

Dec 20, 2023

Joseph Heyward, João Carreira, Dima Damen, Andrew Zisserman, Viorica Pătrăucean

Figure 1 for Perception Test 2023: A Summary of the First Challenge And Outcome

Figure 2 for Perception Test 2023: A Summary of the First Challenge And Outcome

Figure 3 for Perception Test 2023: A Summary of the First Challenge And Outcome

Figure 4 for Perception Test 2023: A Summary of the First Challenge And Outcome

Abstract:The First Perception Test challenge was held as a half-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2023, with the goal of benchmarking state-of-the-art video models on the recently proposed Perception Test benchmark. The challenge had six tracks covering low-level and high-level tasks, with both a language and non-language interface, across video, audio, and text modalities, and covering: object tracking, point tracking, temporal action localisation, temporal sound localisation, multiple-choice video question-answering, and grounded video question-answering. We summarise in this report the task descriptions, metrics, baselines, and results.

Via

Access Paper or Ask Questions

GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

Dec 12, 2023

Tomáš Souček, Dima Damen, Michael Wray, Ivan Laptev, Josef Sivic

Figure 1 for GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

Figure 2 for GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

Figure 3 for GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

Figure 4 for GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

Abstract:We address the task of generating temporally consistent and physically plausible images of actions and object state transformations. Given an input image and a text prompt describing the targeted transformation, our generated images preserve the environment and transform objects in the initial image. Our contributions are threefold. First, we leverage a large body of instructional videos and automatically mine a dataset of triplets of consecutive frames corresponding to initial object states, actions, and resulting object transformations. Second, equipped with this data, we develop and train a conditioned diffusion model dubbed GenHowTo. Third, we evaluate GenHowTo on a variety of objects and actions and show superior performance compared to existing methods. In particular, we introduce a quantitative evaluation where GenHowTo achieves 88% and 74% on seen and unseen interaction categories, respectively, outperforming prior work by a large margin.

Via

Access Paper or Ask Questions

Learning from One Continuous Video Stream

Dec 01, 2023

João Carreira, Michael King, Viorica Pătrăucean, Dilara Gokay, Cătălin Ionescu, Yi Yang, Daniel Zoran, Joseph Heyward, Carl Doersch, Yusuf Aytar(+2 more)

Figure 1 for Learning from One Continuous Video Stream

Figure 2 for Learning from One Continuous Video Stream

Figure 3 for Learning from One Continuous Video Stream

Figure 4 for Learning from One Continuous Video Stream

Abstract:We introduce a framework for online learning from a single continuous video stream -- the way people and animals learn, without mini-batches, data augmentation or shuffling. This poses great challenges given the high correlation between consecutive video frames and there is very little prior work on it. Our framework allows us to do a first deep dive into the topic and includes a collection of streams and tasks composed from two existing video datasets, plus methodology for performance evaluation that considers both adaptation and generalization. We employ pixel-to-pixel modelling as a practical and flexible way to switch between pre-training and single-stream evaluation as well as between arbitrary tasks, without ever requiring changes to models and always using the same pixel loss. Equipped with this framework we obtained large single-stream learning gains from pre-training with a novel family of future prediction tasks, found that momentum hurts, and that the pace of weight updates matters. The combination of these insights leads to matching the performance of IID learning with batch size 1, when using the same architecture and without costly replay buffers.

Via

Access Paper or Ask Questions