Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Siddhant Bansal

HD-EPIC: A Highly-Detailed Egocentric Video Dataset

Feb 06, 2025

Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan(+9 more)

Abstract:We present a validation dataset of newly-collected kitchen-based egocentric videos, manually annotated with highly detailed and interconnected ground-truth labels covering: recipe steps, fine-grained actions, ingredients with nutritional values, moving objects, and audio annotations. Importantly, all annotations are grounded in 3D through digital twinning of the scene, fixtures, object locations, and primed with gaze. Footage is collected from unscripted recordings in diverse home environments, making HDEPIC the first dataset collected in-the-wild but with detailed annotations matching those in controlled lab environments. We show the potential of our highly-detailed annotations through a challenging VQA benchmark of 26K questions assessing the capability to recognise recipes, ingredients, nutrition, fine-grained actions, 3D perception, object motion, and gaze direction. The powerful long-context Gemini Pro only achieves 38.5% on this benchmark, showcasing its difficulty and highlighting shortcomings in current VLMs. We additionally assess action recognition, sound recognition, and long-term video-object segmentation on HD-EPIC. HD-EPIC is 41 hours of video in 9 kitchens with digital twins of 413 kitchen fixtures, capturing 69 recipes, 59K fine-grained actions, 51K audio events, 20K object movements and 37K object masks lifted to 3D. On average, we have 263 annotations per minute of our unscripted videos.

* 29 pages. Project Webpage and Dataset: http://hd-epic.github.io

Via

Access Paper or Ask Questions

HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision

Apr 15, 2024

Siddhant Bansal, Michael Wray, Dima Damen

Figure 1 for HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision

Figure 2 for HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision

Figure 3 for HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision

Figure 4 for HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision

Abstract:Large Vision Language Models (VLMs) are now the de facto state-of-the-art for a number of tasks including visual question answering, recognising objects, and spatial referral. In this work, we propose the HOI-Ref task for egocentric images that aims to understand interactions between hands and objects using VLMs. To enable HOI-Ref, we curate the HOI-QA dataset that consists of 3.9M question-answer pairs for training and evaluating VLMs. HOI-QA includes questions relating to locating hands, objects, and critically their interactions (e.g. referring to the object being manipulated by the hand). We train the first VLM for HOI-Ref on this dataset and call it VLM4HOI. Our results demonstrate that VLMs trained for referral on third person images fail to recognise and refer hands and objects in egocentric images. When fine-tuned on our egocentric HOI-QA dataset, performance improves by 27.9% for referring hands and objects, and by 26.7% for referring interactions.

* Project Page: https://sid2697.github.io/hoi-ref/

Via

Access Paper or Ask Questions

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Nov 30, 2023

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote(+91 more)

Figure 1 for Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Figure 2 for Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Figure 3 for Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Figure 4 for Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Abstract:We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). More than 800 participants from 13 cities worldwide performed these activities in 131 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,422 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions -- including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources will be open sourced to fuel new research in the community.

Via

Access Paper or Ask Questions

United We Stand, Divided We Fall: UnityGraph for Unsupervised Procedure Learning from Videos

Nov 06, 2023

Siddhant Bansal, Chetan Arora, C. V. Jawahar

Figure 1 for United We Stand, Divided We Fall: UnityGraph for Unsupervised Procedure Learning from Videos

Figure 2 for United We Stand, Divided We Fall: UnityGraph for Unsupervised Procedure Learning from Videos

Figure 3 for United We Stand, Divided We Fall: UnityGraph for Unsupervised Procedure Learning from Videos

Figure 4 for United We Stand, Divided We Fall: UnityGraph for Unsupervised Procedure Learning from Videos

Abstract:Given multiple videos of the same task, procedure learning addresses identifying the key-steps and determining their order to perform the task. For this purpose, existing approaches use the signal generated from a pair of videos. This makes key-steps discovery challenging as the algorithms lack inter-videos perspective. Instead, we propose an unsupervised Graph-based Procedure Learning (GPL) framework. GPL consists of the novel UnityGraph that represents all the videos of a task as a graph to obtain both intra-video and inter-videos context. Further, to obtain similar embeddings for the same key-steps, the embeddings of UnityGraph are updated in an unsupervised manner using the Node2Vec algorithm. Finally, to identify the key-steps, we cluster the embeddings using KMeans. We test GPL on benchmark ProceL, CrossTask, and EgoProceL datasets and achieve an average improvement of 2% on third-person datasets and 3.6% on EgoProceL over the state-of-the-art.

* 13 pages, 6 figures, Accepted in Winter Conference on Applications of Computer Vision (WACV), 2024

Via

Access Paper or Ask Questions

An Outlook into the Future of Egocentric Vision

Aug 14, 2023

Chiara Plizzari, Gabriele Goletto, Antonino Furnari, Siddhant Bansal, Francesco Ragusa, Giovanni Maria Farinella, Dima Damen, Tatiana Tommasi

Figure 1 for An Outlook into the Future of Egocentric Vision

Figure 2 for An Outlook into the Future of Egocentric Vision

Figure 3 for An Outlook into the Future of Egocentric Vision

Figure 4 for An Outlook into the Future of Egocentric Vision

Abstract:What will the future be? We wonder! In this survey, we explore the gap between current research in egocentric vision and the ever-anticipated future, where wearable computing, with outward facing cameras and digital overlays, is expected to be integrated in our every day lives. To understand this gap, the article starts by envisaging the future through character-based stories, showcasing through examples the limitations of current technology. We then provide a mapping between this future and previously defined research tasks. For each task, we survey its seminal works, current state-of-the-art methodologies and available datasets, then reflect on shortcomings that limit its applicability to future research. Note that this survey focuses on software models for egocentric vision, independent of any specific hardware. The paper concludes with recommendations for areas of immediate explorations so as to unlock our path to the future always-on, personalised and life-enhancing egocentric vision.

* We invite comments, suggestions and corrections here: https://openreview.net/forum?id=V3974SUk1w

Via

Access Paper or Ask Questions

My View is the Best View: Procedure Learning from Egocentric Videos

Jul 22, 2022

Siddhant Bansal, Chetan Arora, C. V. Jawahar

Figure 1 for My View is the Best View: Procedure Learning from Egocentric Videos

Figure 2 for My View is the Best View: Procedure Learning from Egocentric Videos

Figure 3 for My View is the Best View: Procedure Learning from Egocentric Videos

Figure 4 for My View is the Best View: Procedure Learning from Egocentric Videos

Abstract:Procedure learning involves identifying the key-steps and determining their logical order to perform a task. Existing approaches commonly use third-person videos for learning the procedure, making the manipulated object small in appearance and often occluded by the actor, leading to significant errors. In contrast, we observe that videos obtained from first-person (egocentric) wearable cameras provide an unobstructed and clear view of the action. However, procedure learning from egocentric videos is challenging because (a) the camera view undergoes extreme changes due to the wearer's head motion, and (b) the presence of unrelated frames due to the unconstrained nature of the videos. Due to this, current state-of-the-art methods' assumptions that the actions occur at approximately the same time and are of the same duration, do not hold. Instead, we propose to use the signal provided by the temporal correspondences between key-steps across videos. To this end, we present a novel self-supervised Correspond and Cut (CnC) framework for procedure learning. CnC identifies and utilizes the temporal correspondences between the key-steps across multiple videos to learn the procedure. Our experiments show that CnC outperforms the state-of-the-art on the benchmark ProceL and CrossTask datasets by 5.2% and 6.3%, respectively. Furthermore, for procedure learning using egocentric videos, we propose the EgoProceL dataset consisting of 62 hours of videos captured by 130 subjects performing 16 tasks. The source code and the dataset are available on the project page https://sid2697.github.io/egoprocel/.

* 25 pages, 6 figures, Accepted in European Conference on Computer Vision (ECCV) 2022

Via

Access Paper or Ask Questions

Ego4D: Around the World in 3,000 Hours of Egocentric Video

Oct 13, 2021

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu(+74 more)

Figure 1 for Ego4D: Around the World in 3,000 Hours of Egocentric Video

Figure 2 for Ego4D: Around the World in 3,000 Hours of Egocentric Video

Figure 3 for Ego4D: Around the World in 3,000 Hours of Egocentric Video

Figure 4 for Ego4D: Around the World in 3,000 Hours of Egocentric Video

Abstract:We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,025 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 855 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/

Via

Access Paper or Ask Questions

Improving Word Recognition using Multiple Hypotheses and Deep Embeddings

Oct 27, 2020

Siddhant Bansal, Praveen Krishnan, C. V. Jawahar

Figure 1 for Improving Word Recognition using Multiple Hypotheses and Deep Embeddings

Figure 2 for Improving Word Recognition using Multiple Hypotheses and Deep Embeddings

Figure 3 for Improving Word Recognition using Multiple Hypotheses and Deep Embeddings

Figure 4 for Improving Word Recognition using Multiple Hypotheses and Deep Embeddings

Abstract:We propose a novel scheme for improving the word recognition accuracy using word image embeddings. We use a trained text recognizer, which can predict multiple text hypothesis for a given word image. Our fusion scheme improves the recognition process by utilizing the word image and text embeddings obtained from a trained word image embedding network. We propose EmbedNet, which is trained using a triplet loss for learning a suitable embedding space where the embedding of the word image lies closer to the embedding of the corresponding text transcription. The updated embedding space thus helps in choosing the correct prediction with higher confidence. To further improve the accuracy, we propose a plug-and-play module called Confidence based Accuracy Booster (CAB). The CAB module takes in the confidence scores obtained from the text recognizer and Euclidean distances between the embeddings to generate an updated distance vector. The updated distance vector has lower distance values for the correct words and higher distance values for the incorrect words. We rigorously evaluate our proposed method systematically on a collection of books in the Hindi language. Our method achieves an absolute improvement of around 10 percent in terms of word recognition accuracy.

* 8 pages, 6 figures, Accepted in International Conference on Pattern Recognition (ICPR) 2020

Via

Access Paper or Ask Questions

Fused Text Recogniser and Deep Embeddings Improve Word Recognition and Retrieval

Jul 01, 2020

Siddhant Bansal, Praveen Krishnan, C. V. Jawahar

Figure 1 for Fused Text Recogniser and Deep Embeddings Improve Word Recognition and Retrieval

Figure 2 for Fused Text Recogniser and Deep Embeddings Improve Word Recognition and Retrieval

Figure 3 for Fused Text Recogniser and Deep Embeddings Improve Word Recognition and Retrieval

Figure 4 for Fused Text Recogniser and Deep Embeddings Improve Word Recognition and Retrieval

Abstract:Recognition and retrieval of textual content from the large document collections have been a powerful use case for the document image analysis community. Often the word is the basic unit for recognition as well as retrieval. Systems that rely only on the text recogniser (OCR) output are not robust enough in many situations, especially when the word recognition rates are poor, as in the case of historic documents or digital libraries. An alternative has been word spotting based methods that retrieve/match words based on a holistic representation of the word. In this paper, we fuse the noisy output of text recogniser with a deep embeddings representation derived out of the entire word. We use average and max fusion for improving the ranked results in the case of retrieval. We validate our methods on a collection of Hindi documents. We improve word recognition rate by 1.4 and retrieval by 11.13 in the mAP.

* 15 pages, 8 figures, Accepted in IAPR International Workshop on Document Analysis Systems (DAS) 2020, "Visit project page, at http://cvit.iiit.ac.in/research/projects/cvit-projects/fused-text-recogniser-and-deep-embeddings-improve-word-recognition-and-retrieval"

Via

Access Paper or Ask Questions

AGDC: Automatic Garbage Detection and Collection

Aug 16, 2019

Siddhant Bansal, Seema Patel, Ishita Shah, Prof. Alpesh Patel, Prof. Jagruti Makwana, Dr. Rajesh Thakker

Figure 1 for AGDC: Automatic Garbage Detection and Collection

Figure 2 for AGDC: Automatic Garbage Detection and Collection

Figure 3 for AGDC: Automatic Garbage Detection and Collection

Figure 4 for AGDC: Automatic Garbage Detection and Collection

Abstract:Waste management is one of the significant problems throughout the world. Contemporaneous methods find it difficult to manage the volume of solid waste generated by the growing urban population. In this paper, we propose a system which is very hygienic and cheap that uses Artificial Intelligence algorithms for detection of the garbage. Once the garbage is detected the system calculates the position of the garbage by the use of the camera only. The proposed system is capable of distinguishing between valuables and garbage with more than 95% confidence in real-time. Finally, a robotic arm controlled by the microcontroller is used to pick up the garbage and places it in the bin. Concluding, the paper explains a system that is capable of working as a human in terms of inspecting and collecting the garbage. The system is able to achieve 3-4 frames per second on the Raspberry Pi, capable of detecting the garbage in real-time with 90%+ confidence.

Via

Access Paper or Ask Questions