Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Giovanni Maria Farinella

FPV@IPLAB - Department of Mathematics and Computer Science - University of Catania - Italy, Cognitive Robotics and Social Sensing Laboratory - ICAR-CNR - Palermo - Italy, Next Vision s.r.l. - Catania - Italy

EASG-Bench: Video Q&A Benchmark with Egocentric Action Scene Graphs

Jun 06, 2025

Ivan Rodin, Tz-Ying Wu, Kyle Min, Sharath Nittur Sridhar, Antonino Furnari, Subarna Tripathi, Giovanni Maria Farinella

Abstract:We introduce EASG-Bench, a question-answering benchmark for egocentric videos where the question-answering pairs are created from spatio-temporally grounded dynamic scene graphs capturing intricate relationships among actors, actions, and objects. We propose a systematic evaluation framework and evaluate several language-only and video large language models (video-LLMs) on this benchmark. We observe a performance gap in language-only and video-LLMs, especially on questions focusing on temporal ordering, thus identifying a research gap in the area of long-context video understanding. To promote the reproducibility of our findings and facilitate further research, the benchmark and accompanying code are available at the following GitHub page: https://github.com/fpv-iplab/EASG-bench.

Via

Access Paper or Ask Questions

Task Graph Maximum Likelihood Estimation for Procedural Activity Understanding in Egocentric Videos

Feb 26, 2025

Luigi Seminara, Giovanni Maria Farinella, Antonino Furnari

Abstract:We introduce a gradient-based approach for learning task graphs from procedural activities, improving over hand-crafted methods. Our method directly optimizes edge weights via maximum likelihood, enabling integration into neural architectures. We validate our approach on CaptainCook4D, EgoPER, and EgoProceL, achieving +14.5%, +10.2%, and +13.6% F1-score improvements. Our feature-based approach for predicting task graphs from textual/video embeddings demonstrates emerging video understanding abilities. We also achieved top performance on the procedure understanding benchmark on Ego-Exo4D and significantly improved online mistake detection (+19.8% on Assembly101-O, +6.4% on EPIC-Tent-O). Code: https://github.com/fpv-iplab/Differentiable-Task-Graph-Learning.

* arXiv admin note: text overlap with arXiv:2406.01486

Via

Access Paper or Ask Questions

Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory

Nov 25, 2024

Zaira Manigrasso, Matteo Dunnhofer, Antonino Furnari, Moritz Nottebaum, Antonio Finocchiaro, Davide Marana, Giovanni Maria Farinella, Christian Micheloni

Figure 1 for Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory

Figure 2 for Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory

Figure 3 for Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory

Figure 4 for Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory

Abstract:Episodic memory retrieval aims to enable wearable devices with the ability to recollect from past video observations objects or events that have been observed (e.g., "where did I last see my smartphone?"). Despite the clear relevance of the task for a wide range of assistive systems, current task formulations are based on the "offline" assumption that the full video history can be accessed when the user makes a query, which is unrealistic in real settings, where wearable devices are limited in power and storage capacity. We introduce the novel task of Online Episodic Memory Visual Queries Localization (OEM-VQL), in which models are required to work in an online fashion, observing video frames only once and relying on past computations to answer user queries. To tackle this challenging task, we propose ESOM - Egocentric Streaming Object Memory, a novel framework based on an object discovery module to detect potentially interesting objects, a visual object tracker to track their position through the video in an online fashion, and a memory module to store spatio-temporal object coordinates and image representations, which can be queried efficiently at any moment. Comparisons with different baselines and offline methods show that OEM-VQL is challenging and ESOM is a viable approach to tackle the task, with results outperforming offline methods (81.92 vs 55.89 success rate %) when oracular object discovery and tracking are considered. Our analysis also sheds light on the limited performance of object detection and tracking in egocentric vision, providing a principled benchmark based on the OEM-VQL downstream task to assess progress in these areas.

Via

Access Paper or Ask Questions

LAGUNA: LAnguage Guided UNsupervised Adaptation with structured spaces

Nov 23, 2024

Anxhelo Diko, Antonino Furnari, Luigi Cinque, Giovanni Maria Farinella

Figure 1 for LAGUNA: LAnguage Guided UNsupervised Adaptation with structured spaces

Figure 2 for LAGUNA: LAnguage Guided UNsupervised Adaptation with structured spaces

Figure 3 for LAGUNA: LAnguage Guided UNsupervised Adaptation with structured spaces

Figure 4 for LAGUNA: LAnguage Guided UNsupervised Adaptation with structured spaces

Abstract:Unsupervised domain adaptation remains a critical challenge in enabling the knowledge transfer of models across unseen domains. Existing methods struggle to balance the need for domain-invariant representations with preserving domain-specific features, which is often due to alignment approaches that impose the projection of samples with similar semantics close in the latent space despite their drastic domain differences. We introduce \mnamelong, a novel approach that shifts the focus from aligning representations in absolute coordinates to aligning the relative positioning of equivalent concepts in latent spaces. \mname defines a domain-agnostic structure upon the semantic/geometric relationships between class labels in language space and guides adaptation, ensuring that the organization of samples in visual space reflects reference inter-class relationships while preserving domain-specific characteristics. %We empirically demonstrate \mname's superiority in domain adaptation tasks across four diverse images and video datasets. Remarkably, \mname surpasses previous works in 18 different adaptation scenarios across four diverse image and video datasets with average accuracy improvements of +3.32% on DomainNet, +5.75% in GeoPlaces, +4.77% on GeoImnet, and +1.94% mean class accuracy improvement on EgoExo4D.

Via

Access Paper or Ask Questions

TI-PREGO: Chain of Thought and In-Context Learning for Online Mistake Detection in PRocedural EGOcentric Videos

Nov 04, 2024

Leonardo Plini, Luca Scofano, Edoardo De Matteis, Guido Maria D'Amely di Melendugno, Alessandro Flaborea, Andrea Sanchietti, Giovanni Maria Farinella, Fabio Galasso, Antonino Furnari

Figure 1 for TI-PREGO: Chain of Thought and In-Context Learning for Online Mistake Detection in PRocedural EGOcentric Videos

Figure 2 for TI-PREGO: Chain of Thought and In-Context Learning for Online Mistake Detection in PRocedural EGOcentric Videos

Figure 3 for TI-PREGO: Chain of Thought and In-Context Learning for Online Mistake Detection in PRocedural EGOcentric Videos

Figure 4 for TI-PREGO: Chain of Thought and In-Context Learning for Online Mistake Detection in PRocedural EGOcentric Videos

Abstract:Identifying procedural errors online from egocentric videos is a critical yet challenging task across various domains, including manufacturing, healthcare, and skill-based training. The nature of such mistakes is inherently open-set, as unforeseen or novel errors may occur, necessitating robust detection systems that do not rely on prior examples of failure. Currently, however, no technique effectively detects open-set procedural mistakes online. We propose a dual branch architecture to address this problem in an online fashion: one branch continuously performs step recognition from the input egocentric video, while the other anticipates future steps based on the recognition module's output. Mistakes are detected as mismatches between the currently recognized action and the action predicted by the anticipation module. The recognition branch takes input frames, predicts the current action, and aggregates frame-level results into action tokens. The anticipation branch, specifically, leverages the solid pattern-matching capabilities of Large Language Models (LLMs) to predict action tokens based on previously predicted ones. Given the online nature of the task, we also thoroughly benchmark the difficulties associated with per-frame evaluations, particularly the need for accurate and timely predictions in dynamic online scenarios. Extensive experiments on two procedural datasets demonstrate the challenges and opportunities of leveraging a dual-branch architecture for mistake detection, showcasing the effectiveness of our proposed approach. In a thorough evaluation including recognition and anticipation variants and state-of-the-art models, our method reveals its robustness and effectiveness in online applications.

Via

Access Paper or Ask Questions

ZARRIO @ Ego4D Short Term Object Interaction Anticipation Challenge: Leveraging Affordances and Attention-based models for STA

Jul 05, 2024

Lorenzo Mur-Labadia, Ruben Martinez-Cantin, Josechu Guerrero-Campo, Giovanni Maria Farinella

Abstract:Short-Term object-interaction Anticipation (STA) consists of detecting the location of the next-active objects, the noun and verb categories of the interaction, and the time to contact from the observation of egocentric video. We propose STAformer, a novel attention-based architecture integrating frame-guided temporal pooling, dual image-video attention, and multi-scale feature fusion to support STA predictions from an image-input video pair. Moreover, we introduce two novel modules to ground STA predictions on human behavior by modeling affordances. First, we integrate an environment affordance model which acts as a persistent memory of interactions that can take place in a given physical scene. Second, we predict interaction hotspots from the observation of hands and object trajectories, increasing confidence in STA predictions localized around the hotspot. On the test set, our results obtain a final 33.5 N mAP, 17.25 N+V mAP, 11.77 N+{\delta} mAP and 6.75 Overall top-5 mAP metric when trained on the v2 training dataset.

* arXiv admin note: substantial text overlap with arXiv:2406.01194

Via

Access Paper or Ask Questions

Eyes Wide Unshut: Unsupervised Mistake Detection in Egocentric Video by Detecting Unpredictable Gaze

Jun 12, 2024

Michele Mazzamuto, Antonino Furnari, Giovanni Maria Farinella

Figure 1 for Eyes Wide Unshut: Unsupervised Mistake Detection in Egocentric Video by Detecting Unpredictable Gaze

Figure 2 for Eyes Wide Unshut: Unsupervised Mistake Detection in Egocentric Video by Detecting Unpredictable Gaze

Figure 3 for Eyes Wide Unshut: Unsupervised Mistake Detection in Egocentric Video by Detecting Unpredictable Gaze

Figure 4 for Eyes Wide Unshut: Unsupervised Mistake Detection in Egocentric Video by Detecting Unpredictable Gaze

Abstract:In this paper, we address the challenge of unsupervised mistake detection in egocentric video through the analysis of gaze signals, a critical component for advancing user assistance in smart glasses. Traditional supervised methods, reliant on manually labeled mistakes, suffer from domain-dependence and scalability issues. This research introduces an unsupervised method for detecting mistakes in videos of human activities, overcoming the challenges of domain-specific requirements and the necessity for annotated data. By analyzing unusual gaze patterns that signal user disorientation during tasks, we propose a gaze completion model that forecasts eye gaze trajectories from incomplete inputs. The difference between the anticipated and observed gaze paths acts as an indicator for identifying errors. Our method is validated on the EPIC-Tent dataset, showing its superiority compared to current one-class supervised and unsupervised techniques.

Via

Access Paper or Ask Questions

AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation

Jun 03, 2024

Lorenzo Mur-Labadia, Ruben Martinez-Cantin, Josechu Guerrero, Giovanni Maria Farinella, Antonino Furnari

Figure 1 for AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation

Figure 2 for AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation

Figure 3 for AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation

Figure 4 for AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation

Abstract:Short-Term object-interaction Anticipation consists of detecting the location of the next-active objects, the noun and verb categories of the interaction, and the time to contact from the observation of egocentric video. This ability is fundamental for wearable assistants or human robot interaction to understand the user goals, but there is still room for improvement to perform STA in a precise and reliable way. In this work, we improve the performance of STA predictions with two contributions: 1. We propose STAformer, a novel attention-based architecture integrating frame guided temporal pooling, dual image-video attention, and multiscale feature fusion to support STA predictions from an image-input video pair. 2. We introduce two novel modules to ground STA predictions on human behavior by modeling affordances.First, we integrate an environment affordance model which acts as a persistent memory of interactions that can take place in a given physical scene. Second, we predict interaction hotspots from the observation of hands and object trajectories, increasing confidence in STA predictions localized around the hotspot. Our results show significant relative Overall Top-5 mAP improvements of up to +45% on Ego4D and +42% on a novel set of curated EPIC-Kitchens STA labels. We will release the code, annotations, and pre extracted affordances on Ego4D and EPIC- Kitchens to encourage future research in this area.

Via

Access Paper or Ask Questions

Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos

Jun 03, 2024

Luigi Seminara, Giovanni Maria Farinella, Antonino Furnari

Figure 1 for Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos

Figure 2 for Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos

Figure 3 for Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos

Figure 4 for Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos

Abstract:Procedural activities are sequences of key-steps aimed at achieving specific goals. They are crucial to build intelligent agents able to assist users effectively. In this context, task graphs have emerged as a human-understandable representation of procedural activities, encoding a partial ordering over the key-steps. While previous works generally relied on hand-crafted procedures to extract task graphs from videos, in this paper, we propose an approach based on direct maximum likelihood optimization of edges' weights, which allows gradient-based learning of task graphs and can be naturally plugged into neural network architectures. Experiments on the CaptainCook4D dataset demonstrate the ability of our approach to predict accurate task graphs from the observation of action sequences, with an improvement of +16.7% over previous approaches. Owing to the differentiability of the proposed framework, we also introduce a feature-based approach, aiming to predict task graphs from key-step textual or video embeddings, for which we observe emerging video understanding abilities. Task graphs learned with our approach are also shown to significantly enhance online mistake detection in procedural egocentric videos, achieving notable gains of +19.8% and +7.5% on the Assembly101 and EPIC-Tent datasets. Code for replicating experiments is available at https://github.com/fpv-iplab/Differentiable-Task-Graph-Learning.

Via

Access Paper or Ask Questions

PhilHumans: Benchmarking Machine Learning for Personal Health

May 04, 2024

Vadim Liventsev, Vivek Kumar, Allmin Pradhap Singh Susaiyah, Zixiu Wu, Ivan Rodin, Asfand Yaar, Simone Baloccu, Marharyta Beraziuk, Sebastiano Battiato, Giovanni Maria Farinella(+7 more)

Figure 1 for PhilHumans: Benchmarking Machine Learning for Personal Health

Figure 2 for PhilHumans: Benchmarking Machine Learning for Personal Health

Figure 3 for PhilHumans: Benchmarking Machine Learning for Personal Health

Figure 4 for PhilHumans: Benchmarking Machine Learning for Personal Health

Abstract:The use of machine learning in Healthcare has the potential to improve patient outcomes as well as broaden the reach and affordability of Healthcare. The history of other application areas indicates that strong benchmarks are essential for the development of intelligent systems. We present Personal Health Interfaces Leveraging HUman-MAchine Natural interactions (PhilHumans), a holistic suite of benchmarks for machine learning across different Healthcare settings - talk therapy, diet coaching, emergency care, intensive care, obstetric sonography - as well as different learning settings, such as action anticipation, timeseries modeling, insight mining, language modeling, computer vision, reinforcement learning and program synthesis

Via

Access Paper or Ask Questions