Ingo Bax

AirLetters: An Open Video Dataset of Characters Drawn in the Air

Oct 03, 2024

Rishit Dagli, Guillaume Berger, Joanna Materzynska, Ingo Bax, Roland Memisevic

Abstract: We introduce AirLetters, a new video dataset consisting of real-world videos of human-generated, articulated motions. Specifically, our dataset requires a vision model to predict letters that humans draw in the air. Unlike in existing video datasets, accurate classification on AirLetters relies critically on discerning motion patterns and on integrating long-range information in the video over time. An extensive evaluation of state-of-the-art image and video understanding models on AirLetters shows that these methods perform poorly and fall far behind a human baseline. Our work shows that, despite recent progress in end-to-end video understanding, accurately representing complex articulated motions, a task that is trivial for humans, remains an open problem for end-to-end learning.

* ECCV'24, HANDS workshop

Live Fitness Coaching as a Testbed for Situated Interaction

Jul 11, 2024

Sunny Panchal, Apratim Bhattacharyya, Guillaume Berger, Antoine Mercier, Cornelius Bohm, Florian Dietrichkeit, Reza Pourreza, Xuanlin Li, Pulkit Madan, Mingu Lee (+3 more)

Abstract: Tasks at the intersection of vision and language have had a profound impact in advancing the capabilities of vision-language models such as dialog-based assistants. However, models trained on existing tasks are largely limited to turn-based interactions, where each turn must be stepped (i.e., prompted) by the user. Open-ended, asynchronous interactions, where an AI model may proactively deliver timely responses or feedback based on the unfolding situation in real time, are an open challenge. In this work, we present the QEVD benchmark and dataset, which explore human-AI interaction in the challenging, yet controlled, real-world domain of fitness coaching, a task that intrinsically requires monitoring live user activity and providing timely feedback. It is the first benchmark that requires assistive vision-language models to recognize complex human actions, identify mistakes grounded in those actions, and provide appropriate feedback. Our experiments reveal the limitations of existing state-of-the-art vision-language models for such asynchronous situated interactions. Motivated by this, we propose a simple end-to-end streaming baseline that can respond asynchronously to human actions with appropriate feedback at the appropriate time.

* The benchmark and dataset are available here: https://developer.qualcomm.com/software/ai-datasets/qevd

Is end-to-end learning enough for fitness activity recognition?

May 14, 2023

Antoine Mercier, Guillaume Berger, Sunny Panchal, Florian Letsch, Cornelius Boehm, Nahua Kang, Ingo Bax, Roland Memisevic

Abstract: End-to-end learning has taken hold of many computer vision tasks, particularly those involving still images, with task-specific optimization yielding very strong performance. Nevertheless, human-centric action recognition is still largely dominated by hand-crafted pipelines, and only individual components are replaced by neural networks that typically operate on individual frames. As a testbed to study the relevance of such pipelines, we present a new fully annotated video dataset of fitness activities. Any recognition capabilities in this domain are almost exclusively a function of human poses and their temporal dynamics, so pose-based solutions should perform well. We show that, with this labelled data, end-to-end learning on raw pixels can compete with state-of-the-art action recognition pipelines based on pose estimation. We also show that end-to-end learning can support temporally fine-grained tasks such as real-time repetition counting.

* 9 pages, 4 figures, 4 tables

The "something something" video database for learning and evaluating visual common sense

Jun 15, 2017

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzyńska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag (+4 more)

Abstract: Neural networks trained on datasets such as ImageNet have led to major advances in visual object classification. One obstacle that prevents networks from reasoning more deeply about complex scenes and situations, and from integrating visual knowledge with natural language as humans do, is their lack of common sense knowledge about the physical world. Videos, unlike still images, contain a wealth of detailed information about the physical world. However, most labelled video datasets represent high-level concepts rather than detailed physical aspects of actions and scenes. In this work, we describe our ongoing collection of the "something-something" database of video prediction tasks whose solutions require a common sense understanding of the depicted situation. The database currently contains more than 100,000 videos across 174 classes, which are defined as caption-templates. We also describe the challenges in crowd-sourcing this data at scale.
