Kiran Somasundaram

Reading Recognition in the Wild

May 30, 2025

Charig Yang, Samiul Alam, Shakhrul Iman Siam, Michael J. Proulx, Lambert Mathias, Kiran Somasundaram, Luis Pesqueira, James Fort, Sheroze Sheriffdeen, Omkar Parkhi (+5 more)

Abstract: To enable egocentric contextual AI in always-on smart glasses, it is crucial to be able to keep a record of the user's interactions with the world, including during reading. In this paper, we introduce a new task of reading recognition to determine when the user is reading. We first introduce the first-of-its-kind large-scale multimodal Reading in the Wild dataset, containing 100 hours of reading and non-reading videos in diverse and realistic scenarios. We then identify three modalities (egocentric RGB, eye gaze, head pose) that can be used to solve the task, and present a flexible transformer model that performs the task using these modalities, either individually or combined. We show that these modalities are relevant and complementary to the task, and investigate how to efficiently and effectively encode each modality. Additionally, we show the usefulness of this dataset towards classifying types of reading, extending current reading understanding studies conducted in constrained settings to larger scale, diversity and realism. Code, model, and data will be public.

Aria Everyday Activities Dataset

Feb 22, 2024

Zhaoyang Lv, Nicholas Charron, Pierre Moulon, Alexander Gamino, Cheng Peng, Chris Sweeney, Edward Miller, Huixuan Tang, Jeff Meissner, Jing Dong (+14 more)

Abstract: We present the Aria Everyday Activities (AEA) Dataset, an egocentric multimodal open dataset recorded using Project Aria glasses. AEA contains 143 daily activity sequences recorded by multiple wearers in five geographically diverse indoor locations. Each recording contains multimodal sensor data recorded through the Project Aria glasses. In addition, AEA provides machine perception data including high-frequency globally aligned 3D trajectories, scene point cloud, per-frame 3D eye gaze vector and time-aligned speech transcription. In this paper, we demonstrate a few exemplar research applications enabled by this dataset, including neural scene reconstruction and prompted segmentation. AEA is an open source dataset that can be downloaded from https://www.projectaria.com/datasets/aea/. We are also providing open-source implementations and examples of how to use the dataset in Project Aria Tools: https://github.com/facebookresearch/projectaria_tools.

* Dataset website: https://www.projectaria.com/datasets/aea/

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Nov 30, 2023

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote (+91 more)

Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). More than 800 participants from 13 cities worldwide performed these activities in 131 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,422 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions -- including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources will be open sourced to fuel new research in the community.

FedAIoT: A Federated Learning Benchmark for Artificial Intelligence of Things

Sep 29, 2023

Samiul Alam, Tuo Zhang, Tiantian Feng, Hui Shen, Zhichao Cao, Dong Zhao, JeongGil Ko, Kiran Somasundaram, Shrikanth S. Narayanan, Salman Avestimehr (+1 more)

Abstract: Federated learning (FL) is highly relevant to the Artificial Intelligence of Things (AIoT). However, most existing FL works are not conducted on datasets collected from authentic IoT devices that capture unique modalities and inherent challenges of IoT data. In this work, we introduce FedAIoT, an FL benchmark for AIoT to fill this critical gap. FedAIoT includes eight datasets collected from a wide range of IoT devices. These datasets cover unique IoT modalities and target representative applications of AIoT. FedAIoT also includes a unified end-to-end FL framework for AIoT that simplifies benchmarking the performance of the datasets. Our benchmark results shed light on the opportunities and challenges of FL for AIoT. We hope FedAIoT could serve as an invaluable resource to foster advancements in the important field of FL for AIoT. The repository of FedAIoT is maintained at https://github.com/AIoT-MLSys-Lab/FedAIoT.

Project Aria: A New Tool for Egocentric Multi-Modal AI Research

Sep 12, 2023

Kiran Somasundaram, Jing Dong, Huixuan Tang, Julian Straub, Mingfei Yan, Michael Goesele, Jakob Julian Engel, Renzo De Nardi, Richard Newcombe

Abstract: Egocentric, multi-modal data as available on future augmented reality (AR) devices provides unique challenges and opportunities for machine perception. These future devices will need to be all-day wearable in a socially acceptable form-factor to support always available, context-aware and personalized AI applications. Our team at Meta Reality Labs Research built the Aria device, an egocentric, multi-modal data recording and streaming device with the goal of fostering and accelerating research in this area. In this paper, we describe the Aria device hardware including its sensor configuration and the corresponding software tools that enable recording and processing of such data.

Ego4D: Around the World in 3,000 Hours of Egocentric Video

Oct 13, 2021

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu (+74 more)

Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,025 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 855 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/

Egocentric Activity Recognition and Localization on a 3D Map

May 27, 2021

Miao Liu, Lingni Ma, Kiran Somasundaram, Yin Li, Kristen Grauman, James M. Rehg, Chao Li

Abstract: Given a video captured from a first-person perspective and recorded in a familiar environment, can we recognize what the person is doing and identify where the action occurs in the 3D space? We address this challenging problem of jointly recognizing and localizing actions of a mobile user on a known 3D map from egocentric videos. To this end, we propose a novel deep probabilistic model. Our model takes the inputs of a Hierarchical Volumetric Representation (HVR) of the environment and an egocentric video, infers the 3D action location as a latent variable, and recognizes the action based on the video and contextual cues surrounding its potential locations. To evaluate our model, we conduct extensive experiments on a newly collected egocentric video dataset, in which both human naturalistic actions and photo-realistic 3D environment reconstructions are captured. Our method demonstrates strong results on both action recognition and 3D action localization across seen and unseen environments. We believe our work points to an exciting research direction in the intersection of egocentric vision and 3D scene understanding.

An End-to-End System for Crowdsourced 3d Maps for Autonomous Vehicles: The Mapping Component

Mar 31, 2017

Onkar Dabeer, Radhika Gowaikar, Slawomir K. Grzechnik, Mythreya J. Lakshman, Gerhard Reitmayr, Kiran Somasundaram, Ravi Teja Sukhavasi, Xinzhou Wu

Abstract: Autonomous vehicles rely on precise high-definition (HD) 3d maps for navigation. This paper presents the mapping component of an end-to-end system for crowdsourcing precise 3d maps with semantically meaningful landmarks such as traffic signs (6 dof pose, shape and size) and traffic lanes (3d splines). The system uses consumer-grade parts, and in particular, relies on a single front-facing camera and a consumer-grade GPS. Using real-time sign and lane triangulation on-device in the vehicle, with offline sign/lane clustering across multiple journeys and offline Bundle Adjustment across multiple journeys in the backend, we construct maps with mean absolute accuracy at sign corners of less than 20 cm from 25 journeys. To the best of our knowledge, this is the first end-to-end HD mapping pipeline in global coordinates in the automotive context using cost-effective sensors.
