Sarah Ostadabbas

AdSum: Two-stream Audio-visual Summarization for Automated Video Advertisement Clipping

Oct 30, 2025

Wen Xie, Yanjun Zhu, Gijs Overgoor, Yakov Bart, Agata Lapedriza Garcia, Sarah Ostadabbas

Abstract: Advertisers commonly need multiple versions of the same advertisement (ad) at varying durations for a single campaign. The traditional approach involves manually selecting and re-editing shots from longer video ads to create shorter versions, which is labor-intensive and time-consuming. In this paper, we introduce a framework for automated video ad clipping using video summarization techniques. We are the first to frame video clipping as a shot selection problem, tailored specifically for advertising. Unlike existing general video summarization methods that primarily focus on visual content, our approach emphasizes the critical role of audio in advertising. To achieve this, we develop a two-stream audio-visual fusion model that predicts the importance of video frames, where importance is defined as the likelihood of a frame being selected in the firm-produced short ad. To address the lack of ad-specific datasets, we present AdSum204, a novel dataset comprising 102 pairs of 30-second and 15-second ads from real advertising campaigns. Extensive experiments demonstrate that our model outperforms state-of-the-art methods across various metrics, including Average Precision, Area Under Curve, Spearman, and Kendall.

* Accepted at 32nd International Conference on MultiMedia Modeling
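
The abstract frames clipping as shot selection driven by per-frame importance scores, but it does not say how shots are chosen once the scores exist. The sketch below illustrates one option that is common in video summarization: score each shot by its mean frame importance and solve a 0/1 knapsack under a duration budget. The function name `select_shots`, the mean-score rule, and the knapsack step are assumptions for illustration, not the authors' published procedure.

```python
from typing import List, Tuple


def select_shots(frame_scores: List[float],
                 shot_bounds: List[Tuple[int, int]],
                 budget_frames: int) -> List[int]:
    """Return indices of shots that maximize total shot score within a budget.

    shot_bounds holds (start, end) frame indices with end exclusive.
    A shot's score is the mean predicted importance of its frames
    (an assumption; other aggregation rules are possible).
    """
    lengths = [end - start for start, end in shot_bounds]
    values = [sum(frame_scores[s:e]) / (e - s) for s, e in shot_bounds]
    n = len(shot_bounds)

    # best[i][b]: best total value using the first i shots with b frames available
    best = [[0.0] * (budget_frames + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget_frames + 1):
            best[i][b] = best[i - 1][b]
            if lengths[i - 1] <= b:
                candidate = best[i - 1][b - lengths[i - 1]] + values[i - 1]
                if candidate > best[i][b]:
                    best[i][b] = candidate

    # Walk the table backwards to recover which shots were taken.
    chosen = []
    b = budget_frames
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= lengths[i - 1]
    return sorted(chosen)


# Toy example: four shots, budget of six frames.
scores = [0.9] * 4 + [0.1] * 4 + [0.8] * 2 + [0.2] * 2
shots = [(0, 4), (4, 8), (8, 10), (10, 12)]
print(select_shots(scores, shots, budget_frames=6))
```

Returning the indices in temporal order keeps the selected shots in their original sequence, which matters for an ad that must still tell a coherent story.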

Learning Multimodal AI Algorithms for Amplifying Limited User Input into High-dimensional Control Space

May 16, 2025

Ali Rabiee, Sima Ghafoori, MH Farhadi, Robert Beyer, Xiangyu Bai, David J Lin, Sarah Ostadabbas, Reza Abiri

Abstract: Current invasive assistive technologies are designed to infer high-dimensional motor control signals from severely paralyzed patients. However, they face significant challenges, including public acceptance, limited longevity, and barriers to commercialization. Meanwhile, noninvasive alternatives often rely on artifact-prone signals, require lengthy user training, and struggle to deliver robust high-dimensional control for dexterous tasks. To address these issues, this study introduces a novel human-centered multimodal AI approach that acts as an intelligent compensatory mechanism for lost motor functions and could potentially enable patients with severe paralysis to control high-dimensional assistive devices, such as dexterous robotic arms, using limited and noninvasive inputs. In contrast to the current state-of-the-art (SoTA) noninvasive approaches, our context-aware, multimodal shared-autonomy framework integrates deep reinforcement learning algorithms to blend limited low-dimensional user input with real-time environmental perception, enabling adaptive, dynamic, and intelligent interpretation of human intent for complex dexterous manipulation tasks, such as pick-and-place. Results from ARAS (Adaptive Reinforcement learning for Amplification of limited inputs in Shared autonomy), trained with synthetic users over 50,000 computer simulation episodes, demonstrated the first successful implementation of the proposed closed-loop human-in-the-loop paradigm, outperforming the SoTA shared autonomy algorithms. Following a zero-shot sim-to-real transfer, ARAS was evaluated on 23 human subjects, demonstrating high accuracy in dynamic intent detection and smooth, stable 3D trajectory control for dexterous pick-and-place tasks. The ARAS user study achieved a high task success rate of 92.88%, with short completion times comparable to those of SoTA invasive assistive technologies.

STREAMS: An Assistive Multimodal AI Framework for Empowering Biosignal Based Robotic Controls

Oct 04, 2024

Ali Rabiee, Sima Ghafoori, Xiangyu Bai, Sarah Ostadabbas, Reza Abiri

Abstract: End-effector based assistive robots face persistent challenges in generating smooth and robust trajectories when controlled by noisy and unreliable human biosignals such as muscle activities and brainwaves. The produced endpoint trajectories are often too jerky and imprecise to perform complex tasks such as stable robotic grasping. We propose STREAMS (Self-Training Robotic End-to-end Adaptive Multimodal Shared autonomy), a novel framework that leverages deep reinforcement learning to tackle this challenge in biosignal based robotic control systems. STREAMS blends environmental information and synthetic user input into a Deep Q Learning Network (DQN) pipeline for an interactive end-to-end and self-training mechanism to produce smooth trajectories for the control of end-effector based robots. The proposed framework achieved a high performance of 98% in simulation with dynamic target estimation and acquisition without any pre-existing datasets. In a zero-shot sim-to-real user study with five participants controlling a physical robotic arm with noisy head movements, STREAMS (as an assistive mode) demonstrated significant improvements in trajectory stabilization, user satisfaction, and task performance, with a success rate of 83% compared with 44% in manual mode without any task support. STREAMS seeks to improve biosignal based assistive robotic controls by offering an interactive, end-to-end solution that stabilizes end-effector trajectories, enhancing task performance and accuracy.

Multiple Toddler Tracking in Indoor Videos

Nov 29, 2023

Somaieh Amraee, Bishoy Galoaa, Matthew Goodwin, Elaheh Hatamimajoumerd, Sarah Ostadabbas

Abstract: Multiple toddler tracking (MTT) involves identifying and differentiating toddlers in video footage. While conventional multi-object tracking (MOT) algorithms are adept at tracking diverse objects, toddlers pose unique challenges due to their unpredictable movements, various poses, and similar appearance. Tracking toddlers in indoor environments introduces additional complexities such as occlusions and limited fields of view. In this paper, we address the challenges of MTT and propose MTTSort, a customized method built upon the DeepSort algorithm. MTTSort is designed to track multiple toddlers in indoor videos accurately. Our contributions include discussing the primary challenges in MTT, introducing a genetic algorithm to optimize hyperparameters, proposing an accurate tracking algorithm, and curating the MTTrack dataset using unbiased AI co-labeling techniques. We quantitatively compare MTTSort to state-of-the-art MOT methods on the MTTrack, DanceTrack, and MOT15 datasets. In our evaluation, the proposed method outperformed other MOT methods, achieving 0.98, 0.68, and 0.98 in multiple object tracking accuracy (MOTA), higher order tracking accuracy (HOTA), and identification F1 score (IDF1) metrics, respectively.
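
For readers unfamiliar with the tracking metrics named above, the sketch below computes two of them from their standard definitions: MOTA = 1 - (FN + FP + IDSW) / GT and IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN). The counts in the example are made up for illustration and are not taken from the paper; HOTA is omitted because it requires averaging over localization thresholds.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt_boxes: int) -> float:
    """Multiple object tracking accuracy: 1 - (FN + FP + IDSW) / GT."""
    if num_gt_boxes <= 0:
        raise ValueError("num_gt_boxes must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt_boxes


def idf1(id_true_pos: int, id_false_pos: int, id_false_neg: int) -> float:
    """Identification F1: 2*IDTP / (2*IDTP + IDFP + IDFN)."""
    denom = 2 * id_true_pos + id_false_pos + id_false_neg
    if denom == 0:
        raise ValueError("at least one count must be non-zero")
    return 2 * id_true_pos / denom


# Hypothetical counts, chosen only to show the arithmetic.
print(mota(false_negatives=10, false_positives=5, id_switches=5, num_gt_boxes=1000))
print(idf1(id_true_pos=98, id_false_pos=2, id_false_neg=2))
```

MOTA can be negative when the number of errors exceeds the number of ground-truth boxes, so it is not a percentage in the usual sense.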

Challenges in Video-Based Infant Action Recognition: A Critical Examination of the State of the Art

Nov 21, 2023

Elaheh Hatamimajoumerd, Pooria Daneshvar Kakhaki, Xiaofei Huang, Lingfei Luan, Somaieh Amraee, Sarah Ostadabbas

Abstract: Automated human action recognition, a burgeoning field within computer vision, boasts diverse applications spanning surveillance, security, human-computer interaction, tele-health, and sports analysis. Precise action recognition in infants serves a multitude of pivotal purposes, encompassing safety monitoring, developmental milestone tracking, early intervention for developmental delays, fostering parent-infant bonds, advancing computer-aided diagnostics, and contributing to the scientific comprehension of child development. This paper delves into the intricacies of infant action recognition, a domain that has remained relatively uncharted despite the accomplishments in adult action recognition. In this study, we introduce a groundbreaking dataset called "InfActPrimitive", encompassing five significant infant milestone action categories, and we incorporate specialized preprocessing for infant data. We conducted an extensive comparative analysis employing cutting-edge skeleton-based action recognition models using this dataset. Our findings reveal that, although the PoseC3D model achieves the highest accuracy at approximately 71%, the remaining models struggle to accurately capture the dynamics of infant actions. This highlights a substantial knowledge gap between infant and adult action recognition domains and the urgent need for data-efficient pipeline models.

Subtle Signals: Video-based Detection of Infant Non-nutritive Sucking as a Neurodevelopmental Cue

Oct 24, 2023

Shaotong Zhu, Michael Wan, Sai Kumar Reddy Manne, Emily Zimmerman, Sarah Ostadabbas

Abstract: Non-nutritive sucking (NNS), which refers to the act of sucking on a pacifier, finger, or similar object without nutrient intake, plays a crucial role in assessing healthy early development. In the case of preterm infants, NNS behavior is a key component in determining their readiness for feeding. In older infants, the characteristics of NNS behavior offer valuable insights into neural and motor development. Additionally, NNS activity has been proposed as a potential safeguard against sudden infant death syndrome (SIDS). However, the clinical application of NNS assessment is currently hindered by labor-intensive and subjective finger-in-mouth evaluations. Consequently, researchers often resort to expensive pressure transducers for objective NNS signal measurement. To enhance the accessibility and reliability of NNS signal monitoring for both clinicians and researchers, we introduce a vision-based algorithm designed for non-contact detection of NNS activity using baby monitor footage in natural settings. Our approach involves a comprehensive exploration of optical flow and temporal convolutional networks, enabling the detection and amplification of subtle infant-sucking signals. We successfully classify short video clips of uniform length into NNS and non-NNS periods. Furthermore, we investigate manual and learning-based techniques to piece together local classification results, facilitating the segmentation of longer mixed-activity videos into NNS and non-NNS segments of varying duration. Our research introduces two novel datasets of annotated infant videos, including one sourced from our clinical study featuring 19 infant subjects and 183 hours of overnight baby monitor footage.
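
The step of piecing local clip classifications into variable-length segments can be illustrated with a simple rule-based merge. The sketch below joins consecutive fixed-length clips that share a label into one segment. It is a minimal stand-in for the "manual" aggregation the abstract mentions, not the authors' method; the function name `merge_clip_labels` and the 2.5-second clip length in the example are assumptions.

```python
from typing import List, Tuple


def merge_clip_labels(labels: List[int],
                      clip_seconds: float) -> List[Tuple[float, float, int]]:
    """Merge consecutive equal clip labels into (start_s, end_s, label) segments.

    labels[i] is the predicted class of the i-th non-overlapping clip
    (1 = NNS, 0 = non-NNS); every clip lasts clip_seconds.
    """
    segments: List[Tuple[float, float, int]] = []
    for i, label in enumerate(labels):
        start = i * clip_seconds
        end = start + clip_seconds
        if segments and segments[-1][2] == label:
            # Same label as the previous clip: extend the open segment.
            segments[-1] = (segments[-1][0], end, label)
        else:
            segments.append((start, end, label))
    return segments


# Six hypothetical clip predictions at an assumed 2.5 s per clip.
for segment in merge_clip_labels([0, 0, 1, 1, 1, 0], clip_seconds=2.5):
    print(segment)
```

A practical pipeline would likely smooth isolated single-clip flips before merging, since one misclassified clip otherwise splits a long segment in two.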

Automatic Infant Respiration Estimation from Video: A Deep Flow-based Algorithm and a Novel Public Benchmark

Jul 24, 2023

Sai Kumar Reddy Manne, Shaotong Zhu, Sarah Ostadabbas, Michael Wan

Abstract: Respiration is a critical vital sign for infants, and continuous respiratory monitoring is particularly important for newborns. However, neonates are sensitive, and contact-based sensors present challenges in comfort, hygiene, and skin health, especially for preterm babies. As a step toward fully automatic, continuous, and contactless respiratory monitoring, we develop a deep-learning method for estimating respiratory rate and waveform from plain video footage in natural settings. Our automated infant respiration flow-based network (AIRFlowNet) combines video-extracted optical flow input and spatiotemporal convolutional processing tuned to the infant domain. We support our model with the first public annotated infant respiration dataset with 125 videos (AIR-125), drawn from eight infant subjects, with varied pose, lighting, and camera conditions. We include manual respiration annotations and optimize AIRFlowNet training on them using a novel spectral bandpass loss function. When trained and tested on the AIR-125 infant data, our method significantly outperforms other state-of-the-art methods in respiratory rate estimation, achieving a mean absolute error of approximately 2.9 breaths per minute, compared to approximately 4.7 to 6.2 for other public models designed for adult subjects and more uniform environments.
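
To make the evaluation concrete, the sketch below shows one generic way to turn a respiration waveform into a rate in breaths per minute (pick the strongest spectral peak inside a plausible frequency band) and how a mean absolute error over several videos is computed. It does not reproduce AIRFlowNet or its spectral bandpass loss; the band limits of 0.3 to 1.0 Hz, the function names, and the synthetic sine input are all assumptions for illustration.

```python
import cmath
import math
from typing import Sequence


def respiratory_rate_bpm(waveform: Sequence[float], fs: float,
                         lo_hz: float = 0.3, hi_hz: float = 1.0) -> float:
    """Estimate breaths per minute from the dominant in-band DFT frequency."""
    n = len(waveform)
    mean = sum(waveform) / n
    centered = [x - mean for x in waveform]  # remove the DC offset

    best_k, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        freq = k * fs / n
        if freq < lo_hz or freq > hi_hz:
            continue  # ignore bins outside the assumed respiration band
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    if best_k is None:
        raise ValueError("no DFT bin falls inside the requested band")
    return 60.0 * best_k * fs / n


def mean_absolute_error(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Average absolute difference between predicted and reference rates."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


# Synthetic check: a 0.5 Hz sine sampled at 5 Hz for 60 s is 30 breaths/min.
fs = 5.0
signal = [math.sin(2 * math.pi * 0.5 * t / fs) for t in range(300)]
print(respiratory_rate_bpm(signal, fs))
print(mean_absolute_error([30.0, 42.0], [33.0, 40.0]))
```

The direct DFT here is quadratic in the signal length, which is fine for a short illustration; a real pipeline would normally use an FFT routine.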

Bridging the Domain Gap between Synthetic and Real-World Data for Autonomous Driving

Jun 05, 2023

Xiangyu Bai, Yedi Luo, Le Jiang, Aniket Gupta, Pushyami Kaveti, Hanumant Singh, Sarah Ostadabbas

Abstract: Modern autonomous systems require extensive testing to ensure reliability and build trust in ground vehicles. However, testing these systems in the real world is challenging due to the lack of large and diverse datasets, especially in edge cases. Therefore, simulations are necessary for their development and evaluation. However, existing open-source simulators often exhibit a significant gap between synthetic and real-world domains, leading to deteriorated mobility performance and reduced platform reliability when using simulation data. To address this issue, our Scoping Autonomous Vehicle Simulation (SAVeS) platform benchmarks the performance of simulated environments for autonomous ground vehicle testing between synthetic and real-world domains. Our platform aims to quantify the domain gap and enable researchers to develop and test autonomous systems in a controlled environment. Additionally, we propose using domain adaptation technologies to address the domain gap between synthetic and real-world data with our SAVeS+ extension. Our results demonstrate that SAVeS+ is effective in helping to close the gap between synthetic and real-world domains and yields comparable performance for models trained with processed synthetic datasets to those trained on real-world datasets of the same scale. This paper highlights our efforts to quantify and address the domain gap between synthetic and real-world data for autonomy simulation. By enabling researchers to develop and test autonomous systems in a controlled environment, we hope to bring autonomy simulation one step closer to realization.

SPAC-Net: Synthetic Pose-aware Animal ControlNet for Enhanced Pose Estimation

May 31, 2023

Le Jiang, Sarah Ostadabbas

Abstract: Animal pose estimation has become a crucial area of research, but the scarcity of annotated data is a significant challenge in developing accurate models. Synthetic data has emerged as a promising alternative, but it frequently exhibits domain discrepancies with real data. Style transfer algorithms have been proposed to address this issue, but they suffer from insufficient spatial correspondence, leading to the loss of label information. In this work, we present a new approach called Synthetic Pose-aware Animal ControlNet (SPAC-Net), which incorporates ControlNet into the previously proposed Prior-Aware Synthetic animal data generation (PASyn) pipeline. We leverage the plausible pose data generated by the Variational Auto-Encoder (VAE)-based data generation pipeline as input for the ControlNet Holistically-nested Edge Detection (HED) boundary task model to generate synthetic data with pose labels that are closer to real data, making it possible to train a high-precision pose estimation network without the need for real data. In addition, we propose the Bi-ControlNet structure to separately detect the HED boundary of animals and backgrounds, improving the precision and stability of the generated data. Using the SPAC-Net pipeline, we generate synthetic zebra and rhino images and test them on the AP10K real dataset, demonstrating superior performance compared to using only real images or synthetic data generated by other methods. Our work demonstrates the potential for synthetic data to overcome the challenge of limited annotated data in animal pose estimation.

* arXiv admin note: text overlap with arXiv:2208.13944

A Video-based End-to-end Pipeline for Non-nutritive Sucking Action Recognition and Segmentation in Young Infants

Mar 29, 2023

Shaotong Zhu, Michael Wan, Elaheh Hatamimajoumerd, Kashish Jain, Samuel Zlota, Cholpady Vikram Kamath, Cassandra B. Rowan, Emma C. Grace, Matthew S. Goodwin, Marie J. Hayes (+3 more)

Abstract: We present an end-to-end computer vision pipeline to detect non-nutritive sucking (NNS), an infant sucking pattern with no nutrition delivered, as a potential biomarker for developmental delays, using off-the-shelf baby monitor video footage. One barrier to clinical (or algorithmic) assessment of NNS stems from its sparsity, requiring experts to wade through hours of footage to find minutes of relevant activity. Our NNS activity segmentation algorithm solves this problem by identifying periods of NNS with high certainty: up to 94.0% average precision and 84.9% average recall across 30 heterogeneous 60 s clips, drawn from our manually annotated NNS clinical in-crib dataset of 183 hours of overnight baby monitor footage from 19 infants. Our method is based on an underlying NNS action recognition algorithm, which uses spatiotemporal deep learning networks and infant-specific pose estimation, achieving 94.9% accuracy in binary classification of 960 balanced 2.5 s NNS vs. non-NNS clips. Tested on our second, independent, and public NNS in-the-wild dataset, NNS recognition classification reaches 92.3% accuracy, and NNS segmentation achieves 90.8% precision and 84.2% recall.
