Abstract:To empower the iterative assessments involved during a person's rehabilitation, automated assessment of a person's abilities during daily activities requires temporally precise segmentation of fine-grained actions in therapy videos. Existing temporal action segmentation (TAS) models struggle to capture sub-second micro-movements while retaining exercise context, blurring rapid phase transitions and limiting reliable downstream assessment of motor recovery. We introduce Multi-Membership Temporal Attention (MMTA), a high-resolution temporal transformer for fine-grained rehabilitation assessment. Unlike standard temporal attention, which assigns each frame a single attention context per layer, MMTA lets each frame attend to multiple locally normalized temporal attention windows within the same layer. We fuse these concurrent temporal views via feature-space overlap resolution, preserving competing local contexts near transitions while enabling longer-range reasoning through layer-wise propagation. This increases boundary sensitivity without additional depth or multi-stage refinement. MMTA supports both video and wearable IMU inputs within a unified single-stage architecture, making it applicable to both clinical and home settings. MMTA consistently improves over the Global Attention transformer, boosting Edit Score by +1.3 (Video) and +1.6 (IMU) on StrokeRehab while further improving 50Salads by +3.3. Ablations confirm that performance gains stem from multi-membership temporal views rather than architectural complexity, offering a practical solution for resource-constrained rehabilitation assessment.




Abstract:Autism Spectrum Disorder (ASD) presents significant challenges in early diagnosis and intervention, impacting children and their families. With prevalence rates rising, there is a critical need for accessible and efficient screening tools. Leveraging machine learning (ML) techniques, in particular Temporal Action Localization (TAL), holds promise for automating ASD screening. This paper introduces a self-attention based TAL model designed to identify ASD-related behaviors in infant videos. Unlike existing methods, our approach simplifies complex modeling and emphasizes efficiency, which is essential for practical deployment in real-world scenarios. Importantly, this work underscores the importance of developing computer vision methods capable of operating in naturilistic environments with little equipment control, addressing key challenges in ASD screening. This study is the first to conduct end-to-end temporal action localization in untrimmed videos of infants with ASD, offering promising avenues for early intervention and support. We report baseline results of behavior detection using our TAL model. We achieve 70% accuracy for look face, 79% accuracy for look object, 72% for smile and 65% for vocalization.