Abstract: Social interactions play a crucial role in shaping human behavior, relationships, and societies. They encompass various forms of communication, such as verbal conversation, non-verbal gestures, facial expressions, and body language. In this work, we develop a novel computational approach to detect a foundational aspect of human social interaction, in-person verbal conversations, by leveraging audio and inertial data captured with a commodity smartwatch in acoustically challenging scenarios. To evaluate our approach, we conducted a lab study with 11 participants and a semi-naturalistic study with 24 participants. We analyzed machine learning and deep learning models with three different fusion methods, showing the advantages of fusing audio and inertial data to capture not only verbal cues but also non-verbal gestures in conversations. Furthermore, we performed a comprehensive set of evaluations across activities and sampling rates to demonstrate the benefits of multimodal sensing in specific contexts. Overall, our framework achieved a macro F1-score of 82.0$\pm$3.0% when detecting conversations in the lab and 77.2$\pm$1.8% in the semi-naturalistic setting.
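
To make the fusion idea concrete, the sketch below shows one plausible early-fusion baseline: hand-crafted audio and accelerometer features are concatenated per window and fed to a classifier. The feature choices, window sizes, and synthetic data are illustrative assumptions, not the paper's actual pipeline or models.

```python
# Minimal sketch of feature-level (early) fusion of audio and inertial data
# for binary conversation detection. All features and data here are toy
# stand-ins, not the paper's exact method.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def audio_features(window: np.ndarray) -> np.ndarray:
    """Toy audio features for one window: log-energy and zero-crossing rate."""
    log_energy = np.log(np.mean(window ** 2) + 1e-8)
    zcr = np.mean(np.abs(np.diff(np.sign(window)))) / 2
    return np.array([log_energy, zcr])

def imu_features(window: np.ndarray) -> np.ndarray:
    """Toy inertial features: per-axis mean and std of 3-axis accelerometer data."""
    return np.concatenate([window.mean(axis=0), window.std(axis=0)])

# Simulated windows: 1 s of 16 kHz audio and 50 Hz 3-axis accelerometer data.
n_windows = 200
X, y = [], []
for i in range(n_windows):
    label = i % 2  # pretend half the windows contain conversation
    audio = rng.normal(scale=1.0 + label, size=16_000)
    accel = rng.normal(scale=0.5 + 0.5 * label, size=(50, 3))
    # Early fusion: concatenate the audio and inertial feature vectors.
    X.append(np.concatenate([audio_features(audio), imu_features(accel)]))
    y.append(label)

X, y = np.array(X), np.array(y)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("CV macro F1:", cross_val_score(clf, X, y, cv=5, scoring="f1_macro").mean())
```

Other fusion strategies (e.g., late fusion of per-modality classifiers, or learned intermediate fusion in a neural network) follow the same pattern but combine the modalities at the decision or representation level instead.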
Abstract: Advancements in audio neural networks have established state-of-the-art results on downstream audio tasks. However, the black-box structure of these models makes it difficult to interpret the information encoded in their internal audio representations. In this work, we explore the semantic interpretability of audio embeddings extracted from these neural networks by leveraging CLAP, a contrastive learning model that maps audio and text into a shared embedding space. We implement a post-hoc method that transforms CLAP embeddings into sparse, concept-based representations that are semantically interpretable. Qualitative and quantitative evaluations show that the concept-based representations match or outperform the original audio embeddings on downstream tasks while providing interpretability. Additionally, we demonstrate that fine-tuning the concept-based representations can further improve their performance on downstream tasks. Lastly, we publish three audio-specific vocabularies for concept-based interpretability of audio embeddings.
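
As a rough illustration of how such a post-hoc transformation might look, the sketch below scores a CLAP audio embedding against text embeddings of a concept vocabulary and keeps only the top-k concepts as a sparse, human-readable representation. The random stand-in embeddings, the vocabulary, and the top-k sparsification are assumptions made for illustration; the paper's actual method and published vocabularies may differ.

```python
# Minimal sketch of turning a CLAP audio embedding into a sparse, concept-based
# representation: cosine-score the audio embedding against a concept vocabulary
# embedded with the text encoder, then keep only the top-k concepts.
# Embeddings below are random stand-ins, not real CLAP outputs.
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def concept_representation(audio_emb: np.ndarray,
                           concept_embs: np.ndarray,
                           top_k: int = 5) -> np.ndarray:
    """Cosine similarity to each concept, sparsified to the top-k entries."""
    sims = normalize(concept_embs) @ normalize(audio_emb)
    sparse = np.zeros_like(sims)
    top = np.argsort(sims)[-top_k:]
    sparse[top] = sims[top]
    return sparse

# Stand-ins for CLAP outputs: a 512-d audio embedding and a 100-concept
# vocabulary whose phrases (e.g., "dog barking", "speech") would be embedded
# with the CLAP text encoder in practice.
rng = np.random.default_rng(0)
audio_emb = rng.normal(size=512)
concept_embs = rng.normal(size=(100, 512))
vocab = [f"concept_{i}" for i in range(100)]

rep = concept_representation(audio_emb, concept_embs, top_k=5)
active = np.nonzero(rep)[0]
print({vocab[i]: round(float(rep[i]), 3) for i in active})
```

The resulting sparse vector can be used directly as a feature representation for downstream classifiers, with each nonzero dimension naming the concept it corresponds to.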