Abstract:Masks are essential in medical settings and during infectious outbreaks but significantly impair speech communication, especially in environments with background noise. Existing solutions often require substantial computational resources or compromise hygiene and comfort. We propose a novel sensing approach that captures only the wearer's voice by detecting mask surface vibrations using a piezoelectric sensor. Our developed device, MaskClip, employs a stainless steel clip with an optimally positioned piezoelectric sensor to selectively capture speech vibrations while inherently filtering out ambient noise. Evaluation experiments demonstrated superior performance with a low Character Error Rate of 6.1\% in noisy environments compared to conventional microphones. Subjective evaluations by 102 participants also showed high satisfaction scores. This approach shows promise for applications in settings where clear voice communication must be maintained while wearing protective equipment, such as medical facilities, cleanrooms, and industrial environments.
Abstract:Rich-text captions are essential to help communication for Deaf and hard-of-hearing (DHH) people, second-language learners, and those with autism spectrum disorder (ASD). They also preserve nuances when converting speech to text, enhancing the realism of presentation scripts and conversation or speech logs. However, current real-time captioning systems lack the capability to alter text attributes (ex. capitalization, sizes, and fonts) at the word level, hindering the accurate conveyance of speaker intent that is expressed in the tones or intonations of the speech. For example, ''YOU should do this'' tends to be considered as indicating ''You'' as the focus of the sentence, whereas ''You should do THIS'' tends to be ''This'' as the focus. This paper proposes a solution that changes the text decorations at the word level in real time. As a prototype, we developed an application that adjusts word size based on the loudness of each spoken word. Feedback from users implies that this system helped to convey the speaker's intent, offering a more engaging and accessible captioning experience.
Abstract:Whispering is a common privacy-preserving technique in voice-based interactions, but its effectiveness is limited in noisy environments. In conventional hardware- and software-based noise reduction approaches, isolating whispered speech from ambient noise and other speech sounds remains a challenge. We thus propose WhisperMask, a mask-type microphone featuring a large diaphragm with low sensitivity, making the wearer's voice significantly louder than the background noise. We evaluated WhisperMask using three key metrics: signal-to-noise ratio, quality of recorded voices, and speech recognition rate. Across all metrics, WhisperMask consistently outperformed traditional noise-suppressing microphones and software-based solutions. Notably, WhisperMask showed a 30% higher recognition accuracy for whispered speech recorded in an environment with 80 dB background noise compared with the pin microphone and earbuds. Furthermore, while a denoiser decreased the whispered speech recognition rate of these two microphones by approximately 20% at 30-60 dB noise, WhisperMask maintained a high performance even without denoising, surpassing the other microphones' performances by a significant margin.WhisperMask's design renders the wearer's voice as the dominant input and effectively suppresses background noise without relying on signal processing. This device allows for reliable voice interactions, such as phone calls and voice commands, in a wide range of noisy real-world scenarios while preserving user privacy.