Wearing masks is crucial to preventing the spread of potentially pathogen-containing droplets, especially amidst the COVID-19 pandemic. However, not all face coverings are equally effective and most experiments evaluating mask efficacy are very expensive and complex to operate. In this work, a novel, home-built, low-cost, and accurate metrology to visualize orally-generated fluid droplets has been developed. The project includes setup optimization, data collection, data analysis, and applications. The final materials chosen were quinine-containing tonic water, 397-402 nm wavelength UV tube lights, an iPhone and tripod, string, and a spray bottle. The experiment took place in a dark closet with a dark background. During data collection, the test subject first wets their mouth with an ingestible fluorescent liquid (tonic water) and speaks, sneezes, or coughs under UV darklight. The fluorescence from the tonic water droplets generated can be visualized, recorded by an iPhone 8+ camera in slo-mo (240 fps), and analyzed. The software VLC is used for frame separation and Fiji/ImageJ is used for image processing and analysis. The dependencies of oral fluid droplet generation and propagation on different phonics, the loudness of speech, and the type of expiratory event were studied in detail and established using the metrology developed. The efficacy of different types of masks was evaluated and correlated with fabric microstructures. All masks blocked droplets to varying extent. Masks with smaller-sized pores and thicker material were found to block the most droplets. This low-cost technique can be easily constructed at home using materials that total to a cost of less than $50. Despite the minimal cost, the method is very accurate and the data is quantifiable.
Although automatic emotion recognition from facial expressions and speech has made remarkable progress, emotion recognition from body gestures has not been thoroughly explored. People often use a variety of body language to express emotions, and it is difficult to enumerate all emotional body gestures and collect enough samples for each category. Therefore, recognizing new emotional body gestures is critical for better understanding human emotions. However, the existing methods fail to accurately determine which emotional state a new body gesture belongs to. In order to solve this problem, we introduce a Generalized Zero-Shot Learning (GZSL) framework, which consists of three branches to infer the emotional state of the new body gestures with only their semantic descriptions. The first branch is a Prototype-Based Detector (PBD) which is used to determine whether an sample belongs to a seen body gesture category and obtain the prediction results of the samples from the seen categories. The second branch is a Stacked AutoEncoder (StAE) with manifold regularization, which utilizes semantic representations to predict samples from unseen categories. Note that both of the above branches are for body gesture recognition. We further add an emotion classifier with a softmax layer as the third branch in order to better learn the feature representations for this emotion classification task. The input features for these three branches are learned by a shared feature extraction network, i.e., a Bidirectional Long Short-Term Memory Networks (BLSTM) with a self-attention module. We treat these three branches as subtasks and use multi-task learning strategies for joint training. The performance of our framework on an emotion recognition dataset is significantly superior to the traditional method of emotion classification and state-of-the-art zero-shot learning methods.
Active learning (AL) attempts to maximize the performance gain of the model by marking the fewest samples. Deep learning (DL) is greedy for data and requires a large amount of data supply to optimize massive parameters, so that the model learns how to extract high-quality features. In recent years, due to the rapid development of internet technology, we are in an era of information torrents and we have massive amounts of data. In this way, DL has aroused strong interest of researchers and has been rapidly developed. Compared with DL, researchers have relatively low interest in AL. This is mainly because before the rise of DL, traditional machine learning requires relatively few labeled samples. Therefore, early AL is difficult to reflect the value it deserves. Although DL has made breakthroughs in various fields, most of this success is due to the publicity of the large number of existing annotation datasets. However, the acquisition of a large number of high-quality annotated datasets consumes a lot of manpower, which is not allowed in some fields that require high expertise, especially in the fields of speech recognition, information extraction, medical images, etc. Therefore, AL has gradually received due attention. A natural idea is whether AL can be used to reduce the cost of sample annotations, while retaining the powerful learning capabilities of DL. Therefore, deep active learning (DAL) has emerged. Although the related research has been quite abundant, it lacks a comprehensive survey of DAL. This article is to fill this gap, we provide a formal classification method for the existing work, and a comprehensive and systematic overview. In addition, we also analyzed and summarized the development of DAL from the perspective of application. Finally, we discussed the confusion and problems in DAL, and gave some possible development directions for DAL.
Understanding passenger intents from spoken interactions and car's vision (both inside and outside the vehicle) are important building blocks towards developing contextual dialog systems for natural interactions in autonomous vehicles (AV). In this study, we continued exploring AMIE (Automated-vehicle Multimodal In-cabin Experience), the in-cabin agent responsible for handling certain multimodal passenger-vehicle interactions. When the passengers give instructions to AMIE, the agent should parse such commands properly considering available three modalities (language/text, audio, video) and trigger the appropriate functionality of the AV system. We had collected a multimodal in-cabin dataset with multi-turn dialogues between the passengers and AMIE using a Wizard-of-Oz scheme via realistic scavenger hunt game. In our previous explorations, we experimented with various RNN-based models to detect utterance-level intents (set destination, change route, go faster, go slower, stop, park, pull over, drop off, open door, and others) along with intent keywords and relevant slots (location, position/direction, object, gesture/gaze, time-guidance, person) associated with the action to be performed in our AV scenarios. In this recent work, we propose to discuss the benefits of multimodal understanding of in-cabin utterances by incorporating verbal/language input (text and speech embeddings) together with the non-verbal/acoustic and visual input from inside and outside the vehicle (i.e., passenger gestures and gaze from in-cabin video stream, referred objects outside of the vehicle from the road view camera stream). Our experimental results outperformed text-only baselines and with multimodality, we achieved improved performances for utterance-level intent detection and slot filling.
Performance of end-to-end neural networks on a given hardware platform is a function of its compute and memory signature, which in-turn, is governed by a wide range of parameters such as topology size, primitives used, framework used, batching strategy, latency requirements, precision etc. Current benchmarking tools suffer from limitations such as a) being either too granular like DeepBench  (or) b) mandate a working implementation that is either framework specific or hardware-architecture specific or both (or) c) provide only high level benchmark metrics. In this paper, we present NTP (Neural Net Topology Profiler), a sophisticated benchmarking framework, to effectively identify memory and compute signature of an end-to-end topology on multiple hardware architectures, without the need for an actual implementation. NTP is tightly integrated with hardware specific benchmarking tools to enable exhaustive data collection and analysis. Using NTP, a deep learning researcher can quickly establish baselines needed to understand performance of an end-to-end neural network topology and make high level architectural decisions. Further, integration of NTP with frameworks like Tensorflow, Pytorch, Intel OpenVINO etc. allows for performance comparison along several vectors like a) Comparison of different frameworks on a given hardware b) Comparison of different hardware using a given framework c) Comparison across different heterogeneous hardware configurations for given framework etc. These capabilities empower a researcher to effortlessly make architectural decisions needed for achieving optimized performance on any hardware platform. The paper documents the architectural approach of NTP and demonstrates the capabilities of the tool by benchmarking Mozilla DeepSpeech, a popular Speech Recognition topology.
Brain-related disorders such as epilepsy can be diagnosed by analyzing electroencephalograms (EEG). However, manual analysis of EEG data requires highly trained clinicians, and is a procedure that is known to have relatively low inter-rater agreement (IRA). Moreover, the volume of the data and the rate at which new data becomes available make manual interpretation a time-consuming, resource-hungry, and expensive process. In contrast, automated analysis of EEG data offers the potential to improve the quality of patient care by shortening the time to diagnosis and reducing manual error. In this paper, we focus on one of the first steps in interpreting an EEG session - identifying whether the brain activity is abnormal or normal. To solve this task, we propose a novel recurrent neural network (RNN) architecture termed ChronoNet which is inspired by recent developments from the field of image classification and designed to work efficiently with EEG data. ChronoNet is formed by stacking multiple 1D convolution layers followed by deep gated recurrent unit (GRU) layers where each 1D convolution layer uses multiple filters of exponentially varying lengths and the stacked GRU layers are densely connected in a feed-forward manner. We used the recently released TUH Abnormal EEG Corpus dataset for evaluating the performance of ChronoNet. Unlike previous studies using this dataset, ChronoNet directly takes time-series EEG as input and learns meaningful representations of brain activity patterns. ChronoNet outperforms the previously reported best results by 7.79% thereby setting a new benchmark for this dataset. Furthermore, we demonstrate the domain-independent nature of ChronoNet by successfully applying it to classify speech commands.
The US National Institute of Standards and Technology (NIST) has been conducting a second iteration of the CTS challenge since August 2020. The current iteration of the CTS Challenge is a leaderboard-style speaker recognition evaluation using telephony data extracted from the unexposed portions of the Call My Net 2 (CMN2) and Multi-Language Speech (MLS) corpora collected by the LDC. The CTS Challenge is currently organized in a similar manner to the SRE19 CTS Challenge, offering only an open training condition using two evaluation subsets, namely Progress and Test. Unlike in the SRE19 Challenge, no training or development set was initially released, and NIST has publicly released the leaderboards on both subsets for the CTS Challenge. Which subset (i.e., Progress or Test) a trial belongs to is unknown to challenge participants, and each system submission needs to contain outputs for all of the trials. The CTS Challenge has also served, and will continue to do so, as a prerequisite for entrance to the regular SREs (such as SRE21). Since August 2020, a total of 53 organizations (forming 33 teams) from academia and industry have participated in the CTS Challenge and submitted more than 4400 valid system outputs. This paper presents an overview of the evaluation and several analyses of system performance for some primary conditions in the CTS Challenge. The CTS Challenge results thus far indicate remarkable improvements in performance due to 1) speaker embeddings extracted using large-scale and complex neural network architectures such as ResNets along with angular margin losses for speaker embedding extraction, 2) extensive data augmentation, 3) the use of large amounts of in-house proprietary data from a large number of labeled speakers, 4) long-duration fine-tuning.
Automating the recognition of outcomes reported in clinical trials using machine learning has a huge potential of speeding up access to evidence necessary in healthcare decision-making. Prior research has however acknowledged inadequate training corpora as a challenge for the Outcome detection (OD) task. Additionally, several contextualized representations like BERT and ELMO have achieved unparalleled success in detecting various diseases, genes, proteins, and chemicals, however, the same cannot be emphatically stated for outcomes, because these models have been relatively under-tested and studied for the OD task. We introduce "EBM-COMET", a dataset in which 300 PubMed abstracts are expertly annotated for clinical outcomes. Unlike prior related datasets that use arbitrary outcome classifications, we use labels from a taxonomy recently published to standardize outcome classifications. To extract outcomes, we fine-tune a variety of pre-trained contextualized representations, additionally, we use frozen contextualized and context-independent representations in our custom neural model augmented with clinically informed Part-Of-Speech embeddings and a cost-sensitive loss function. We adopt strict evaluation for the trained models by rewarding them for correctly identifying full outcome phrases rather than words within the entities i.e. given an outcome "systolic blood pressure", the models are rewarded a classification score only when they predict all 3 words in sequence, otherwise, they are not rewarded. We observe our best model (BioBERT) achieve 81.5\% F1, 81.3\% sensitivity and 98.0\% specificity. We reach a consensus on which contextualized representations are best suited for detecting outcomes from clinical-trial abstracts. Furthermore, our best model outperforms scores published on the original EBM-NLP dataset leader-board scores.