This work aims to design a low complexity spoken command recognition (SCR) system by considering different trade-offs between the number of model parameters and classification accuracy. More specifically, we exploit a deep hybrid architecture of a tensor-train (TT) network to build an end-to-end SRC pipeline. Our command recognition system, namely CNN+(TT-DNN), is composed of convolutional layers at the bottom for spectral feature extraction and TT layers at the top for command classification. Compared with a traditional end-to-end CNN baseline for SCR, our proposed CNN+(TT-DNN) model replaces fully connected (FC) layers with TT ones and it can substantially reduce the number of model parameters while maintaining the baseline performance of the CNN model. We initialize the CNN+(TT-DNN) model in a randomized manner or based on a well-trained CNN+DNN, and assess the CNN+(TT-DNN) models on the Google Speech Command Dataset. Our experimental results show that the proposed CNN+(TT-DNN) model attains a competitive accuracy of 96.31% with 4 times fewer model parameters than the CNN model. Furthermore, the CNN+(TT-DNN) model can obtain a 97.2% accuracy when the number of parameters is increased.
We are approaching a future where social robots will progressively become widespread in many aspects of our daily lives, including education, healthcare, work, and personal use. All of such practical applications require that humans and robots collaborate in human environments, where social interaction is unavoidable. Along with verbal communication, successful social interaction is closely coupled with the interplay between nonverbal perception and action mechanisms, such as observation of gaze behaviour and following their attention, coordinating the form and function of hand gestures. Humans perform nonverbal communication in an instinctive and adaptive manner, with no effort. For robots to be successful in our social landscape, they should therefore engage in social interactions in a humanlike way, with increasing levels of autonomy. In particular, nonverbal gestures are expected to endow social robots with the capability of emphasizing their speech, or showing their intentions. Motivated by this, our research sheds a light on modeling human behaviors in social interactions, specifically, forecasting human nonverbal social signals during dyadic interactions, with an overarching goal of developing robotic interfaces that can learn to imitate human dyadic interactions. Such an approach will ensure the messages encoded in the robot gestures could be perceived by interacting partners in a facile and transparent manner, which could help improve the interacting partner perception and makes the social interaction outcomes enhanced.
It is presented here a machine learning-based (ML) natural language processing (NLP) approach capable to automatically recognize and extract categorical and numerical parameters from a corpus of articles. The approach (named a.RIX) operates with a concomitant/interchangeable use of ML models such as neuron networks (NNs), latent semantic analysis (LSA), naive-Bayes classifiers (NBC), and a pattern recognition model using regular expression (REGEX). A corpus of 7,873 scientific articles dealing with natural products (NPs) was used to demonstrate the efficiency of the a.RIX engine. The engine automatically extracts categorical and numerical parameters such as (i) the plant species from which active molecules are extracted, (ii) the microorganisms species for which active molecules can act against, and (iii) the values of minimum inhibitory concentration (MIC) against these microorganisms. The parameters are extracted without part-of-speech tagging (POS) and named entity recognition (NER) approaches (i.e. without the need of text annotation), and the models training is performed with unsupervised approaches. In this way, a.RIX can be essentially used on articles from any scientific field. Finally, it can potentially make obsolete the current article reviewing process in some areas, especially those in which machine learning models capture texts structure, text semantics, and latent knowledge.
Convolutional neural networks (CNNs), inspired by biological visual cortex systems, are a powerful category of artificial neural networks that can extract the hierarchical features of raw data to greatly reduce the network parametric complexity and enhance the predicting accuracy. They are of significant interest for machine learning tasks such as computer vision, speech recognition, playing board games and medical diagnosis. Optical neural networks offer the promise of dramatically accelerating computing speed to overcome the inherent bandwidth bottleneck of electronics. Here, we demonstrate a universal optical vector convolutional accelerator operating beyond 10 TeraFLOPS (floating point operations per second), generating convolutions of images of 250,000 pixels with 8 bit resolution for 10 kernels simultaneously, enough for facial image recognition. We then use the same hardware to sequentially form a deep optical CNN with ten output neurons, achieving successful recognition of full 10 digits with 900 pixel handwritten digit images with 88% accuracy. Our results are based on simultaneously interleaving temporal, wavelength and spatial dimensions enabled by an integrated microcomb source. This approach is scalable and trainable to much more complex networks for demanding applications such as unmanned vehicle and real-time video recognition.
Sequential sensor data is generated in a wide variety of practical applications. A fundamental challenge involves learning effective classifiers for such sequential data. While deep learning has led to impressive performance gains in recent years in domains such as speech, this has relied on the availability of large datasets of sequences with high-quality labels. In many applications, however, the associated class labels are often extremely limited, with precise labelling/segmentation being too expensive to perform at a high volume. However, large amounts of unlabeled data may still be available. In this paper we propose a novel framework for semi-supervised learning in such contexts. In an unsupervised manner, change point detection methods can be used to identify points within a sequence corresponding to likely class changes. We show that change points provide examples of similar/dissimilar pairs of sequences which, when coupled with labeled, can be used in a semi-supervised classification setting. Leveraging the change points and labeled data, we form examples of similar/dissimilar sequences to train a neural network to learn improved representations for classification. We provide extensive synthetic simulations and show that the learned representations are superior to those learned through an autoencoder and obtain improved results on both simulated and real-world human activity recognition datasets.
Deep neural network models are becoming popular and have used in various tasks such as computer vision, speech recognition, and natural language processing. It is often the case that the training phase of a model is executed in one environment, while the inference phase is executed in another environment. This is because the optimization characteristics for each phase significantly differ. Therefore, it is critical to efficiently compile a trained model for inferencing on different environments. To represent neural network models, users often use Open Neural Network Exchange (ONNX) which is an open standard format for machine learning interoperability. We are developing a compiler for rewriting a model in ONNX into a standalone binary that is executable on different target hardwares such as x86 machines, IBM Power Systems, and IBM System Z. The compiler was written using Multi-level Intermediate Representation (MLIR), a modern compiler infrastructure. In particular, we introduce two internal representations: ONNX IR for representing ONNX operators, and Kernel IR for efficiently lowering ONNX operators into LLVM bitcode. In this paper, we will discuss the overall structure of our compiler and give some practical examples of converting ONNX operators and models. We also cover several issues related to endianness. Our framework is publicly available as an open source project under the ONNX project.
Deep Neural Networks (DNNs) are being used in various daily tasks such as object detection, speech processing, and machine translation. However, it is known that DNNs suffer from robustness problems -- perturbed inputs called adversarial samples leading to misbehaviors of DNNs. In this paper, we propose a black-box technique called Black-box Momentum Iterative Fast Gradient Sign Method (BMI-FGSM) to test the robustness of DNN models. The technique does not require any knowledge of the structure or weights of the target DNN. Compared to existing white-box testing techniques that require accessing model internal information such as gradients, our technique approximates gradients through Differential Evolution and uses approximated gradients to construct adversarial samples. Experimental results show that our technique can achieve 100% success in generating adversarial samples to trigger misclassification, and over 95% success in generating samples to trigger misclassification to a specific target output label. It also demonstrates better perturbation distance and better transferability. Compared to the state-of-the-art black-box technique, our technique is more efficient. Furthermore, we conduct testing on the commercial Aliyun API and successfully trigger its misbehavior within a limited number of queries, demonstrating the feasibility of real-world black-box attack.
Summaries generated from medical conversations can improve recall and understanding of care plans for patients and reduce documentation burden for doctors. Recent advancements in automatic speech recognition (ASR) and natural language understanding (NLU)offer potential solutions to generate these summaries automatically. In the current paper, we focus on two tasks: classifying utterances from medical conversations according to (i)the SOAP section and (ii) the speaker role, both fundamental building blocks along the path towards an end-to-end, automated SOAP note for medical conversations. We provide details on a dataset that contains human and ASR transcriptions of medical conversations and corresponding machine learning optimized SOAP notes. We then present a systematic and rigorous analysis in which we adapt an existing deep learning architecture to the two aforementioned tasks. The results suggest that modelling context in a hierarchical manner, which captures both word and utterance level context, yields substantial improvements on both classification tasks. Additionally, we develop and analyze a modular method for adapting our model to ASR output. Our work fills an important gap by providing a quantitative baseline for benchmarking future research on the automation of SOAP notes.We discuss its implications for future research on using deep learning to automate clinical documentation from medical conversations.