
"speech": models, code, and papers

How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets

Jun 17, 2015
Zhiyun Lu, Avner May, Kuan Liu, Alireza Bagheri Garakani, Dong Guo, Aurélien Bellet, Linxi Fan, Michael Collins, Brian Kingsbury, Michael Picheny, Fei Sha

The computational complexity of kernel methods has often been a major barrier for applying them to large-scale learning problems. We argue that this barrier can be effectively overcome. In particular, we develop methods to scale up kernel models to successfully tackle large-scale learning problems that are so far only approachable by deep learning architectures. Based on the seminal work by Rahimi and Recht on approximating kernel functions with features derived from random projections, we advance the state-of-the-art by proposing methods that can efficiently train models with hundreds of millions of parameters, and learn optimal representations from multiple kernels. We conduct extensive empirical studies on problems from image recognition and automatic speech recognition, and show that the performance of our kernel models matches that of well-engineered deep neural nets (DNNs). To the best of our knowledge, this is the first time that a direct comparison between these two methods on large-scale problems is reported. Our kernel methods have several appealing properties: training with convex optimization, cost for training a single model comparable to DNNs, and significantly reduced total cost due to fewer hyperparameters to tune for model selection. Our contrastive study between these two very different but equally competitive models sheds light on fundamental questions such as how to learn good representations.
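The random-feature construction this abstract builds on (Rahimi and Recht) can be sketched in a few lines; the hyperparameter values below are illustrative, not the paper's:

```python
import numpy as np

def random_fourier_features(X, n_features=1000, gamma=1.0, seed=0):
    """Approximate an RBF kernel k(x, y) = exp(-gamma * ||x - y||^2) with
    explicit random features (Rahimi & Recht), so a linear model trained on
    the features behaves like a kernel machine at a fraction of the cost."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies sampled from the kernel's Fourier transform (a Gaussian).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# z(x)^T z(y) converges to k(x, y) as n_features grows.
X = np.random.default_rng(1).normal(size=(5, 3))
Z = random_fourier_features(X, n_features=20000, gamma=0.5)
K_approx = Z @ Z.T
K_exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
print(np.abs(K_approx - K_exact).max())  # small approximation error
```

Scaling to hundreds of millions of parameters then amounts to making `n_features` very large and training the linear model on top with distributed convex optimization.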


Gated Multimodal Fusion with Contrastive Learning for Turn-taking Prediction in Human-robot Dialogue

Apr 18, 2022
Jiudong Yang, Peiying Wang, Yi Zhu, Mingchao Feng, Meng Chen, Xiaodong He

Turn-taking, which aims to decide when the next speaker can start talking, is an essential component in building human-robot spoken dialogue systems. Previous studies indicate that multimodal cues can facilitate this challenging task. However, due to the paucity of public multimodal datasets, current methods are mostly limited to either unimodal features or simplistic multimodal ensemble models. In addition, the inherent class imbalance in real scenarios, e.g., a sentence ending with a short pause is usually regarded as the end of a turn, poses a great challenge to the turn-taking decision. In this paper, we first collect a large-scale annotated corpus for turn-taking with over 5,000 real human-robot dialogues in speech and text modalities. Then, a novel gated multimodal fusion mechanism is devised to seamlessly utilize various information for turn-taking prediction. More importantly, to tackle the data imbalance issue, we design a simple yet effective data augmentation method to construct negative instances without supervision and apply contrastive learning to obtain better feature representations. Extensive experiments demonstrate the superiority and competitiveness of our model over several state-of-the-art baselines.
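The gated fusion idea can be sketched as follows; the single sigmoid gate over concatenated embeddings is an assumed form for illustration (names `h_audio`, `h_text`, `W`, `b` are hypothetical), not necessarily the paper's exact mechanism:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h_audio, h_text, W, b):
    """Fuse two modality embeddings with a learned elementwise gate:
    the gate decides, per dimension, how much each modality contributes."""
    g = sigmoid(np.concatenate([h_audio, h_text]) @ W + b)  # gate in (0, 1)
    return g * h_audio + (1.0 - g) * h_text

rng = np.random.default_rng(0)
d = 8
h_a, h_t = rng.normal(size=d), rng.normal(size=d)
W, b = rng.normal(size=(2 * d, d)), np.zeros(d)
fused = gated_fusion(h_a, h_t, W, b)
print(fused.shape)  # (8,)
```

Because the gate output lies in (0, 1), each fused dimension is a convex combination of the two modalities, so neither one can be silently discarded unless the gate learns to do so.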

* Accepted by ICASSP 2022 


Shifted Chunk Encoder for Transformer Based Streaming End-to-End ASR

Mar 29, 2022
Fangyuan Wang, Bo Xu

Currently, there are three main Transformer-encoder-based streaming end-to-end (E2E) automatic speech recognition (ASR) approaches: time-restricted methods, chunk-wise methods, and memory-based methods. However, all of them have limitations in global context modeling, linear computational complexity, or model parallelism. In this work, we aim to build a single model that achieves the benefits of all three aspects for streaming E2E ASR. In particular, we propose a shifted chunk mechanism in place of the conventional chunk mechanism for the streaming Transformer and Conformer. This shifted chunk mechanism significantly enhances modeling power by allowing chunk self-attention to capture global context across local chunks, while keeping linear computational complexity and parallel trainability. We name the shifted-chunk Transformer and Conformer SChunk-Transformer and SChunk-Conformer, respectively, and verify their performance on the widely used AISHELL-1 benchmark. Experiments show that the SChunk-Transformer and SChunk-Conformer achieve CERs of 6.43% and 5.77%, respectively. This surpasses the existing chunk-wise and memory-based methods by a large margin, and is competitive even with state-of-the-art time-restricted methods, which have quadratic computational complexity.
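The core of the shifted chunk mechanism is an attention mask whose chunk boundaries alternate between layers, so context propagates across chunks over depth. A minimal sketch (not the paper's exact implementation):

```python
import numpy as np

def chunk_attention_mask(seq_len, chunk_size, shift=0):
    """Boolean mask where True = attention allowed. Frames attend only
    within their chunk; shifting the chunk grid by `shift` frames in
    alternating layers changes which frames share a chunk, letting
    information cross chunk boundaries while each layer stays O(n)."""
    chunk_id = (np.arange(seq_len) + shift) // chunk_size
    return chunk_id[:, None] == chunk_id[None, :]

m0 = chunk_attention_mask(8, 4, shift=0)  # chunks {0..3}, {4..7}
m1 = chunk_attention_mask(8, 4, shift=2)  # chunks {0,1}, {2..5}, {6,7}
print(m0[0, 3], m0[0, 4])  # True False: frame 0 cannot see frame 4 in layer 0
print(m1[1, 2])            # False: the shifted boundary splits frames 1 and 2
```

Stacking layers with `shift=0` and `shift=chunk_size // 2` alternately gives every frame a path to distant frames after a few layers, which is where the global-context benefit comes from.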


ASL Trigger Recognition in Mixed Activity/Signing Sequences for RF Sensor-Based User Interfaces

Nov 18, 2021
Emre Kurtoglu, Ali C. Gurbuz, Evie A. Malaia, Darrin Griffin, Chris Crawford, Sevgi Z. Gurbuz

The past decade has seen great advancements in speech recognition for control of interactive devices, personal assistants, and computer interfaces. However, Deaf and hard-of-hearing (HoH) individuals, whose primary mode of communication is sign language, cannot use voice-controlled interfaces. Although there has been significant work in video-based sign language recognition, video is not effective in the dark and has raised privacy concerns in the Deaf community when used in the context of human ambient intelligence. RF sensors have recently been proposed as a new modality that can be effective under circumstances where video is not. This paper considers the problem of recognizing a trigger sign (wake word) in the context of daily living, where gross motor activities are interwoven with signing sequences. The proposed approach exploits multiple RF data-domain representations (time-frequency, range-Doppler, and range-angle) for sequential classification of mixed motion data streams. The recognition accuracy of signs with varying kinematic properties is compared and used to make recommendations on appropriate trigger sign selection for RF-sensor-based user interfaces. The proposed approach achieves a trigger sign detection rate of 98.9% and a classification accuracy of 92% for 15 ASL words and 3 gross motor activities.

* 13 pages, 10 figures 


Curricular SincNet: Towards Robust Deep Speaker Recognition by Emphasizing Hard Samples in Latent Space

Aug 21, 2021
Labib Chowdhury, Mustafa Kamal, Najia Hasan, Nabeel Mohammed

Deep learning models have become an increasingly preferred option for biometric recognition systems, such as speaker recognition. SincNet, a deep neural network architecture, gained popularity in speaker recognition tasks due to its parameterized sinc functions, which allow it to work directly on the speech signal. The original SincNet architecture uses the softmax loss, which may not be the most suitable choice for recognition-based tasks. Such loss functions neither impose inter-class margins nor differentiate between easy and hard training samples. Curriculum learning, particularly with angular margin-based losses, has proven very successful in other biometric applications such as face recognition. The advantage of such curriculum learning-based techniques is that they impose inter-class margins while taking easy and hard samples into account. In this paper, we propose Curricular SincNet (CL-SincNet), an improved SincNet model in which a curricular loss function is used to train the SincNet architecture. The proposed model is evaluated on multiple datasets using intra-dataset and inter-dataset evaluation protocols. In both settings, the model performs competitively with previously published work. In inter-dataset testing, it achieves the best overall results, with a 4% error-rate reduction compared to SincNet and other published work.
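The angular margin-based loss family the abstract refers to can be sketched as follows. This shows the ArcFace-style additive margin only; the curricular variant additionally re-weights hard negative classes during training, which is omitted here, so this is an illustrative sketch rather than the paper's loss:

```python
import numpy as np

def angular_margin_logits(embedding, class_weights, label, margin=0.2, scale=30.0):
    """Additive angular margin logits: normalize embedding and class
    centers, then add a margin to the target class's angle so the model
    must separate classes by more than a plain softmax would require."""
    e = embedding / np.linalg.norm(embedding)
    W = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = W @ e                                    # cosine to each class center
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    logits = cos.copy()
    logits[label] = np.cos(theta[label] + margin)  # penalize the target angle
    return scale * logits

# With margin > 0 the target-class logit shrinks, so training must enlarge
# the angular gap between classes to recover the same softmax confidence.
emb = np.array([1.0, 0.0])
W = np.array([[1.0, 0.0], [0.0, 1.0]])
with_m = angular_margin_logits(emb, W, label=0, margin=0.2)
no_m = angular_margin_logits(emb, W, label=0, margin=0.0)
print(with_m[0] < no_m[0])  # True
```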

* Accepted at 20th International Conference of the Biometrics Special Interest Group (BIOSIG 2021) 


Data Quality Measures and Efficient Evaluation Algorithms for Large-Scale High-Dimensional Data

Jan 05, 2021
Hyeongmin Cho, Sangkyun Lee

Machine learning has proven effective in various application areas, such as object and speech recognition on mobile systems. Since the availability of large training data is critical to machine learning success, many datasets are being disclosed and published online. From a data consumer's or manager's point of view, measuring data quality is an important first step in the learning process: we need to determine which datasets to use, update, and maintain. However, few practical ways to measure data quality are available today, especially for large-scale high-dimensional data such as images and videos. This paper proposes two data quality measures that compute class separability and in-class variability, two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping, with statistical benefits on large-scale high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale high-dimensional datasets.
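A class-separability measure accelerated by random projection can be sketched as follows; the between/within-scatter ratio used here is a generic choice in the spirit of the abstract, not the paper's exact definition:

```python
import numpy as np

def separability_score(X, y, proj_dim=16, seed=0):
    """Ratio of between-class to within-class variance, computed after a
    cheap random projection so the cost depends on proj_dim rather than
    the original (possibly huge) feature dimension."""
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(X.shape[1], proj_dim)) / np.sqrt(proj_dim)
    Z = X @ P                                # random low-dimensional sketch
    mu = Z.mean(axis=0)
    between = within = 0.0
    for c in np.unique(y):
        Zc = Z[y == c]
        between += len(Zc) * np.sum((Zc.mean(axis=0) - mu) ** 2)
        within += np.sum((Zc - Zc.mean(axis=0)) ** 2)
    return between / within

# Well-separated classes should score higher than overlapping ones.
rng = np.random.default_rng(1)
X_far = np.vstack([rng.normal(0, 1, (50, 20)), rng.normal(5, 1, (50, 20))])
X_near = np.vstack([rng.normal(0, 1, (50, 20)), rng.normal(0.5, 1, (50, 20))])
y = np.array([0] * 50 + [1] * 50)
print(separability_score(X_far, y) > separability_score(X_near, y))  # True
```

Averaging the score over several projection seeds (the bootstrapping the abstract mentions) would reduce the variance introduced by any single random projection.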


Semantic Task Planning for Service Robots in Open World

Nov 01, 2020
Guowei Cui, Wei Shuai, Xiaoping Chen

In this paper, we present a planning system based on semantic reasoning for a general-purpose service robot, aimed at behaving more intelligently in domains that contain incomplete information, under-specified goals, and dynamic changes. First, two kinds of data are generated from speech by a natural language processing module: (i) action frames and their relationships; and (ii) modifiers indicating some property or characteristic of a variable in an action frame. Next, the goals of the task are generated from these action frames and modifiers. These goals are represented as AI symbols, combining world state and domain knowledge, and are used to generate plans by an Answer Set Programming solver. Finally, the actions of the plan are executed one by one, and continuous sensing grounds useful information, which enables the robot to use contingent knowledge to adapt to dynamic changes and faults. For each action in the plan, the planner obtains its preconditions and effects from domain knowledge, so during the execution of the task, environmental changes that conflict with the actions, whether with the action being performed or with subsequent actions, can be detected and handled as early as possible. A series of case studies is used to evaluate the system and verify its ability to acquire knowledge through dialogue with users, solve problems with the acquired causal knowledge, and plan for complex tasks autonomously in the open world.
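The early-conflict detection described above can be sketched with a STRIPS-style plan monitor that checks the preconditions of all remaining actions, not just the next one. This is an illustrative sketch (the paper itself uses an Answer Set Programming solver; the action and fluent names are hypothetical):

```python
def monitor_plan(plan, state):
    """Simulate the remaining plan against the currently sensed state.
    Returns (index, missing_preconditions) for the earliest action whose
    preconditions fail, or (None, set()) if the whole plan is still valid."""
    for i, action in enumerate(plan):
        missing = action["preconditions"] - state
        if missing:
            return i, missing
        # Progress the state with the action's effects (STRIPS add/delete).
        state = (state - action["deletes"]) | action["adds"]
    return None, set()

plan = [
    {"preconditions": {"at_kitchen"}, "adds": {"holding_cup"}, "deletes": set()},
    {"preconditions": {"holding_cup"}, "adds": {"at_table"}, "deletes": {"at_kitchen"}},
]
print(monitor_plan(plan, {"at_kitchen"}))  # (None, set()) -> plan still valid
print(monitor_plan(plan, {"at_table"}))    # (0, {'at_kitchen'}) -> early conflict
```

Because the monitor simulates forward through the effects, a sensed change that only breaks a later action is still caught immediately, which is the "as early as possible" behavior the abstract describes.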


Dense Relational Image Captioning via Multi-task Triple-Stream Networks

Oct 12, 2020
Dong-Jin Kim, Tae-Hyun Oh, Jinsoo Choi, In So Kweon

We introduce dense relational captioning, a novel image captioning task that aims to generate multiple captions with respect to relational information between objects in a visual scene. Relational captioning provides explicit descriptions of each relationship between object combinations. This framework is advantageous in both the diversity and the amount of information, leading to a comprehensive image understanding based on relationships, e.g., relational proposal generation. For relational understanding between objects, part-of-speech (POS, i.e., subject-object-predicate) categories can provide valuable prior information to guide the causal sequence of words in a caption. We train our framework not only to generate captions but also to predict the POS of each word. To this end, we propose the multi-task triple-stream network (MTTSNet), which consists of three recurrent units, one per POS category, trained by jointly predicting the correct captions and the POS of each word. In addition, we found that the performance of MTTSNet can be improved by modulating the object embeddings with an explicit relational module. We demonstrate that our proposed model generates more diverse and richer captions, via extensive experimental analysis on large-scale datasets and several metrics. We additionally extend the analysis with an ablation study and applications to holistic image captioning, scene graph generation, and retrieval tasks.

* Journal extension of our CVPR 2019 paper (arXiv:1903.05942). Source code:


Discovering Lexical Similarity Through Articulatory Feature-based Phonetic Edit Distance

Aug 16, 2020
Tafseer Ahmed, Muhammad Suffian Nizami, Muhammad Yaseen Khan

Lexical Similarity (LS) between two languages uncovers many interesting linguistic insights, such as genetic relationship, mutual intelligibility, and the usage of one language's vocabulary in another. There are various methods through which LS is evaluated. In this regard, this paper presents a method of Phonetic Edit Distance (PED) that uses a soft comparison of letters based on the articulatory features associated with them. The system converts words into the corresponding International Phonetic Alphabet (IPA), followed by the conversion of IPA into its set of articulatory features. The lists of articulatory feature sets are then compared using the proposed method. As an example, PED gives an edit distance of 0.82 between the German word vater and the Persian word pidar, and 0.93 between the Hebrew word shalom and the Arabic word salaam, whereas their plain IPA-based edit distances are 4 and 2, respectively. Experiments are performed with six languages (Arabic, Hindi, Marathi, Persian, Sanskrit, and Urdu): we extracted part-of-speech-wise word lists from the Universal Dependencies corpora and evaluated the LS for every pair of languages. With the proposed approach, we find genetic affinity, similarity, and borrowed/loan words among these languages despite script differences and sound variation phenomena.
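The soft-substitution idea can be sketched as a weighted Levenshtein distance whose substitution cost is the fraction of articulatory features two phones do not share. The feature sets and the Jaccard-style cost below are toy assumptions for illustration, not the paper's actual IPA feature inventory or cost function:

```python
def phonetic_edit_distance(a, b, features):
    """Levenshtein distance over phone strings with a soft substitution
    cost: phones sharing more articulatory features are cheaper to swap."""
    def sub_cost(p, q):
        fp, fq = features[p], features[q]
        return 1.0 - len(fp & fq) / len(fp | fq)  # Jaccard dissimilarity
    n, m = len(a), len(b)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = float(i)                        # deletions
    for j in range(1, m + 1):
        D[0][j] = float(j)                        # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + 1.0,      # delete a[i-1]
                          D[i][j - 1] + 1.0,      # insert b[j-1]
                          D[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return D[n][m]

features = {  # toy articulatory feature sets, not a real IPA inventory
    'v': {'labiodental', 'fricative', 'voiced'},
    'f': {'labiodental', 'fricative'},
    'p': {'bilabial', 'plosive'},
    'a': {'open', 'vowel'},
}
print(phonetic_edit_distance('va', 'fa', features))  # ≈ 0.33: v and f are close
print(phonetic_edit_distance('va', 'pa', features))  # 1.0: v and p share nothing
```

A plain edit distance would score both pairs identically (one substitution each); the feature-based cost is what lets PED rank vater/pidar as related while keeping unrelated pairs apart.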
