Speech emotion recognition is an important aspect of human-computer interaction. Prior work proposes various end-to-end models to improve the classification performance. However, most of them rely on the cross-entropy loss together with softmax as the supervision component, which does not explicitly encourage discriminative learning of features. In this paper, we introduce the contrastive loss function to encourage intra-class compactness and inter-class separability between learnable features. Furthermore, multiple feature selection methods and pairwise sample selection methods are evaluated. To verify the performance of the proposed system, we conduct experiments on The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, a common evaluation corpus. Experimental results reveal the advantages of the proposed method, which reaches 62.19% in the weighted accuracy and 63.21% in the unweighted accuracy. It outperforms the baseline system that is optimized without the contrastive loss function with 1.14% and 2.55% in the weighted accuracy and the unweighted accuracy, respectively.
Fuzzy systems have achieved great success in numerous applications. However, there are still many challenges in designing an optimal fuzzy system, e.g., how to efficiently train its parameters, how to improve its performance without adding too many parameters, how to balance the trade-off between cooperations and competitions among the rules, how to overcome the curse of dimensionality, etc. Literature has shown that by making appropriate connections between fuzzy systems and other machine learning approaches, good practices from other domains may be used to improve the fuzzy systems, and vice versa. This paper gives an overview on the functional equivalence between Takagi-Sugeno-Kang fuzzy systems and four classic machine learning approaches -- neural networks, mixture of experts, classification and regression trees, and stacking ensemble regression -- for regression problems. We also point out some promising new research directions, inspired by the functional equivalence, that could lead to solutions to the aforementioned problems. To our knowledge, this is so far the most comprehensive overview on the connections between fuzzy systems and other popular machine learning approaches, and hopefully will stimulate more hybridization between different machine learning algorithms.
Speech emotion recognition is an important aspect of human-computer interaction. Prior works propose various transfer learning approaches to deal with limited samples in speech emotion recognition. However, they require labeled data for the source task, which cost much effort to collect them. To solve this problem, we focus on the unsupervised task, predictive coding. Nearly unlimited data for most domains can be utilized. In this paper, we utilize the multi-layer Transformer model for the predictive coding, followed with transfer learning approaches to share knowledge of the pre-trained predictive model for speech emotion recognition. We conduct experiments on IEMOCAP, and experimental results reveal the advantages of the proposed method. Our method reaches 65.03% in the weighted accuracy, which also outperforms some currently advanced approaches.
We propose a semismooth Newton algorithm for pathwise optimization (SNAP) for the LASSO and Enet in sparse, high-dimensional linear regression. SNAP is derived from a suitable formulation of the KKT conditions based on Newton derivatives. It solves the semismooth KKT equations efficiently by actively and continuously seeking the support of the regression coefficients along the solution path with warm start. At each knot in the path, SNAP converges locally superlinearly for the Enet criterion and achieves an optimal local convergence rate for the LASSO criterion, i.e., SNAP converges in one step at the cost of two matrix-vector multiplication per iteration. Under certain regularity conditions on the design matrix and the minimum magnitude of the nonzero elements of the target regression coefficients, we show that SNAP hits a solution with the same signs as the regression coefficients and achieves a sharp estimation error bound in finite steps with high probability. The computational complexity of SNAP is shown to be the same as that of LARS and coordinate descent algorithms per iteration. Simulation studies and real data analysis support our theoretical results and demonstrate that SNAP is faster and accurate than LARS and coordinate descent algorithms.
Automatic emotion recognition is a challenging task. In this paper, we present our effort for the audio-video based sub-challenge of the Emotion Recognition in the Wild (EmotiW) 2018 challenge, which requires participants to assign a single emotion label to the video clip from the six universal emotions (Anger, Disgust, Fear, Happiness, Sad and Surprise) and Neutral. The proposed multimodal emotion recognition system takes audio, video and text information into account. Except for handcraft features, we also extract bottleneck features from deep neutral networks (DNNs) via transfer learning. Both temporal classifiers and non-temporal classifiers are evaluated to obtain the best unimodal emotion classification result. Then possibilities are extracted and passed into the Beam Search Fusion (BS-Fusion). We test our method in the EmotiW 2018 challenge and we gain promising results. Compared with the baseline system, there is a significant improvement. We achieve 60.34% accuracy on the testing dataset, which is only 1.5% lower than the winner. It shows that our method is very competitive.
Acquisition of labeled training samples for affective computing is usually costly and time-consuming, as affects are intrinsically subjective, subtle and uncertain, and hence multiple human assessors are needed to evaluate each affective sample. Particularly, for affect estimation in the 3D space of valence, arousal and dominance, each assessor has to perform the evaluations in three dimensions, which makes the labeling problem even more challenging. Many sophisticated machine learning approaches have been proposed to reduce the data labeling requirement in various other domains, but so far few have considered affective computing. This paper proposes two multi-task active learning for regression approaches, which select the most beneficial samples to label, by considering the three affect primitives simultaneously. Experimental results on the VAM corpus demonstrated that our optimal sample selection approaches can result in better estimation performance than random selection and several traditional single-task active learning approaches. Thus, they can help alleviate the data labeling problem in affective computing, i.e., better estimation performance can be obtained from fewer labeling queries.
Regression problems are pervasive in real-world applications. Generally a substantial amount of labeled samples are needed to build a regression model with good generalization ability. However, many times it is relatively easy to collect a large number of unlabeled samples, but time-consuming or expensive to label them. Active learning for regression (ALR) is a methodology to reduce the number of labeled samples, by selecting the most beneficial ones to label, instead of random selection. This paper proposes two new ALR approaches based on greedy sampling (GS). The first approach (GSy) selects new samples to increase the diversity in the output space, and the second (iGS) selects new samples to increase the diversity in both input and output spaces. Extensive experiments on 12 UCI and CMU StatLib datasets from various domains, and on 15 subjects on EEG-based driver drowsiness estimation, verified their effectiveness and robustness.
In this paper, we consider the problem of recovering a sparse signal based on penalized least squares. We develop an algorithm of primal-dual active set type for a class of nonconvex sparsity-promoting penalties, which includes $\ell^0$, bridge, smoothly clipped absolute deviation, capped $\ell^1$ and minimax concavity penalty. First we establish the existence of a global minimizer for the class of optimization problems. Then we derive a novel necessary optimality condition for the global minimizer using the associated thresholding operator. The solutions to the optimality system are coordinate-wise minimizers, and under minor conditions, they are also local minimizers. Upon introducing the dual variable, the active set can be determined from the primal and dual variables. This relation lends itself to an iterative algorithm of active set type which at each step involves updating the primal variable only on the active set and then updating the dual variable explicitly. When combined with a continuation strategy on the regularization parameter, the primal dual active set method converges globally to the underlying regression target under some regularity conditions. Extensive numerical experiments demonstrate its superior performance in efficiency and accuracy compared with the existing methods.
In this work, we pose the question of whether, by considering qualitative information such as a sample target image as input, one can produce a rendered image of scientific data that is similar to the target. The algorithm resulting from our research allows one to ask the question of whether features like those in the target image exists in a given dataset. In that way, our method is one of imagery query or reverse engineering, as opposed to manual parameter tweaking of the full visualization pipeline. For target images, we can use real-world photographs of physical phenomena. Our method leverages deep neural networks and evolutionary optimization. Using a trained similarity function that measures the difference between renderings of a phenomenon and real-world photographs, our method optimizes rendering parameters. We demonstrate the efficacy of our method using a superstorm simulation dataset and images found online. We also discuss a parallel implementation of our method, which was run on NCSA's Blue Waters.
We propose an algorithm, semismooth Newton coordinate descent (SNCD), for the elastic-net penalized Huber loss regression and quantile regression in high dimensional settings. Unlike existing coordinate descent type algorithms, the SNCD updates each regression coefficient and its corresponding subgradient simultaneously in each iteration. It combines the strengths of the coordinate descent and the semismooth Newton algorithm, and effectively solves the computational challenges posed by dimensionality and nonsmoothness. We establish the convergence properties of the algorithm. In addition, we present an adaptive version of the "strong rule" for screening predictors to gain extra efficiency. Through numerical experiments, we demonstrate that the proposed algorithm is very efficient and scalable to ultra-high dimensions. We illustrate the application via a real data example.