Manipulating visual attributes of images through human-written text is a very challenging task. On the one hand, models have to learn the manipulation without the ground truth of the desired output. On the other hand, models have to deal with the inherent ambiguity of natural language. Previous research usually requires either the user to describe all the characteristics of the desired image or to use richly-annotated image captioning datasets. In this work, we propose a novel unsupervised approach, based on image-to-image translation, that alters the attributes of a given image through a command-like sentence such as "change the hair color to black". Contrarily to state-of-the-art approaches, our model does not require a human-annotated dataset nor a textual description of all the attributes of the desired image, but only those that have to be modified. Our proposed model disentangles the image content from the visual attributes, and it learns to modify the latter using the textual description, before generating a new image from the content and the modified attribute representation. Because text might be inherently ambiguous (blond hair may refer to different shadows of blond, e.g. golden, icy, sandy), our method generates multiple stochastic versions of the same translation. Experiments show that the proposed model achieves promising performances on two large-scale public datasets: CelebA and CUB. We believe our approach will pave the way to new avenues of research combining textual and speech commands with visual attributes.
Recent trends in NLP research have raised an interest in linguistic code-switching (CS); modern approaches have been proposed to solve a wide range of NLP tasks on multiple language pairs. Unfortunately, these proposed methods are hardly generalizable to different code-switched languages. In addition, it is unclear whether a model architecture is applicable for a different task while still being compatible with the code-switching setting. This is mainly because of the lack of a centralized benchmark and the sparse corpora that researchers employ based on their specific needs and interests. To facilitate research in this direction, we propose a centralized benchmark for Linguistic Code-switching Evaluation (LinCE) that combines ten corpora covering four different code-switched language pairs (i.e., Spanish-English, Nepali-English, Hindi-English, and Modern Standard Arabic-Egyptian Arabic) and four tasks (i.e., language identification, named entity recognition, part-of-speech tagging, and sentiment analysis). As part of the benchmark centralization effort, we provide an online platform at ritual.uh.edu/lince, where researchers can submit their results while comparing with others in real-time. In addition, we provide the scores of different popular models, including LSTM, ELMo, and multilingual BERT so that the NLP community can compare against state-of-the-art systems. LinCE is a continuous effort, and we will expand it with more low-resource languages and tasks.
Recently, deep learning has been playing a central role in machine learning research and applications. Since AlexNet, increasingly more advanced networks have achieved state-of-the-art performance in computer vision, speech recognition, language processing, game playing, medical imaging, and so on. In our previous studies, we proposed quadratic/second-order neurons and deep quadratic neural networks. In a quadratic neuron, the inner product of a vector of data and the corresponding weights in a conventional neuron is replaced with a quadratic function. The resultant second-order neuron enjoys an enhanced expressive capability over the conventional neuron. However, how quadratic neurons improve the expressing capability of a deep quadratic network has not been studied up to now, preferably in relation to that of a conventional neural network. In this paper, we ask three basic questions regarding the expressive capability of a quadratic network: (1) for the one-hidden-layer network structure, is there any function that a quadratic network can approximate much more efficiently than a conventional network? (2) for the same multi-layer network structure, is there any function that can be expressed by a quadratic network but cannot be expressed with conventional neurons in the same structure? (3) Does a quadratic network give a new insight into universal approximation? Our main contributions are the three theorems shedding light upon these three questions and demonstrating the merits of a quadratic network in terms of expressive efficiency, unique capability, and compact architecture respectively.
Deep Neural Networks(DNN) have excessively advanced the field of computer vision by achieving state of the art performance in various vision tasks. These results are not limited to the field of vision but can also be seen in speech recognition and machine translation tasks. Recently, DNNs are found to poorly fail when tested with samples that are crafted by making imperceptible changes to the original input images. This causes a gap between the validation and adversarial performance of a DNN. An effective and generalizable robustness metric for evaluating the performance of DNN on these adversarial inputs is still missing from the literature. In this paper, we propose Noise Sensitivity Score (NSS), a metric that quantifies the performance of a DNN on a specific input under different forms of fix-directional attacks. An insightful mathematical explanation is provided for deeply understanding the proposed metric. By leveraging the NSS, we also proposed a skewness based dataset robustness metric for evaluating a DNN's adversarial performance on a given dataset. Extensive experiments using widely used state of the art architectures along with popular classification datasets, such as MNIST, CIFAR-10, CIFAR-100, and ImageNet, are used to validate the effectiveness and generalization of our proposed metrics. Instead of simply measuring a DNN's adversarial robustness in the input domain, as previous works, the proposed NSS is built on top of insightful mathematical understanding of the adversarial attack and gives a more explicit explanation of the robustness.
Sequence-to-sequence (Seq2seq) models have played an important role in the recent success of various natural language processing methods, such as machine translation, text summarization, and speech recognition. However, current Seq2seq models have trouble preserving global latent information from a long sequence of words. Variational autoencoder (VAE) alleviates this problem by learning a continuous semantic space of the input sentence. However, it does not solve the problem completely. In this paper, we propose a new recurrent neural network (RNN)-based Seq2seq model, RNN semantic variational autoencoder (RNN--SVAE), to better capture the global latent information of a sequence of words. To reflect the meaning of words in a sentence properly, without regard to its position within the sentence, we construct a document information vector using the attention information between the final state of the encoder and every prior hidden state. Then, the mean and standard deviation of the continuous semantic space are learned by using this vector to take advantage of the variational method. By using the document information vector to find the semantic space of the sentence, it becomes possible to better capture the global latent feature of the sentence. Experimental results of three natural language tasks (i.e., language modeling, missing word imputation, paraphrase identification) confirm that the proposed RNN--SVAE yields higher performance than two benchmark models.
Trans-dimensional random field language models (TRF LMs) have recently been introduced, where sentences are modeled as a collection of random fields. The TRF approach has been shown to have the advantages of being computationally more efficient in inference than LSTM LMs with close performance and being able to flexibly integrating rich features. In this paper we propose neural TRFs, beyond of the previous discrete TRFs that only use linear potentials with discrete features. The idea is to use nonlinear potentials with continuous features, implemented by neural networks (NNs), in the TRF framework. Neural TRFs combine the advantages of both NNs and TRFs. The benefits of word embedding, nonlinear feature learning and larger context modeling are inherited from the use of NNs. At the same time, the strength of efficient inference by avoiding expensive softmax is preserved. A number of technical contributions, including employing deep convolutional neural networks (CNNs) to define the potentials and incorporating the joint stochastic approximation (JSA) strategy in the training algorithm, are developed in this work, which enable us to successfully train neural TRF LMs. Various LMs are evaluated in terms of speech recognition WERs by rescoring the 1000-best lists of WSJ'92 test data. The results show that neural TRF LMs not only improve over discrete TRF LMs, but also perform slightly better than LSTM LMs with only one fifth of parameters and 16x faster inference efficiency.
Recognizer Output Voting Error Reduction (ROVER) has been widely used for system combination in automatic speech recognition (ASR). In order to select the most appropriate words to insert at each position in the output transcriptions, some ROVER extensions rely on critical information such as confidence scores and other ASR decoder features. This information, which is not always available, highly depends on the decoding process and sometimes tends to over estimate the real quality of the recognized words. In this paper we propose a novel variant of ROVER that takes advantage of ASR quality estimation (QE) for ranking the transcriptions at "segment level" instead of: i) relying on confidence scores, or ii) feeding ROVER with randomly ordered hypotheses. We first introduce an effective set of features to compensate for the absence of ASR decoder information. Then, we apply QE techniques to perform accurate hypothesis ranking at segment-level before starting the fusion process. The evaluation is carried out on two different tasks, in which we respectively combine hypotheses coming from independent ASR systems and multi-microphone recordings. In both tasks, it is assumed that the ASR decoder information is not available. The proposed approach significantly outperforms standard ROVER and it is competitive with two strong oracles that e xploit prior knowledge about the real quality of the hypotheses to be combined. Compared to standard ROVER, the abs olute WER improvements in the two evaluation scenarios range from 0.5% to 7.3%.
Building robust and general dialogue models for spoken conversations is challenging due to the gap in distributions of spoken and written data. This paper presents our approach to build generalized models for the Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations Challenge of DSTC-10. In order to mitigate the discrepancies between spoken and written text, we mainly employ extensive data augmentation strategies on written data, including artificial error injection and round-trip text-speech transformation. To train robust models for spoken conversations, we improve pre-trained language models, and apply ensemble algorithms for each sub-task. Typically, for the detection task, we fine-tune \roberta and ELECTRA, and run an error-fixing ensemble algorithm. For the selection task, we adopt a two-stage framework that consists of entity tracking and knowledge ranking, and propose a multi-task learning method to learn multi-level semantic information by domain classification and entity selection. For the generation task, we adopt a cross-validation data process to improve pre-trained generative language models, followed by a consensus decoding algorithm, which can add arbitrary features like relative \rouge metric, and tune associated feature weights toward \bleu directly. Our approach ranks third on the objective evaluation and second on the final official human evaluation.
As transfer learning from large-scale pre-trained language models has become prevalent in Natural Language Processing, running these models in computationally constrained environments remains a challenging problem yet to address. Several solutions including knowledge distillation, network quantization or network pruning have been proposed; however, these approaches focus mostly on the English language, thus widening the gap when considering low-resource languages. In this work, we introduce three light and fast versions of distilled BERT models for the Romanian language: Distil-BERT-base-ro, Distil-RoBERT-base and DistilMulti-BERT-base-ro. The first two models resulted from individually distilling the knowledge of the two base versions of Romanian BERTs available in literature, while the last one was obtained by distilling their ensemble. To our knowledge, this is the first attempt to create publicly available Romanian distilled BERT models, which were thoroughly evaluated on five tasks: part-of-speech tagging, named entity recognition, sentiment analysis, semantic textual similarity and dialect identification. The experimental results on these benchmarks proved that our three distilled models maintain most performance in terms of accuracy with their teachers, while being twice as fast on a GPU and ~35\% smaller. In addition, we further test the similarity between our students and their teachers prediction by measuring their label and probability loyalty, together with regression loyalty - a new metric introduced in this work.
Deep generative models applied to audio have improved by a large margin the state-of-the-art in many speech and music related tasks. However, as raw waveform modelling remains an inherently difficult task, audio generative models are either computationally intensive, rely on low sampling rates, are complicated to control or restrict the nature of possible signals. Among those models, Variational AutoEncoders (VAE) give control over the generation by exposing latent variables, although they usually suffer from low synthesis quality. In this paper, we introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis. We introduce a novel two-stage training procedure, namely representation learning and adversarial fine-tuning. We show that using a post-training analysis of the latent space allows a direct control between the reconstruction fidelity and the representation compactness. By leveraging a multi-band decomposition of the raw waveform, we show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU. We evaluate synthesis quality using both quantitative and qualitative subjective experiments and show the superiority of our approach compared to existing models. Finally, we present applications of our model for timbre transfer and signal compression. All of our source code and audio examples are publicly available.