In this paper, we study state-of-the-art methods for single- and multi-object text-to-image synthesis and propose a common framework for evaluating them. We first identify several common issues in the current evaluation of text-to-image models: (i) a commonly used metric for image quality assessment, e.g., Inception Score (IS), is often either miscalibrated for the single-object case or misused for the multi-object case; (ii) the existing R-precision (RP) and SOA metrics, which assess text relevance and object accuracy, respectively, suffer from overfitting; (iii) many vital factors in the evaluation of the multi-object case are largely overlooked, e.g., object fidelity, positional alignment, and counting alignment; (iv) the ranking of the methods based on current metrics is highly inconsistent with real images. To overcome these limitations, we propose a combined set of existing and new metrics to systematically evaluate the methods. For existing metrics, we develop an improved version of IS named IS* by using temperature scaling to calibrate the confidence of the classifier used by IS; we also propose a solution to mitigate the overfitting issues of RP and SOA. To cover the evaluation factors missing in the multi-object case, we develop a set of new metrics: CA for counting alignment, PA for positional alignment, and object-centric IS (O-IS) and object-centric FID (O-FID) for object fidelity. As a result, our benchmark yields a highly consistent ranking among existing methods that is well aligned with human evaluation. We also create a strong baseline model (AttnGAN++) for the benchmark through a simple modification of the well-known AttnGAN. We will release this toolbox for unified evaluation, called TISE, to standardize the evaluation of text-to-image synthesis models.
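The idea of calibrating IS with temperature scaling can be illustrated with a minimal sketch. It assumes that classifier logits are already available for each generated image and that the temperature is given; how the temperature is fitted for IS* is not described above, so the `temperature` argument and both function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over one logit vector after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def inception_score(all_logits, temperature=1.0):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y).

    `all_logits` holds one logit vector per image. `temperature` is an
    assumed, externally chosen calibration constant (1.0 = no scaling).
    """
    probs = [softmax(logits, temperature) for logits in all_logits]
    n, k = len(probs), len(probs[0])
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    kl_sum = 0.0
    for p in probs:
        kl_sum += sum(
            pj * math.log(pj / marginal[j]) for j, pj in enumerate(p) if pj > 0
        )
    return math.exp(kl_sum / n)


# Three images, each confidently assigned to a different class.
logits = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
raw_score = inception_score(logits, temperature=1.0)
softened_score = inception_score(logits, temperature=5.0)
```

A temperature above 1 softens overconfident predictions and lowers the score, so the value of IS depends directly on how well the classifier's confidence is calibrated.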
We measure support for women and migrants in German political debates over the last 155 years. To do so, we (1) provide a gold standard of 1205 text snippets in context, annotated for support for our target groups, (2) train a BERT model on our annotated data, and (3) use it to infer large-scale trends. These show that support for women is stronger than support for migrants, but both have steadily increased over time. While we find hardly any direct anti-support toward women, there is more polarization when it comes to migrants. We also discuss the difficulty of annotation that results from ambiguity in political discourse and from indirectness, i.e., politicians' tendency to relate stances attributed to political opponents. Overall, our results indicate that German society, as measured through its political elite, has become fairer over time.
The explosion of e-commerce has created a need to process and analyze product titles, for example through entity typing. However, the rapid pace of e-commerce leads to the constant emergence of new entities, which general entity typing methods struggle to handle. In addition, product titles in e-commerce differ markedly in language style from general-domain text. To handle new entities in product titles and to address their distinctive language style, we propose a textual entailment model with continuous prompt tuning based hypotheses and fusion embeddings for e-commerce entity typing. First, we reformulate the entity typing task as a textual entailment problem to handle new entities that are not present during training. Second, we design a model that automatically generates textual entailment hypotheses using a continuous prompt tuning method, which produces better hypotheses without manual design. Third, we use fusion embeddings that combine BERT and CharacterBERT embeddings with a two-layer MLP classifier to address the difference in language style between e-commerce product titles and general-domain text. To analyze the effect of each contribution, we compare the performance of the entity typing and textual entailment models, and conduct ablation studies on continuous prompt tuning and fusion embeddings. We also evaluate the impact of different prompt template initializations for continuous prompt tuning. We show that our proposed model improves the average F1 score by around 2% compared to the baseline BERT entity typing model.
Large pretrained language models (LMs) like BERT have improved performance in many disparate natural language processing (NLP) tasks. However, fine-tuning such models requires a large number of training examples for each target task. At the same time, many realistic NLP problems are "few-shot", without a sufficiently large training set. In this work, we propose a novel conditional neural process-based approach for few-shot text classification that learns to transfer from other diverse tasks with rich annotation. Our key idea is to represent each task using gradient information from a base model and to train an adaptation network that modulates a text classifier conditioned on the task representation. While previous task-aware few-shot learners represent tasks by input encoding, our novel task representation is more powerful, as the gradient captures input-output relationships of a task. Experimental results show that our approach outperforms traditional fine-tuning, sequential transfer learning, and state-of-the-art meta-learning approaches on a collection of diverse few-shot tasks. We further conduct analyses and ablations to justify our design choices.
The amount of labeled data available to train models for speech tasks is limited for most languages. This data scarcity is exacerbated for speech translation, which requires labeled data covering two different languages. To address this issue, we study a simple and effective approach to build speech translation systems without labeled data by leveraging recent advances in unsupervised speech recognition, machine translation, and speech synthesis, either in a pipeline approach or to generate pseudo-labels for training end-to-end speech translation models. Furthermore, we present an unsupervised domain adaptation technique for pre-trained speech models that improves the performance of downstream unsupervised speech recognition, especially in low-resource settings. Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art by 3.2 BLEU on the Libri-Trans benchmark. On CoVoST 2, our best systems outperform the best supervised end-to-end models (without pre-training) from only two years ago by an average of 5.0 BLEU over five X-En directions. We also report competitive results on the MuST-C and CVSS benchmarks.
In the cascaded approach to spoken language translation (SLT), the ASR output is typically punctuated and segmented into sentences before being passed to MT, since the latter is typically trained on written text. However, erroneous segmentation, caused by poor sentence-final punctuation from the ASR system, degrades translation quality, especially in the simultaneous (online) setting where the input is continuously updated. To reduce the influence of automatic segmentation, we present a sliding window approach that translates raw ASR outputs (online or offline) without relying on an automatic segmenter. We train translation models using parallel windows (instead of parallel sentences) extracted from the original training data. At test time, we translate at the window level and join the translated windows using a simple approach to generate the final translation. Experiments on English-to-German and English-to-Czech show that our approach improves by 1.3--2.0 BLEU points over the usual ASR-segmenter pipeline, and that the fixed-length window considerably reduces flicker compared to a baseline retranslation-based online SLT system.
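The source-side windowing can be illustrated with a minimal sketch that cuts an unsegmented ASR token stream into fixed-length windows. The function name and the `size` and `stride` parameters are hypothetical; the extraction of parallel windows from training data and the procedure for joining translated windows are not specified above and are not reproduced here.

```python
def sliding_windows(tokens, size, stride=1):
    """Cut a token stream into fixed-length, overlapping windows.

    `size` and `stride` are illustrative parameters, not values taken
    from the text. A stream shorter than `size` gives a single window.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if len(tokens) <= size:
        return [list(tokens)]
    last_start = len(tokens) - size
    windows = [list(tokens[i:i + size]) for i in range(0, last_start + 1, stride)]
    # Add a final window when the stride would otherwise skip the tail.
    if last_start % stride != 0:
        windows.append(list(tokens[-size:]))
    return windows


# Raw ASR output has no sentence boundaries, so it is treated as one stream.
stream = "so we went home it was late".split()
windows = sliding_windows(stream, size=4, stride=2)
```

Each window would then be passed to a translation model trained on windows rather than sentences, which removes the dependence on sentence-final punctuation.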
A majorization-minimization (MM) algorithm for independent vector analysis optimizes a separation matrix $W = [w_1, \ldots, w_m]^h \in \mathbb{C}^{m \times m}$ by minimizing a surrogate function of the form $\mathcal{L}(W) = \sum_{i = 1}^m w_i^h V_i w_i - \log | \det W |^2$, where $m \in \mathbb{N}$ is the number of sensors and positive definite matrices $V_1,\ldots,V_m \in \mathbb{C}^{m \times m}$ are constructed in each MM iteration. For $m \geq 3$, no algorithm has been found to obtain a global minimum of $\mathcal{L}(W)$. Instead, block coordinate descent (BCD) methods with closed-form update formulas have been developed for minimizing $\mathcal{L}(W)$ and shown to be effective. One such BCD method, called iterative projection (IP), updates one or two rows of $W$ in each iteration. Another BCD method, called iterative source steering (ISS), updates one column of the mixing matrix $A = W^{-1}$ in each iteration. Although the time complexity per iteration of ISS is $m$ times smaller than that of IP, the conventional ISS converges more slowly than the current fastest IP (called $\text{IP}_2$), which updates two rows of $W$ in each iteration. Here we extend ISS to $\text{ISS}_2$, which can update two columns of $A$ in each iteration while maintaining its small time complexity. To this end, we provide a unified way of developing new ISS-type methods, from which $\text{ISS}_2$ as well as the conventional ISS can be immediately obtained in a systematic manner. Numerical experiments on separating reverberant speech mixtures show that our $\text{ISS}_2$ converges in fewer MM iterations than the conventional ISS and is comparable to $\text{IP}_2$.
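As a sanity check on the form of the surrogate function, the scalar case $m = 1$ can be worked out by hand. This is an illustrative derivation only, not part of the proposed method: it introduces the symbol $s = |w_1|^2$ and treats $V_1$ as a positive real number.

```latex
\begin{align*}
\mathcal{L}(w_1) &= V_1 |w_1|^2 - \log |w_1|^2 = V_1 s - \log s,
  \qquad s = |w_1|^2 > 0, \\
\frac{d\mathcal{L}}{ds} &= V_1 - \frac{1}{s} = 0
  \;\Longrightarrow\; s^\star = \frac{1}{V_1}, \\
\mathcal{L}(s^\star) &= 1 + \log V_1 .
\end{align*}
```

The quadratic term penalizes large separation filters, while the log-determinant term keeps $W$ away from singularity. For larger $m$, the same trade-off couples the rows of $W$ through $\det W$.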
In the past few years, cross-modal image-text retrieval (ITR) has attracted increasing interest in the research community due to its research value and broad real-world applications. It is designed for scenarios where the queries come from one modality and the retrieval galleries from another. This paper presents a comprehensive and up-to-date survey of ITR approaches from four perspectives. By dissecting an ITR system into two processes, feature extraction and feature alignment, we summarize recent advances in ITR approaches from these two perspectives. On top of this, efficiency-focused studies of ITR systems are introduced as the third perspective. To keep pace with recent developments, we also provide a pioneering overview of cross-modal pre-training ITR approaches as the fourth perspective. Finally, we outline the common benchmark datasets and evaluation metrics for ITR, and compare the accuracy of representative ITR approaches. Some critical yet less studied issues are discussed at the end of the paper.
We study the problem of generating counterfactual text for a classifier as a means for understanding and debugging classification. Given a textual input and a classification model, we aim to minimally alter the text to change the model's prediction. White-box approaches have been successfully applied to similar problems in vision where one can directly optimize the continuous input. Optimization-based approaches become difficult in the language domain due to the discrete nature of text. We bypass this issue by directly optimizing in the latent space and leveraging a language model to generate candidate modifications from optimized latent representations. We additionally use Shapley values to estimate the combinatoric effect of multiple changes. We then use these estimates to guide a beam search for the final counterfactual text. We achieve favorable performance compared to recent white-box and black-box baselines using human and automatic evaluations. Ablation studies show that both latent optimization and the use of Shapley values improve success rate and the quality of the generated counterfactuals.
This paper presents a project focused on the research and creation of a new Automatic Speech Recognition (ASR) system for the Chukchi language. There is no single complete corpus of the Chukchi language, so most of the work consisted of collecting audio and texts in Chukchi from open sources and processing them. We collected 21:34:23 (hh:mm:ss) of audio recordings and 112,719 sentences (or 2,068,273 words) of text in the Chukchi language. An XLSR model trained on the obtained data showed good results even with this small amount of data. Besides being a low-resource language, Chukchi is also polysynthetic, which significantly complicates any automatic processing. As a result, the usual WER metric for evaluating ASR becomes less indicative for a polysynthetic language. The CER metric, however, showed good results. The question of metrics for polysynthetic languages remains open.
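The gap between WER and CER on long words can be illustrated with a minimal sketch based on Levenshtein distance. The function names are hypothetical, and the example strings are made-up placeholders rather than Chukchi text or data from the project.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Placeholder strings: one wrong character inside one long word.
reference = "abcdefghij klm"
hypothesis = "abcdefghix klm"
word_rate = wer(reference, hypothesis)
char_rate = cer(reference, hypothesis)
```

In this toy case a single wrong character makes half of the words count as errors under WER, while CER registers only one character in fourteen. When words are long, as in a polysynthetic language, one small mistake marks the whole word as wrong, which is one reason WER can look much worse than CER.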