Despite substantial recent progress in document summarization, factual inconsistencies between generated summaries and the source text still occur. This paper proposes a prefix-tuning-based approach that uses a set of trainable continuous prefix prompts together with discrete prompts to aid model generation, yielding significant improvements on both CNN/Daily Mail and XSum summaries generated with GPT-2. The improvements in fact preservation in the generated summaries indicate the effectiveness of this prefix-tuning-based method for knowledge-enhanced document summarization, and also suggest its potential for other natural language processing tasks.
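As a concrete illustration of the mechanism this abstract describes, the sketch below prepends trainable key/value prefixes to a frozen GPT-2 through the HuggingFace past_key_values interface (assuming a transformers version that accepts the legacy tuple format); the class, prefix length, and initialization are our illustrative choices, not the authors' released code.

\begin{verbatim}
# Minimal prefix-tuning sketch (illustrative, not the paper's code).
# A frozen GPT-2 receives trainable key/value prefixes per layer.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class PrefixTuningGPT2(nn.Module):
    def __init__(self, prefix_len=20, model_name="gpt2"):
        super().__init__()
        self.gpt2 = GPT2LMHeadModel.from_pretrained(model_name)
        for p in self.gpt2.parameters():   # freeze the language model
            p.requires_grad = False
        cfg = self.gpt2.config
        self.prefix_len = prefix_len
        head_dim = cfg.n_embd // cfg.n_head
        # Trainable continuous prefix: a key and a value vector per
        # layer, head, and prefix position.
        self.prefix = nn.Parameter(
            torch.randn(cfg.n_layer, 2, cfg.n_head,
                        prefix_len, head_dim) * 0.02)

    def forward(self, input_ids, labels=None):
        bsz = input_ids.size(0)
        past = [(kv[0].expand(bsz, -1, -1, -1),
                 kv[1].expand(bsz, -1, -1, -1)) for kv in self.prefix]
        mask = torch.ones(bsz, self.prefix_len + input_ids.size(1),
                          device=input_ids.device)
        return self.gpt2(input_ids=input_ids, past_key_values=past,
                         attention_mask=mask, labels=labels)
\end{verbatim}

Only the prefix parameter receives gradients, so fine-tuning for summarization touches a tiny fraction of the model's weights.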
Recent mask proposal models have significantly improved the performance of zero-shot semantic segmentation. However, the use of a `background' embedding during training in these methods is problematic, as the resulting model tends to over-learn and assign all unseen classes to the background class instead of their correct labels. Furthermore, they ignore the semantic relationships among text embeddings, which can be highly informative for zero-shot prediction since seen classes may be closely related to unseen classes. To this end, this paper proposes novel class enhancement losses that bypass the use of the background embedding during training and simultaneously exploit the semantic relationship between text embeddings and mask proposals by ranking their similarity scores. To further capture the relationship between seen and unseen classes, we propose an effective pseudo-label generation pipeline using a pretrained vision-language model. Extensive experiments on several benchmark datasets show that our method achieves the best overall performance for zero-shot semantic segmentation. Our method is flexible and can also be applied to the challenging open-vocabulary semantic segmentation problem.
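One way to rank similarity scores between mask proposals and class text embeddings without a background class is a margin-based ranking loss, sketched below; this is our simplification of the idea, not the paper's exact loss formulation.

\begin{verbatim}
# Illustrative sketch of ranking mask-proposal/text similarities
# without a learned `background' embedding (not the paper's losses).
import torch
import torch.nn.functional as F

def class_ranking_loss(mask_emb, text_emb, targets, margin=0.2):
    """mask_emb: (N, d) mask-proposal embeddings; text_emb: (C, d)
    text embeddings of the seen classes; targets: (N,) true class."""
    sim = F.normalize(mask_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T
    pos = sim.gather(1, targets[:, None])       # score of the true class
    keep = torch.ones_like(sim).scatter_(1, targets[:, None], 0.0)
    # Every other class should rank at least `margin` below the true one.
    violations = F.relu(margin - pos + sim) * keep
    return violations.sum() / keep.sum()
\end{verbatim}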
Contrary to its original interpretation as a facilitator of knowledge transfer from one model to another, some recent studies have suggested that knowledge distillation (KD) is instead a form of regularization. Perhaps the strongest support of all for this claim is found in its apparent similarities with label smoothing (LS). This paper investigates the stated equivalence of these two methods by examining the predictive uncertainties of the models they train. Experiments on four text classification tasks involving teachers and students of different capacities show that: (a) In most settings, KD and LS drive model uncertainty (entropy) in completely opposite directions, and (b) In KD, the student's predictive uncertainty is a direct function of that of its teacher, reinforcing the knowledge transfer view.
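For readers unfamiliar with the two objectives being contrasted, the sketch below gives their standard formulations together with the predictive entropy used to compare them; the temperature, weight, and smoothing values are the usual hyperparameters, not values from the paper.

\begin{verbatim}
# Standard KD and LS objectives plus the predictive entropy measure.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets come from the teacher, softened by temperature T.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    return alpha * soft + (1 - alpha) * F.cross_entropy(student_logits,
                                                        labels)

def ls_loss(logits, labels, eps=0.1):
    # Label smoothing mixes the one-hot target with the uniform dist.
    return F.cross_entropy(logits, labels, label_smoothing=eps)

def predictive_entropy(logits):
    p = F.softmax(logits, dim=-1)
    return -(p * p.clamp_min(1e-12).log()).sum(-1).mean()
\end{verbatim}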
In this paper, we explore the application of language and speech technology to open-ended questions in a Dutch panel survey. In an experimental wave, respondents could choose to answer open-ended questions via speech or keyboard. Automatic speech recognition (ASR) was used to process the spoken responses. We evaluated answers from both input modalities to investigate differences between spoken and typed answers. We report the errors the ASR system produces and investigate the impact of these errors on downstream analyses. Open-ended questions give respondents more freedom in answering, but entail a non-trivial amount of work to analyse. We evaluated the feasibility of using transformer-based models (e.g. BERT) for sentiment analysis and topic modelling on the answers to open-ended questions. A big advantage of transformer-based models is that they are trained on large amounts of language material and do not necessarily need training on the target material. This is especially advantageous for survey data, which typically contains little text. We tested the quality of automatic sentiment analysis by comparing automatic labeling with three human raters, and tested the robustness of topic modelling by comparing models generated from automatically and manually transcribed spoken answers.
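A minimal sketch of applying an off-the-shelf transformer sentiment classifier to ASR transcripts is shown below; the checkpoint name is one publicly available multilingual sentiment model and the Dutch examples are invented, so neither is necessarily what the paper used.

\begin{verbatim}
# Off-the-shelf sentiment analysis over (example) ASR transcripts.
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment")

asr_answers = [
    "Ik ben erg tevreden met de enquête.",   # example ASR outputs
    "De vragen waren onduidelijk.",
]
for answer, result in zip(asr_answers, sentiment(asr_answers)):
    print(f"{result['label']:>8}  {result['score']:.2f}  {answer}")
\end{verbatim}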
In recent years, generative machine learning systems have advanced significantly. A current wave of generative systems uses text prompts to create complex imagery, video, and even 3D datasets. The creators of these systems claim a revolution in bringing creativity and art to anyone who can type a prompt. In this position paper, we question the basis for these claims, dividing our analysis into three areas: the limitations of linguistic descriptions, the implications of the dataset, and, lastly, matters of materiality and embodiment. We conclude with an analysis of the creative possibilities enabled by prompt-based systems, asking whether they can be considered a new artistic medium.
Disentangled representation learning remains challenging, as ground-truth factors of variation do not naturally exist. To address this, we present Vocabulary Disentanglement Retrieval~(VDR), a simple yet effective retrieval-based disentanglement framework that leverages natural language as distant supervision. Our approach is built upon the widely used bi-encoder architecture with disentanglement heads and is trained on data-text pairs that are readily available on the web or in existing datasets. This makes our approach task- and modality-agnostic, with potential for a wide range of downstream applications. We conduct experiments on 16 datasets in both text-to-text and cross-modal scenarios and evaluate VDR in a zero-shot setting. With the incorporation of disentanglement heads and a minor increase in parameters, VDR achieves significant improvements over the base retriever it is built upon, with 9% higher NDCG@10 scores in zero-shot text-to-text retrieval and an average of 13% higher recall in cross-modal retrieval. VDR also outperforms other baselines on most tasks, while improving explainability and efficiency.
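The sketch below shows one way a bi-encoder can be given a vocabulary-level head so that each output coordinate aligns with a word, in the spirit of the description above; it is our simplification, not the authors' released architecture.

\begin{verbatim}
# Bi-encoder with a vocabulary-aligned head (illustrative sketch).
import torch
import torch.nn as nn
from transformers import AutoModel

class VocabHeadEncoder(nn.Module):
    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)
        cfg = self.backbone.config
        # Head projects token states onto vocabulary dimensions, so
        # each output coordinate corresponds to one word.
        self.head = nn.Linear(cfg.hidden_size, cfg.vocab_size)

    def forward(self, **inputs):
        states = self.backbone(**inputs).last_hidden_state  # (B, L, H)
        logits = self.head(states)                          # (B, L, V)
        # Max-pool over tokens; ReLU keeps weights non-negative,
        # giving a sparse, interpretable vocabulary vector per input.
        return torch.relu(logits).max(dim=1).values         # (B, V)

# Relevance: inner product of query and passage vocabulary vectors.
\end{verbatim}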
In this paper we propose a new Deep Learning (DL) approach for message classification. Our method is based on state-of-the-art Natural Language Processing (NLP) building blocks, combined with a novel technique for infusing the meta-data that is typically available in messages, such as sender information, timestamps, attached images, audio, affiliations, and more. As we demonstrate throughout the paper, going beyond the mere text by leveraging all available channels in the message can yield an improved representation and higher classification accuracy. To build the message representation, each type of input is processed in a dedicated block of the neural network architecture suited to that data type. Such an implementation enables training all blocks together simultaneously and forming cross-channel features in the network. We show in the Experiments Section that in some cases a message's meta-data holds additional information that cannot be extracted from the text alone, and that using this information achieves better performance. Furthermore, we demonstrate that our multi-modality block approach outperforms other approaches for injecting the meta-data into the text classifier.
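A minimal sketch of the per-channel block design follows: each input type gets its own sub-network and all blocks are trained jointly before a fusion layer; the module names and dimensions are ours for illustration, not the paper's architecture.

\begin{verbatim}
# Per-channel blocks fused before classification (illustrative).
import torch
import torch.nn as nn

class MessageClassifier(nn.Module):
    def __init__(self, text_dim=768, meta_dim=16, n_classes=5):
        super().__init__()
        # Text block: assumes a precomputed transformer embedding here;
        # in practice this would be a full NLP encoder.
        self.text_block = nn.Sequential(nn.Linear(text_dim, 256),
                                        nn.ReLU())
        # Meta-data block: sender features, timestamps, etc.
        self.meta_block = nn.Sequential(nn.Linear(meta_dim, 32),
                                        nn.ReLU())
        # Fusion forms cross-channel features before classification.
        self.classifier = nn.Linear(256 + 32, n_classes)

    def forward(self, text_emb, meta):
        fused = torch.cat([self.text_block(text_emb),
                           self.meta_block(meta)], dim=-1)
        return self.classifier(fused)
\end{verbatim}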
Graph neural networks (GNNs) have shown high potential for a variety of challenging real-world applications, but one of the major obstacles in GNN research is the lack of large-scale, flexible datasets. Most existing public datasets for GNNs are relatively small, which limits the ability of GNNs to generalize to unseen data. The few existing large-scale graph datasets provide very limited labeled data. This makes it difficult to determine whether a GNN model's low accuracy on unseen data is inherently due to insufficient training data or whether the model failed to generalize. Additionally, datasets used to train GNNs need to offer flexibility to enable a thorough study of the impact of various factors during training. In this work, we introduce the Illinois Graph Benchmark (IGB), a research dataset tool that developers can use to train, scrutinize, and systematically evaluate GNN models with high fidelity. IGB includes both homogeneous and heterogeneous graphs of enormous size, with more than 40% of their nodes labeled. Compared to the largest publicly available graph datasets, IGB provides over 162X more labeled data for deep learning practitioners and developers to create and evaluate models with higher accuracy. The IGB dataset is designed to be flexible, enabling the study of various GNN architectures and embedding generation techniques, as well as the analysis of system performance issues. IGB is open-sourced, supports the DGL and PyG frameworks, and comes with releases of the raw text, which we believe will foster emerging language-model and GNN research. An early public version of IGB is available at https://github.com/IllinoisGraphBenchmark/IGB-Datasets.
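The hypothetical sketch below shows how a labeled node-classification graph of this kind is typically consumed in DGL; the load_igb_homogeneous loader and the toy graph are ours for illustration, so consult the IGB repository for the actual API.

\begin{verbatim}
# Hypothetical DGL usage sketch (the loader name is illustrative).
import dgl
import torch

def load_igb_homogeneous():
    # Stand-in toy graph; IGB itself ships graphs at enormous scale
    # with more than 40% of nodes labeled.
    g = dgl.graph((torch.tensor([0, 1, 2]), torch.tensor([1, 2, 0])))
    g.ndata["feat"] = torch.randn(3, 1024)
    g.ndata["label"] = torch.tensor([0, 1, 0])
    g.ndata["train_mask"] = torch.tensor([True, True, False])
    return g

g = load_igb_homogeneous()
train_ids = g.ndata["train_mask"].nonzero(as_tuple=True)[0]
print(g.num_nodes(), "nodes,", int(train_ids.numel()),
      "labeled training nodes")
\end{verbatim}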
Large-scale foundation models (e.g., CLIP) have shown promising zero-shot generalization performance on downstream tasks by leveraging carefully designed language prompts. However, despite their success, most prompt learning techniques tend to underperform in the presence of domain shift. Our study addresses this problem and, to improve CLIP's generalization ability across domains, proposes \textsc{StyLIP}, a novel approach to Domain Generalization (DG) based on a domain-agnostic prompt learning strategy. In the absence of explicit domain knowledge, we aim to disentangle the visual style and content information extracted from the pre-trained CLIP in the prompts, so they can be effortlessly adapted to novel domains at inference time. Furthermore, we employ a set of style projectors to learn the prompt tokens directly from multi-scale style features, and the generated prompt embeddings are later fused with the multi-scale visual features learned through a content projector. The projectors are trained contrastively, with CLIP's vision and text encoders kept frozen. We present extensive experiments in five different DG settings on multiple benchmarks, demonstrating that \textsc{StyLIP} consistently outperforms the relevant state-of-the-art methods.
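To make the style-projector idea concrete, the sketch below maps per-layer feature statistics of a frozen vision encoder to prompt tokens; it is our simplification of the approach, with all names and shapes chosen for illustration.

\begin{verbatim}
# Style projector sketch: feature statistics -> one prompt token.
import torch
import torch.nn as nn

class StyleProjector(nn.Module):
    """Maps (mean, std) statistics of one encoder scale to a token."""
    def __init__(self, feat_dim, prompt_dim):
        super().__init__()
        self.proj = nn.Linear(2 * feat_dim, prompt_dim)

    def forward(self, feats):          # feats: (B, L, feat_dim)
        mu = feats.mean(dim=1)         # style statistics of this scale
        sigma = feats.std(dim=1)
        return self.proj(torch.cat([mu, sigma], dim=-1))  # (B, prompt_dim)

# One projector per encoder scale; the resulting prompt tokens are fed
# to the frozen text encoder alongside the class-name embedding.
\end{verbatim}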
End-to-end speech-to-text translation models are often initialized with a pre-trained speech encoder and a pre-trained text decoder. This leads to a significant training gap between pre-training and fine-tuning, largely due to the modality differences between the speech outputs of the encoder and the text inputs to the decoder. In this work, we aim to bridge the modality gap between speech and text to improve translation quality. We propose M-Adapter, a novel Transformer-based module, to adapt speech representations to text. While shrinking the speech sequence, M-Adapter produces features suited to speech-to-text translation by modelling the global and local dependencies of a speech sequence. Our experimental results show that our model outperforms a strong baseline by up to 1 BLEU point on the MuST-C En$\rightarrow$De dataset.\footnote{Our code is available at https://github.com/mingzi151/w2v2-st.}
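The sketch below is a rough analogue of such an adapter: a strided convolution shrinks the sequence and captures local context, while self-attention models global dependencies; it is our simplification, not the released M-Adapter implementation.

\begin{verbatim}
# Simplified speech-to-text adapter sketch (not the released code).
import torch
import torch.nn as nn

class SpeechTextAdapter(nn.Module):
    def __init__(self, dim=768, stride=2, heads=8):
        super().__init__()
        # Strided convolution shrinks the sequence, local context.
        self.shrink = nn.Conv1d(dim, dim, kernel_size=3,
                                stride=stride, padding=1)
        # Self-attention models global dependencies afterwards.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):              # x: (B, T, dim) speech features
        x = self.shrink(x.transpose(1, 2)).transpose(1, 2)  # (B, ~T/2, dim)
        attn_out, _ = self.attn(x, x, x)
        return self.norm(x + attn_out)  # ready for the text decoder
\end{verbatim}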