We study the task of conversational fashion image retrieval via multiturn natural language feedback. Most previous studies are based on single-turn settings. Existing models on multiturn conversational fashion image retrieval have limitations, such as employing traditional models, and leading to ineffective performance. We propose a novel framework that can effectively handle conversational fashion image retrieval with multiturn natural language feedback texts. One characteristic of the framework is that it searches for candidate images based on exploitation of the encoded reference image and feedback text information together with the conversation history. Furthermore, the image fashion attribute information is leveraged via a mutual attention strategy. Since there is no existing fashion dataset suitable for the multiturn setting of our task, we derive a large-scale multiturn fashion dataset via additional manual annotation efforts on an existing single-turn dataset. The experiments show that our proposed model significantly outperforms existing state-of-the-art methods.
We propose PIGLeT: a model that learns physical commonsense knowledge through interaction, and then uses this knowledge to ground language. We factorize PIGLeT into a physical dynamics model, and a separate language model. Our dynamics model learns not just what objects are but also what they do: glass cups break when thrown, plastic ones don't. We then use it as the interface to our language model, giving us a unified model of linguistic form and grounded meaning. PIGLeT can read a sentence, simulate neurally what might happen next, and then communicate that result through a literal symbolic representation, or natural language. Experimental results show that our model effectively learns world dynamics, along with how to communicate them. It is able to correctly forecast "what happens next" given an English sentence over 80% of the time, outperforming a 100x larger, text-to-text approach by over 10%. Likewise, its natural language summaries of physical interactions are also judged by humans as more accurate than LM alternatives. We present comprehensive analysis showing room for future work.
Neural topic models have triggered a surge of interest in extracting topics from text automatically since they avoid the sophisticated derivations in conventional topic models. However, scarce neural topic models incorporate the word relatedness information captured in word embedding into the modeling process. To address this issue, we propose a novel topic modeling approach, called Variational Gaussian Topic Model (VaGTM). Based on the variational auto-encoder, the proposed VaGTM models each topic with a multivariate Gaussian in decoder to incorporate word relatedness. Furthermore, to address the limitation that pre-trained word embeddings of topic-associated words do not follow a multivariate Gaussian, Variational Gaussian Topic Model with Invertible neural Projections (VaGTM-IP) is extended from VaGTM. Three benchmark text corpora are used in experiments to verify the effectiveness of VaGTM and VaGTM-IP. The experimental results show that VaGTM and VaGTM-IP outperform several competitive baselines and obtain more coherent topics.
Relational knowledge bases (KBs) are established tools for world knowledge representation in machines. While they are advantageous for their precision and interpretability, they usually sacrifice some data modeling flexibility for these advantages because they adhere to a manually engineered schema. In this review, we take a natural language processing perspective to the limitations of KBs, examining how they may be addressed in part by training neural contextual language models (LMs) to internalize and express relational knowledge in free-text form. We propose a novel taxonomy for relational knowledge representation in contextual LMs based on the level of KB supervision provided, considering both works that probe LMs for implicit relational knowledge acquired during self-supervised pretraining on unstructured text alone, and works that explicitly supervise LMs at the level of KB entities and/or relations. We conclude that LMs and KBs are complementary representation tools, as KBs provide a high standard of factual precision which can in turn be flexibly and expressively modeled by LMs, and provide suggestions for future research in this direction.
Speech-based image retrieval has been studied as a proxy for joint representation learning, usually without emphasis on retrieval itself. As such, it is unclear how well speech-based retrieval can work in practice -- both in an absolute sense and versus alternative strategies that combine automatic speech recognition (ASR) with strong text encoders. In this work, we extensively study and expand choices of encoder architectures, training methodology (including unimodal and multimodal pretraining), and other factors. Our experiments cover different types of speech in three datasets: Flickr Audio, Places Audio, and Localized Narratives. Our best model configuration achieves large gains over state of the art, e.g., pushing recall-at-one from 21.8% to 33.2% for Flickr Audio and 27.6% to 53.4% for Places Audio. We also show our best speech-based models can match or exceed cascaded ASR-to-text encoding when speech is spontaneous, accented, or otherwise hard to automatically transcribe.
This paper describes our approach to the Toxic Spans Detection problem (SemEval-2021 Task 5). We propose BERToxic, a system that fine-tunes a pre-trained BERT model to locate toxic text spans in a given text and utilizes additional post-processing steps to refine the boundaries. The post-processing steps involve (1) labeling character offsets between consecutive toxic tokens as toxic and (2) assigning a toxic label to words that have at least one token labeled as toxic. Through experiments, we show that these two post-processing steps improve the performance of our model by 4.16% on the test set. We also studied the effects of data augmentation and ensemble modeling strategies on our system. Our system significantly outperformed the provided baseline and achieved an F1-score of 0.683, placing Lone Pine in the 17th place out of 91 teams in the competition. Our code is made available at https://github.com/Yakoob-Khan/Toxic-Spans-Detection
The Iranian Persian language has two varieties: standard and colloquial. Most natural language processing tools for Persian assume that the text is in standard form: this assumption is wrong in many real applications especially web content. This paper describes a simple and effective standardization approach based on sequence-to-sequence translation. We design an algorithm for generating artificial parallel colloquial-to-standard data for learning a sequence-to-sequence model. Moreover, we annotate a publicly available evaluation data consisting of 1912 sentences from a diverse set of domains. Our intrinsic evaluation shows a higher BLEU score of 62.8 versus 61.7 compared to an off-the-shelf rule-based standardization model in which the original text has a BLEU score of 46.4. We also show that our model improves English-to-Persian machine translation in scenarios for which the training data is from colloquial Persian with 1.4 absolute BLEU score difference in the development data, and 0.8 in the test data.
The paper presents our proposed solutions for the MediaEval 2020 Flood-Related Multimedia Task, which aims to analyze and detect flooding events in multimedia content shared over Twitter. In total, we proposed four different solutions including a multi-modal solution combining textual and visual information for the mandatory run, and three single modal image and text-based solutions as optional runs. In the multimodal method, we rely on a supervised multimodal bitransformer model that combines textual and visual features in an early fusion, achieving a micro F1-score of .859 on the development data set. For the text-based flood events detection, we use a transformer network (i.e., pretrained Italian BERT model) achieving an F1-score of .853. For image-based solutions, we employed multiple deep models, pre-trained on both, the ImageNet and places data sets, individually and combined in an early fusion achieving F1-scores of .816 and .805 on the development set, respectively.
While lexico-semantic elements no doubt capture a large amount of linguistic information, it has been argued that they do not capture all information contained in text. This assumption is central to constructionist approaches to language which argue that language consists of constructions, learned pairings of a form and a function or meaning that are either frequent or have a meaning that cannot be predicted from its component parts. BERT's training objectives give it access to a tremendous amount of lexico-semantic information, and while BERTology has shown that BERT captures certain important linguistic dimensions, there have been no studies exploring the extent to which BERT might have access to constructional information. In this work we design several probes and conduct extensive experiments to answer this question. Our results allow us to conclude that BERT does indeed have access to a significant amount of information, much of which linguists typically call constructional information. The impact of this observation is potentially far-reaching as it provides insights into what deep learning methods learn from text, while also showing that information contained in constructions is redundantly encoded in lexico-semantics.