Referring image segmentation aims to predict the foreground mask of the object referred by a natural language sentence. Multimodal context of the sentence is crucial to distinguish the referent from the background. Existing methods either insufficiently or redundantly model the multimodal context.To tackle this problem, we propose a "gather-propagate-distribute" scheme to model multimodal context by cross-modal interaction and implement this scheme as a novel Linguistic Structure guided Context Modeling (LSCM) module. Our LSCM module builds a Dependency Parsing Tree suppressed Word Graph (DPT-WG) which guides all the words to include valid multimodal context of the sentence while excluding disturbing ones through three steps over the multimodal feature, i.e., gathering, constrained propagation and distributing. Extensive experiments on four benchmarks demonstrate that our method outperforms all the previous state-of-the-arts.
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression. Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities, but usually fail to explore informative words of the expression to well align features from the two modalities for accurately identifying the referred entity. In this paper, we propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task. Concretely, the CMPC module first employs entity and attribute words to perceive all the related entities that might be considered by the expression. Then, the relational words are adopted to highlight the correct entity as well as suppress other irrelevant ones by multimodal graph reasoning. In addition to the CMPC module, we further leverage a simple yet effective TGFE module to integrate the reasoned multimodal features from different levels with the guidance of textual information. In this way, features from multi-levels could communicate with each other and be refined based on the textual context. We conduct extensive experiments on four popular referring segmentation benchmarks and achieve new state-of-the-art performances.
Document-level Sentiment Analysis (DSA) is more challenging due to vague semantic links and complicate sentiment information. Recent works have been devoted to leveraging text summarization and have achieved promising results. However, these summarization-based methods did not take full advantage of the summary including ignoring the inherent interactions between the summary and document. As a result, they limited the representation to express major points in the document, which is highly indicative of the key sentiment. In this paper, we study how to effectively generate a discriminative representation with explicit subject patterns and sentiment contexts for DSA. A Hierarchical Interaction Networks (HIN) is proposed to explore bidirectional interactions between the summary and document at multiple granularities and learn subject-oriented document representations for sentiment classification. Furthermore, we design a Sentiment-based Rethinking mechanism (SR) by refining the HIN with sentiment label information to learn a more sentiment-aware document representation. We extensively evaluate our proposed models on three public datasets. The experimental results consistently demonstrate the effectiveness of our proposed models and show that HIN-SR outperforms various state-of-the-art methods.
While several state-of-the-art approaches to dialogue state tracking (DST) have shown promising performances on several benchmarks, there is still a significant performance gap between seen slot values (i.e., values that occur in both training set and test set) and unseen ones (values that occur in training set but not in test set). Recently, the copy-mechanism has been widely used in DST models to handle unseen slot values, which copies slot values from user utterance directly. In this paper, we aim to find out the factors that influence the generalization ability of a common copy-mechanism model for DST. Our key observations include: 1) the copy-mechanism tends to memorize values rather than infer them from contexts, which is the primary reason for unsatisfactory generalization performance; 2) greater diversity of slot values in the training set increase the performance on unseen values but slightly decrease the performance on seen values. Moreover, we propose a simple but effective algorithm of data augmentation to train copy-mechanism models, which augments the input dataset by copying user utterances and replacing the real slot values with randomly generated strings. Users could use two hyper-parameters to realize a trade-off between the performances on seen values and unseen ones, as well as a trade-off between overall performance and computational cost. Experimental results on three widely used datasets (WoZ 2.0, DSTC2, and Multi-WoZ 2.0) show the effectiveness of our approach.
Automatically labeling multiple styles for every song is a comprehensive application in all kinds of music websites. Recently, some researches explore review-driven multi-label music style classification and exploit style correlations for this task. However, their methods focus on mining the statistical relations between different music styles and only consider shallow style relations. Moreover, these statistical relations suffer from the underfitting problem because some music styles have little training data. To tackle these problems, we propose a novel knowledge relations integrated framework (KRF) to capture the complete style correlations, which jointly exploits the inherent relations between music styles according to external knowledge and their statistical relations. Based on the two types of relations, we use graph convolutional network to learn the deep correlations between styles automatically. Experimental results show that our framework significantly outperforms state-of-the-art methods. Further studies demonstrate that our framework can effectively alleviate the underfitting problem and learn meaningful style correlations.
The development of social media has revolutionized the way people communicate, share information and make decisions, but it also provides an ideal platform for publishing and spreading rumors. Existing rumor detection methods focus on finding clues from text content, user profiles, and propagation patterns. However, the local semantic relation and global structural information in the message propagation graph have not been well utilized by previous works. In this paper, we present a novel global-local attention network (GLAN) for rumor detection, which jointly encodes the local semantic and global structural information. We first generate a better integrated representation for each source tweet by fusing the semantic information of related retweets with the attention mechanism. Then, we model the global relationships among all source tweets, retweets, and users as a heterogeneous graph to capture the rich structural information for rumor detection. We conduct experiments on three real-world datasets, and the results demonstrate that GLAN significantly outperforms the state-of-the-art models in both rumor detection and early detection scenarios.
Opinion spam has become a widespread problem in social media, where hired spammers write deceptive reviews to promote or demote products to mislead the consumers for profit or fame. Existing works mainly focus on manually designing discrete textual or behavior features, which cannot capture complex semantics of reviews. Although recent works apply deep learning methods to learn review-level semantic features, their models ignore the impact of the user-level and product-level information on learning review semantics and the inherent user-review-product relationship information. In this paper, we propose a Hierarchical Fusion Attention Network (HFAN) to automatically learn the semantics of reviews from the user and product level. Specifically, we design a multi-attention unit to extract user(product)-related review information. Then, we use orthogonal decomposition and fusion attention to learn a user, review, and product representation from the review information. Finally, we take the review as a relation between user and product entity and apply TransH to jointly encode this relationship into review representation. Experimental results obtained more than 10\% absolute precision improvement over the state-of-the-art performances on four real-world datasets, which show the effectiveness and versatility of the model.
This paper focuses on the task of generating long structured sentences with explicit discourse markers, by proposing a new task Sentence Transfer and a novel model architecture TransSent. Previous works on text generation fused semantic and structure information in one mixed hidden representation. However, the structure was difficult to maintain properly when the generated sentence became longer. In this work, we explicitly separate the modeling process of semantic information and structure information. Intuitively, humans produce long sentences by directly connecting discourses with discourse markers like and, but, etc. We thus define a new task called Sentence Transfer. This task represents a long sentence as (head discourse, discourse marker, tail discourse) and aims at tail discourse generation based on head discourse and discourse marker. Then, by connecting original head discourse and generated tail discourse with a discourse marker, we generate a long structured sentence. We also propose a model architecture called TransSent, which models relations between two discourses by interpreting them as transferring from one discourse to the other in the embedding space. Experiment results show that our model achieves better performance in automatic evaluations, and can generate structured sentences with high quality. The datasets can be accessed by https://github.com/1024er/TransSent dataset.
This paper focuses on the task of sentiment transfer on non-parallel text, which modifies sentiment attributes (e.g., positive or negative) of sentences while preserving their attribute-independent content. Due to the limited capability of RNNbased encoder-decoder structure to capture deep and long-range dependencies among words, previous works can hardly generate satisfactory sentences from scratch. When humans convert the sentiment attribute of a sentence, a simple but effective approach is to only replace the original sentimental tokens in the sentence with target sentimental expressions, instead of building a new sentence from scratch. Such a process is very similar to the task of Text Infilling or Cloze, which could be handled by a deep bidirectional Masked Language Model (e.g. BERT). So we propose a two step approach "Mask and Infill". In the mask step, we separate style from content by masking the positions of sentimental tokens. In the infill step, we retrofit MLM to Attribute Conditional MLM, to infill the masked positions by predicting words or phrases conditioned on the context1 and target sentiment. We evaluate our model on two review datasets with quantitative, qualitative, and human evaluations. Experimental results demonstrate that our models improve state-of-the-art performance.
Imbalanced data commonly exists in real world, espacially in sentiment-related corpus, making it difficult to train a classifier to distinguish latent sentiment in text data. We observe that humans often express transitional emotion between two adjacent discourses with discourse markers like "but", "though", "while", etc, and the head discourse and the tail discourse 3 usually indicate opposite emotional tendencies. Based on this observation, we propose a novel plug-and-play method, which first samples discourses according to transitional discourse markers and then validates sentimental polarities with the help of a pretrained attention-based model. Our method increases sample diversity in the first place, can serve as a upstream preprocessing part in data augmentation. We conduct experiments on three public sentiment datasets, with several frequently used algorithms. Results show that our method is found to be consistently effective, even in highly imbalanced scenario, and easily be integrated with oversampling method to boost the performance on imbalanced sentiment classification.