Jiawei Han

UIUC

Text-Augmented Open Knowledge Graph Completion via Pre-Trained Language Models

May 24, 2023

Pengcheng Jiang, Shivam Agarwal, Bowen Jin, Xuan Wang, Jimeng Sun, Jiawei Han

Abstract: The mission of open knowledge graph (KG) completion is to draw new findings from known facts. Existing works that augment KG completion require either (1) factual triples to enlarge the graph reasoning space or (2) manually designed prompts to extract knowledge from a pre-trained language model (PLM), exhibiting limited performance and requiring expensive efforts from experts. To this end, we propose TAGREAL, which automatically generates quality query prompts and retrieves support information from large text corpora to probe knowledge from PLMs for KG completion. The results show that TAGREAL achieves state-of-the-art performance on two benchmark datasets. We find that TAGREAL has superb performance even with limited training data, outperforming existing embedding-based, graph-based, and PLM-based methods.

* 18 pages, 11 figures, 8 tables. Accepted by ACL'23 Findings

Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation

May 23, 2023

Da Yin, Xiao Liu, Fan Yin, Ming Zhong, Hritik Bansal, Jiawei Han, Kai-Wei Chang

Abstract: Instruction tuning has emerged to enhance the capabilities of large language models (LLMs) in providing appropriate outputs based on input instructions. However, existing methods for collecting instruction-tuning data suffer from limitations in scalability and affordability. In this paper, we propose Dynosaur, a dynamic growth paradigm for instruction-tuning data curation. Built upon the metadata of existing NLP datasets, we generate multiple task instructions applicable to various NLP datasets and determine the relevant data fields for constructing instruction-tuning data with LLMs. Dynosaur offers several advantages: 1) lower generation costs (less than $12 for generating 800K instruction-tuning data), 2) good quality of instruction-tuning data (better performance than Alpaca and Instruction GPT-4 on Super-NI with comparable data sizes), and 3) the ability to grow dynamically by incorporating new datasets from the Huggingface Datasets Platform. We further investigate continual learning as an approach to learning with the ever-growing instruction-tuning dataset. We demonstrate that replay methods not only help mitigate forgetting issues but also help generalize better to unseen tasks. As a novel continual learning scenario for instruction tuning, selecting tasks based on instruction representations can be an effective replaying strategy. Code and data are released at https://github.com/WadeYin9712/Dynosaur.

* Work in progress

PromptClass: Weakly-Supervised Text Classification with Prompting Enhanced Noise-Robust Self-Training

May 23, 2023

Yunyi Zhang, Minhao Jiang, Yu Meng, Yu Zhang, Jiawei Han

Abstract: Recently proposed weakly-supervised text classification settings train a classifier using the label name of each target class as the only supervision. Such weakly-supervised settings have been gaining increasing attention since they can largely reduce human annotation efforts compared to fully-supervised and semi-supervised settings. Most existing methods follow a strategy that first uses the label names as static features to generate pseudo labels, which are then used for classifier training. While reasonable, such a commonly adopted framework suffers from two limitations: (1) words can have different meanings in different contexts, so using label names for context-free matching can induce very noisy pseudo labels; and (2) the errors made in the pseudo label generation stage will directly propagate to the classifier training stage without a chance of being corrected. In this paper, we propose a new method, PromptClass, consisting of two modules: (1) a pseudo label acquisition module that uses zero-shot prompting of pre-trained language models (PLMs) to get pseudo labels based on contextualized text understanding, and (2) a noise-robust self-training module that iteratively trains the classifier and updates pseudo labels by utilizing two PLM fine-tuning strategies that regularize each other. Extensive experiments show that PromptClass achieves overall better performance than existing strong baselines on four benchmark datasets and even achieves similar performance to fully-supervised classifiers on sentiment classification tasks.

OntoType: Ontology-Guided Zero-Shot Fine-Grained Entity Typing with Weak Supervision from Pre-Trained Language Models

May 21, 2023

Tanay Komarlu, Minhao Jiang, Xuan Wang, Jiawei Han

Abstract: Fine-grained entity typing (FET), which assigns context-sensitive, fine-grained semantic types to entities in text, will play an important role in natural language understanding. A supervised FET method, which typically relies on human-annotated corpora for training, is costly and difficult to scale. Recent studies leverage pre-trained language models (PLMs) to generate rich and context-aware weak supervision for FET. However, a PLM may still generate a mixture of rough and fine-grained types, or tokens unsuitable for typing. In this study, we envision that an ontology provides a semantics-rich, hierarchical structure, which will help select the best results generated by multiple PLMs and head words. Specifically, we propose a novel zero-shot, ontology-guided FET method, OntoType, which follows a type ontological structure, from coarse to fine, ensembles multiple PLM prompting results to generate a set of type candidates, and refines its type resolution under the local context with a natural language inference model. Our experiments on the Ontonotes, FIGER, and NYT datasets using their associated ontological structures demonstrate that our method outperforms the state-of-the-art zero-shot fine-grained entity typing methods. Our error analysis shows that refinement of the existing ontology structures will further improve fine-grained entity typing.

Patton: Language Model Pretraining on Text-Rich Networks

May 20, 2023

Bowen Jin, Wentao Zhang, Yu Zhang, Yu Meng, Xinyang Zhang, Qi Zhu, Jiawei Han

Abstract: A real-world text corpus sometimes comprises not only text documents but also semantic links between them (e.g., academic papers in a bibliographic network are linked by citations and co-authorships). Text documents and semantic connections form a text-rich network, which empowers a wide range of downstream tasks such as classification and retrieval. However, pretraining methods for such structures are still lacking, making it difficult to build one generic model that can be adapted to various tasks on text-rich networks. Current pretraining objectives, such as masked language modeling, purely model texts and do not take inter-document structure information into consideration. To this end, we propose Patton, our PretrAining on TexT-Rich NetwOrk framework. Patton includes two pretraining strategies, network-contextualized masked language modeling and masked node prediction, to capture the inherent dependency between textual attributes and network structure. We conduct experiments on four downstream tasks in five datasets from both academic and e-commerce domains, where Patton outperforms baselines significantly and consistently.

* ACL 2023. (Code: https://github.com/PeterGriffinJin/Patton)

Unsupervised Story Discovery from Continuous News Streams via Scalable Thematic Embedding

May 04, 2023

Susik Yoon, Dongha Lee, Yunyi Zhang, Jiawei Han

Abstract: Unsupervised discovery of stories with correlated news articles in real time helps people digest massive news streams without expensive human annotations. A common approach in existing studies of unsupervised online story discovery is to represent news articles with symbolic- or graph-based embedding and incrementally cluster them into stories. Recent large language models are expected to improve the embedding further, but a straightforward adoption of the models by indiscriminately encoding all information in articles is ineffective for dealing with text-rich and evolving news streams. In this work, we propose a novel thematic embedding with an off-the-shelf pretrained sentence encoder to dynamically represent articles and stories by considering their shared temporal themes. To realize the idea for unsupervised online story discovery, a scalable framework USTORY is introduced with two main techniques, theme- and time-aware dynamic embedding and novelty-aware adaptive clustering, fueled by lightweight story summaries. A thorough evaluation with real news data sets demonstrates that USTORY achieves higher story discovery performance than baselines while being robust and scalable to various streaming settings.

* Accepted by SIGIR'23
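
The incremental clustering step described in the USTORY abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `assign_to_story`, the running-mean centroid, and the fixed `threshold` are all assumptions made for illustration, and the theme- and time-aware embedding itself is replaced by plain, hypothetical article vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_to_story(article_vec, stories, threshold=0.5):
    """Attach an article to its most similar story, or open a new story.

    `stories` is a list of dicts holding a running-mean "centroid" and a "size".
    The fixed `threshold` is a hypothetical stand-in for an adaptive novelty test.
    Returns the index of the story the article ended up in.
    """
    best_idx, best_sim = None, -1.0
    for idx, story in enumerate(stories):
        sim = cosine(article_vec, story["centroid"])
        if sim > best_sim:
            best_idx, best_sim = idx, sim

    # Nothing similar enough: treat the article as the seed of a new story.
    if best_idx is None or best_sim < threshold:
        stories.append({"centroid": list(article_vec), "size": 1})
        return len(stories) - 1

    # Otherwise fold the article into the matched story's running mean.
    story = stories[best_idx]
    n = story["size"]
    story["centroid"] = [
        (c * n + a) / (n + 1) for c, a in zip(story["centroid"], article_vec)
    ]
    story["size"] = n + 1
    return best_idx


# Toy stream of three hypothetical article embeddings.
stories = []
labels = [assign_to_story(v, stories) for v in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])]
print(labels)  # [0, 0, 1]
```

The first two vectors point in nearly the same direction and end up in one story, while the third is orthogonal to that story's centroid and starts a new one.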

MEGClass: Text Classification with Extremely Weak Supervision via Mutually-Enhancing Text Granularities

Apr 04, 2023

Priyanka Kargupta, Tanay Komarlu, Susik Yoon, Xuan Wang, Jiawei Han

Abstract: Text classification typically requires a substantial amount of human-annotated data to serve as supervision, which is costly to obtain in dynamic emerging domains. Certain methods seek to address this problem by solely relying on the surface text of class names to serve as extremely weak supervision. However, existing methods fail to account for single-class documents discussing multiple topics. Both topic diversity and vague sentences may introduce noise into the document's underlying representation and consequently reduce the precision of the predicted class. Furthermore, current work focuses on text granularities (documents, sentences, or words) independently, which limits the degree of coarse- or fine-grained context that we can jointly extract from all three to identify significant subtext for classification. In order to address this problem, we propose MEGClass, an extremely weakly-supervised text classification method to exploit Mutually-Enhancing Text Granularities. Specifically, MEGClass constructs class-oriented sentence and class representations based on keywords for performing a sentence-level confidence-weighted label ensemble in order to estimate a document's initial class distribution. This serves as the target distribution for a multi-head attention network with a class-weighted contrastive loss. This network learns contextualized sentence representations and weights to form document representations that reflect its original document and sentence-level topic diversity. Retaining this heterogeneity allows MEGClass to select the most class-indicative documents to serve as iterative feedback for enhancing the class representations. Finally, these top documents are used to fine-tune a pre-trained text classifier. As demonstrated through extensive experiments on six benchmark datasets, MEGClass outperforms other weakly and extremely weakly supervised methods.

* Code: https://github.com/pkargupta/MEGClass/
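
To make the "sentence-level confidence-weighted label ensemble" in the MEGClass abstract concrete, here is a small sketch that is not taken from the authors' repository. It assumes each sentence already has a class probability vector, and it defines confidence as the gap between the two largest probabilities; that definition and the name `document_class_distribution` are illustrative assumptions only.

```python
def document_class_distribution(sentence_dists):
    """Combine per-sentence class distributions into one document distribution.

    Each sentence is weighted by a confidence score, here assumed to be the
    margin between its two most probable classes, so vague sentences count less.
    """
    num_classes = len(sentence_dists[0])
    totals = [0.0] * num_classes
    weight_sum = 0.0

    for dist in sentence_dists:
        ranked = sorted(dist, reverse=True)
        confidence = ranked[0] - ranked[1] if num_classes > 1 else 1.0
        weight_sum += confidence
        for c, p in enumerate(dist):
            totals[c] += confidence * p

    # If every sentence is maximally vague, fall back to a plain average.
    if weight_sum == 0.0:
        n = len(sentence_dists)
        return [sum(col) / n for col in zip(*sentence_dists)]

    return [t / weight_sum for t in totals]


# One confident sentence and one vague sentence (hypothetical values).
doc_dist = document_class_distribution([[0.9, 0.1], [0.5, 0.5]])
print(doc_dist)
```

In this toy input the vague sentence has zero margin, so the document distribution is driven entirely by the confident sentence.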

GLEN: General-Purpose Event Detection for Thousands of Types

Mar 20, 2023

Qiusi Zhan, Sha Li, Kathryn Conger, Martha Palmer, Heng Ji, Jiawei Han

Abstract: The development of event extraction systems has been hindered by the absence of wide-coverage, large-scale datasets. To make event extraction systems more accessible, we build a general-purpose event detection dataset GLEN, which covers 3,465 different event types, making it over 20x larger in ontology than any current dataset. GLEN is created by utilizing the DWD Overlay, which provides a mapping between Wikidata Qnodes and PropBank rolesets. This enables us to use the abundant existing annotation for PropBank as distant supervision. In addition, we propose a new multi-stage event detection model specifically designed to handle the large ontology size and partial labels in GLEN. We show that our model exhibits superior performance (~10% F1 gain) compared to both conventional classification baselines and newer definition-based models. Finally, we perform error analysis and show that label noise is still the largest challenge for improving performance.

* The first two authors contributed equally. (15 pages, 11 figures)

Edgeformers: Graph-Empowered Transformers for Representation Learning on Textual-Edge Networks

Feb 21, 2023

Bowen Jin, Yu Zhang, Yu Meng, Jiawei Han

Abstract: Edges in many real-world social/information networks are associated with rich text information (e.g., user-user communications or user-product reviews). However, mainstream network representation learning models focus on propagating and aggregating node attributes, lacking specific designs to utilize text semantics on edges. While there exist edge-aware graph neural networks, they directly initialize edge attributes as a feature vector, which cannot fully capture the contextualized text semantics of edges. In this paper, we propose Edgeformers, a framework built upon graph-enhanced Transformers, to perform edge and node representation learning by modeling texts on edges in a contextualized way. Specifically, in edge representation learning, we inject network information into each Transformer layer when encoding edge texts; in node representation learning, we aggregate edge representations through an attention mechanism within each node's ego-graph. On five public datasets from three different domains, Edgeformers consistently outperform state-of-the-art baselines in edge classification and link prediction, demonstrating their efficacy in learning edge and node representations, respectively.

* ICLR 2023. (Code: https://github.com/PeterGriffinJin/Edgeformers)

PDSum: Prototype-driven Continuous Summarization of Evolving Multi-document Sets Stream

Feb 10, 2023

Susik Yoon, Hou Pong Chan, Jiawei Han

Abstract: Summarizing text-rich documents has long been studied in the literature, but most of the existing efforts have been made to summarize a static and predefined multi-document set. With the rapid development of online platforms for generating and distributing text-rich documents, there arises an urgent need for continuously summarizing dynamically evolving multi-document sets where the composition of documents and sets is changing over time. This is especially challenging as the summarization should be not only effective in incorporating relevant, novel, and distinctive information from each concurrent multi-document set, but also efficient in serving online applications. In this work, we propose a new summarization problem, Evolving Multi-Document sets stream Summarization (EMDS), and introduce a novel unsupervised algorithm PDSum with the idea of prototype-driven continuous summarization. PDSum builds a lightweight prototype of each multi-document set and exploits it to adapt to new documents while preserving accumulated knowledge from previous documents. To update new summaries, the most representative sentences for each multi-document set are extracted by measuring their similarities to the prototypes. A thorough evaluation with real multi-document sets streams demonstrates that PDSum outperforms state-of-the-art unsupervised multi-document summarization algorithms in EMDS in terms of relevance, novelty, and distinctiveness and is also robust to various evaluation settings.

* Accepted by WWW'23
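
The prototype idea in the PDSum abstract (keep a lightweight prototype per document set, update it as new documents arrive, and pick the sentences closest to it) can be sketched roughly as follows. This is an illustrative assumption, not the published algorithm: the prototype is taken to be a mean of hypothetical sentence embeddings, the update is a simple exponential blend controlled by a made-up `decay` parameter, and the names `update_prototype` and `top_sentences` are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def update_prototype(old, new_vectors, decay=0.5):
    """Blend the previous prototype with the mean of newly arrived sentences.

    `decay` (hypothetical) controls how much accumulated knowledge is preserved.
    """
    fresh = mean_vector(new_vectors)
    if old is None:
        return fresh
    return [decay * o + (1 - decay) * f for o, f in zip(old, fresh)]


def top_sentences(sentences, vectors, prototype, k=1):
    """Return the k sentences whose vectors are most similar to the prototype."""
    ranked = sorted(
        zip(sentences, vectors),
        key=lambda pair: cosine(pair[1], prototype),
        reverse=True,
    )
    return [sentence for sentence, _ in ranked[:k]]


# Toy document set with hypothetical 2-d sentence embeddings.
sentences = ["a", "b", "c"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prototype = update_prototype(None, vectors)
summary = top_sentences(sentences, vectors, prototype, k=1)
print(summary)  # ['c']
```

Sentence "c" is selected because its vector points in the same direction as the mean of the set; a later call to `update_prototype(prototype, new_vectors)` would shift the prototype toward newly arrived sentences while retaining part of the old one.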
