Wenlong Zhao

Multistage Collaborative Knowledge Distillation from Large Language Models

Nov 15, 2023
Jiachen Zhao, Wenlong Zhao, Andrew Drozdov, Benjamin Rozonoyer, Md Arafat Sultan, Jay-Yoon Lee, Mohit Iyyer, Andrew McCallum

We study semi-supervised sequence prediction tasks where labeled data are too scarce to effectively finetune a model and, at the same time, few-shot prompting of a large language model (LLM) yields suboptimal performance. This happens when a task, such as parsing, is both expensive to annotate and unfamiliar to a pretrained LLM. In this paper, we present the finding that student models distilled from a prompted LLM can often generalize better than their teacher on such tasks. Leveraging this finding, we propose a new distillation method for such tasks: multistage collaborative knowledge distillation from an LLM (MCKD). MCKD first prompts an LLM with few-shot in-context learning to produce pseudolabels for unlabeled data. Then, at each stage of distillation, a pair of students is trained on disjoint partitions of the pseudolabeled data, and each student produces new and improved pseudolabels for the partition it has not seen, which supervise the next round of students. We show the benefit of multistage cross-partition labeling on two constituency parsing tasks. On CRAFT biomedical parsing, 3-stage MCKD with 50 labeled examples matches the performance of supervised finetuning with 500 examples and outperforms the prompted LLM and vanilla KD by 7.5% and 3.7% parsing F1, respectively.
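Since the abstract walks through the cross-partition labeling loop procedurally, the following minimal Python sketch shows that data flow. The helpers prompt_llm and train_student are hypothetical dummy stand-ins, not the authors' code, and the partition sizes and stage count are illustrative.

```python
def prompt_llm(inputs):
    """Stand-in for few-shot prompting an LLM to produce initial pseudolabels."""
    return [f"pseudo({x})" for x in inputs]

def train_student(inputs, labels):
    """Stand-in for finetuning a student model; returns a simple predictor."""
    memory = dict(zip(inputs, labels))
    return lambda xs: [memory.get(x, f"guess({x})") for x in xs]

unlabeled = [f"sentence_{i}" for i in range(10)]
half = len(unlabeled) // 2
part_a, part_b = unlabeled[:half], unlabeled[half:]

# Stage 0: the prompted LLM pseudolabels both partitions.
labels_a, labels_b = prompt_llm(part_a), prompt_llm(part_b)

for stage in range(3):  # e.g. 3-stage MCKD
    # Each student trains on one partition's current pseudolabels ...
    student_a = train_student(part_a, labels_a)
    student_b = train_student(part_b, labels_b)
    # ... and relabels the partition it did not see (cross-partition labeling),
    # producing the supervision for the next round of students.
    labels_b, labels_a = student_a(part_b), student_b(part_a)

# A final student can then be trained on the union of the latest pseudolabels.
final_student = train_student(part_a + part_b, labels_a + labels_b)
```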

Editing Commonsense Knowledge in GPT

May 24, 2023
Anshita Gupta, Debanjan Mondal, Akshay Krishna Sheshadri, Wenlong Zhao, Xiang Lorraine Li, Sarah Wiegreffe, Niket Tandon

Memory editing methods for updating encyclopedic knowledge in transformers have received increasing attention for their efficacy, specificity, and generalization advantages. However, it remains unclear whether such methods can be adapted to the more nuanced domain of commonsense knowledge. We propose $MEMIT_{CSK}$, an adaptation of MEMIT for editing commonsense mistakes in GPT-2 Large and XL. We extend editing to various token locations and employ a robust layer selection strategy. Models edited by $MEMIT_{CSK}$ outperform the fine-tuning baselines by 10.97% and 10.73% F1 on subsets of PEP3k and 20Q. We further propose a novel evaluation dataset, MEMIT-CSK-PROBE, that contains unaffected neighborhood, affected neighborhood, affected paraphrase, and affected reasoning challenges. $MEMIT_{CSK}$ demonstrates favorable semantic generalization, outperforming fine-tuning baselines by 13.72% and 5.57% overall scores on MEMIT-CSK-PROBE. These results suggest a compelling future direction: incorporating context-specific user feedback about commonsense into GPT by direct model editing, rectifying and customizing model behavior via human-in-the-loop systems.

* Code and data are available at https://github.com/anshitag/memit_csk
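As a toy reading aid for the probe categories above, the sketch below shows the intended evaluation logic: an edit should flip the model's judgment on the edited statement and its affected probes, but not on unaffected neighbors. The two toy "models" and the example statements are invented for illustration and are not the authors' evaluation code.

```python
# Toy illustration of probe-based edit evaluation (not the authors' code):
# the two "models" are hand-written judges standing in for pre- and post-edit GPT-2.
pre_edit = lambda statement: "fly" not in statement   # commonsense mistake: rejects flying birds
post_edit = lambda statement: True                    # after editing, accepts every probe below

probes = {
    "edited statement":        "birds can fly",
    "affected paraphrase":     "birds are able to fly",
    "unaffected neighborhood": "fish can swim",
}
for name, statement in probes.items():
    flipped = pre_edit(statement) != post_edit(statement)
    expected = "should flip" if name.startswith(("edited", "affected")) else "should stay unchanged"
    print(f"{name}: {'flipped' if flipped else 'unchanged'} ({expected})")
```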

SAD: Semi-Supervised Anomaly Detection on Dynamic Graphs

May 23, 2023
Sheng Tian, Jihai Dong, Jintang Li, Wenlong Zhao, Xiaolong Xu, Baokun Wang, Bowen Song, Changhua Meng, Tianyi Zhang, Liang Chen

Anomaly detection aims to distinguish abnormal instances that deviate significantly from the majority of benign ones. As real-world instances are naturally connected and can be represented as graphs, graph neural networks have become increasingly popular for tackling anomaly detection. Despite promising results, research on anomaly detection has focused almost exclusively on static graphs, while mining anomalous patterns from dynamic graphs is rarely studied despite its significant application value. In addition, anomaly detection is typically tackled from a semi-supervised perspective due to the lack of sufficient labeled data. However, most proposed methods merely exploit the labeled data, leaving a large number of unlabeled samples unexplored. In this work, we present semi-supervised anomaly detection (SAD), an end-to-end framework for anomaly detection on dynamic graphs. By combining a time-equipped memory bank with a pseudo-label contrastive learning module, SAD is able to fully exploit large amounts of unlabeled samples and uncover underlying anomalies on evolving graph streams. Extensive experiments on four real-world datasets demonstrate that SAD efficiently discovers anomalies in dynamic graphs and outperforms existing advanced methods even when provided with only a little labeled data.

* Accepted to IJCAI'23. Code will be available at https://github.com/D10Andy/SAD 
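To make the two named components more concrete, here is a minimal PyTorch-style sketch of a time-equipped memory bank and a pseudo-label contrastive term; the shapes, the similarity-based loss, and the score-thresholding rule for pseudo-labels are illustrative assumptions rather than the SAD implementation.

```python
import torch
import torch.nn.functional as F

class TimeMemoryBank:
    """Stores the most recent embedding and timestamp for each node."""
    def __init__(self, num_nodes, dim):
        self.emb = torch.zeros(num_nodes, dim)
        self.time = torch.zeros(num_nodes)

    def update(self, node_ids, embeddings, timestamps):
        self.emb[node_ids] = embeddings.detach()
        self.time[node_ids] = timestamps

def pseudo_label_contrastive_loss(z, bank, node_ids, anomaly_scores,
                                  score_threshold=0.5, temperature=0.1):
    """Pull pseudo-normal nodes toward their memory embedding, push pseudo-anomalous ones away."""
    current = F.normalize(z, dim=-1)
    memory = F.normalize(bank.emb[node_ids], dim=-1)
    similarity = (current * memory).sum(-1) / temperature
    pseudo_normal = (anomaly_scores < score_threshold).float()  # pseudo-labels from anomaly scores
    return (-pseudo_normal * similarity + (1.0 - pseudo_normal) * similarity).mean()

# Toy usage on random embeddings for a batch of interactions at one timestamp.
bank = TimeMemoryBank(num_nodes=100, dim=16)
node_ids = torch.tensor([3, 7, 42])
z = torch.randn(3, 16, requires_grad=True)
loss = pseudo_label_contrastive_loss(z, bank, node_ids, anomaly_scores=torch.rand(3))
loss.backward()
bank.update(node_ids, z, timestamps=torch.full((3,), 5.0))
```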

GRANDE: a neural model over directed multigraphs with application to anti-money laundering

Feb 04, 2023
Ruofan Wu, Boqun Ma, Hong Jin, Wenlong Zhao, Weiqiang Wang, Tianyi Zhang

The application of graph representation learning techniques to financial risk management (FRM) has attracted significant attention recently. However, directly modeling transaction networks with graph neural models remains challenging. First, transaction networks are directed multigraphs by nature, which cannot be properly handled by most off-the-shelf graph neural networks (GNNs). Second, a crucial problem in FRM scenarios such as anti-money laundering (AML) is identifying risky transactions; this is most naturally cast as an edge classification problem with rich edge-level features, which are not fully exploited by the prevailing node-centric message passing protocols of GNNs. In this paper, we present a systematic investigation of the design aspects of neural models over directed multigraphs and develop a novel GNN protocol that overcomes these challenges by efficiently incorporating directional information, along with an enhancement that targets edge-related tasks using a novel message passing scheme over an extension of the edge-to-node dual graph. A concrete GNN architecture called GRANDE is derived using the proposed protocol, with further improvements and generalizations to temporal dynamic graphs. We apply the GRANDE model to both a real-world anti-money laundering task and public datasets. Experimental evaluations show the superiority of the proposed GRANDE architecture over recent state-of-the-art models on dynamic graph modeling and directed graph modeling.

* Accepted as a regular paper at ICDM 2022
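The edge-to-node dual graph mentioned above is the construction that lets edge classification be treated as node classification over a derived graph. Below is a generic, plain-Python illustration for a directed multigraph, not the GRANDE code; connecting edge e to edge f whenever e's head is f's tail is just one simple directed variant.

```python
from collections import defaultdict

def edge_to_node_dual(edges):
    """edges: list of (edge_id, src, dst) in a directed multigraph.
    Returns an adjacency list over edge ids, connecting edge e to edge f
    whenever e's destination node is f's source node (a directed walk)."""
    out_edges_by_node = defaultdict(list)
    for eid, src, dst in edges:
        out_edges_by_node[src].append(eid)
    dual_adj = defaultdict(list)
    for eid, src, dst in edges:
        for fid in out_edges_by_node[dst]:
            if fid != eid:
                dual_adj[eid].append(fid)
    return dual_adj

# Tiny example: two parallel transactions a->b (a multigraph feature), plus b->c.
edges = [(0, "a", "b"), (1, "a", "b"), (2, "b", "c")]
print(dict(edge_to_node_dual(edges)))  # {0: [2], 1: [2]}
```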

ConReader: Exploring Implicit Relations in Contracts for Contract Clause Extraction

Oct 17, 2022
Weiwen Xu, Yang Deng, Wenqiang Lei, Wenlong Zhao, Tat-Seng Chua, Wai Lam

We study automatic Contract Clause Extraction (CCE) by modeling implicit relations in legal contracts. Existing CCE methods mostly treat contracts as plain text, creating a substantial barrier to understanding contracts of high complexity. In this work, we first comprehensively analyze the complexity issues of contracts and distill three implicit relations commonly found in contracts: 1) the Long-range Context Relation, which captures the correlations of distant clauses; 2) the Term-Definition Relation, which captures the relation between important terms and their corresponding definitions; and 3) the Similar Clause Relation, which captures the similarities between clauses of the same type. We then propose a novel framework, ConReader, that exploits these three relations for better contract understanding and improved CCE. Experimental results show that ConReader makes predictions more interpretable and achieves a new state of the art on two CCE tasks in both conventional and zero-shot settings.

* To appear at EMNLP 2022 main conference 

ezCoref: Towards Unifying Annotation Guidelines for Coreference Resolution

Oct 13, 2022
Ankita Gupta, Marzena Karpinska, Wenlong Zhao, Kalpesh Krishna, Jack Merullo, Luke Yeh, Mohit Iyyer, Brendan O'Connor

Large-scale, high-quality corpora are critical for advancing research in coreference resolution. However, existing datasets vary in their definitions of coreference and have been collected via complex and lengthy guidelines curated for linguistic experts. These concerns have sparked growing interest among researchers in curating a unified set of guidelines suitable for annotators with various backgrounds. In this work, we develop a crowdsourcing-friendly coreference annotation methodology, ezCoref, consisting of an annotation tool and an interactive tutorial. We use ezCoref to re-annotate 240 passages from seven existing English coreference datasets (spanning fiction, news, and multiple other domains) while teaching annotators only the cases that are treated similarly across these datasets. Surprisingly, we find that reasonable-quality annotations are achievable (>90% agreement between crowd and expert annotations) even without extensive training. Carefully analyzing the remaining disagreements, we identify linguistic cases that our annotators unanimously agree upon but that lack unified treatment (e.g., generic pronouns, appositives) in existing datasets. We propose that the research community revisit these phenomena when curating future unified annotation guidelines.

* Preprint (19 pages); code at https://github.com/gnkitaa/ezCoref

Toward Compact Parameter Representations for Architecture-Agnostic Neural Network Compression

Nov 19, 2021
Yuezhou Sun, Wenlong Zhao, Lijun Zhang, Xiao Liu, Hui Guan, Matei Zaharia

This paper investigates deep neural network (DNN) compression from the perspective of compactly representing and storing trained parameters. We explore the previously overlooked opportunity of cross-layer architecture-agnostic representation sharing for DNN parameters. To do this, we decouple feedforward parameters from DNN architectures and leverage additive quantization, an extreme lossy compression method invented for image descriptors, to compactly represent the parameters. The representations are then finetuned on task objectives to improve task accuracy. We conduct extensive experiments on MobileNet-v2, VGG-11, ResNet-50, Feature Pyramid Networks, and pruned DNNs trained for classification, detection, and segmentation tasks. The conceptually simple scheme consistently outperforms iterative unstructured pruning. Applied to ResNet-50 with 76.1% top-1 accuracy on the ILSVRC12 classification challenge, it achieves a $7.2\times$ compression ratio with no accuracy loss and a $15.3\times$ compression ratio at 74.79% accuracy. Further analyses suggest that representation sharing can frequently happen across network layers and that learning shared representations for an entire DNN can achieve better accuracy at the same compression ratio than compressing the model as multiple separate parts. We release PyTorch code to facilitate DNN deployment on resource-constrained devices and spur future research on efficient representations and storage of DNN parameters.
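Since the core ingredient above is additive quantization of decoupled parameters, here is a minimal NumPy sketch of the idea: a parameter block is approximated by a sum of one codeword from each of several codebooks, so only small integer codes (plus the shared codebooks) need to be stored. The random codebooks and greedy encoding are simplifications for illustration; in the paper the representations are additionally finetuned on the task objective.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, M, K = 8, 4, 16                      # block size, number of codebooks, codewords per book
codebooks = rng.normal(size=(M, K, dim))  # learned in practice; random here for illustration

def encode(block):
    """Greedily pick one codeword per codebook to approximate `block`."""
    residual, codes = block.copy(), []
    for m in range(M):
        idx = int(np.argmin(((codebooks[m] - residual) ** 2).sum(-1)))
        codes.append(idx)
        residual -= codebooks[m, idx]
    return codes

def decode(codes):
    """Reconstruct the block as a sum of the selected codewords."""
    return sum(codebooks[m, c] for m, c in enumerate(codes))

weights = rng.normal(size=dim)            # one block of trained parameters
codes = encode(weights)                   # stored as M small integers instead of dim floats
print(codes, np.linalg.norm(weights - decode(codes)))
```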

IGA: An Intent-Guided Authoring Assistant

Apr 14, 2021
Simeng Sun, Wenlong Zhao, Varun Manjunatha, Rajiv Jain, Vlad Morariu, Franck Dernoncourt, Balaji Vasan Srinivasan, Mohit Iyyer

While large-scale pretrained language models have significantly improved writing assistance functionalities such as autocomplete, more complex and controllable writing assistants have yet to be explored. We leverage advances in language modeling to build an interactive writing assistant that generates and rephrases text according to fine-grained author specifications. Users provide input to our Intent-Guided Assistant (IGA) in the form of text interspersed with tags that correspond to specific rhetorical directives (e.g., adding description or contrast, or rephrasing a particular sentence). We fine-tune a language model on a dataset heuristically labeled with author intent, which allows IGA to fill in these tags with generated text that users can subsequently edit to their liking. A series of automatic and crowdsourced evaluations confirm the quality of IGA's generated outputs, while a small-scale user study demonstrates author preference for IGA over baseline methods in a creative writing task. We release our dataset, code, and demo to spur further research into AI-assisted writing.

* 13 pages 
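To illustrate the tag-interspersed input format described above, here is a toy sketch in which tagged spans are replaced by generated text. The tag name, the fill_tags helper, and the dummy generator are invented for this sketch and are not IGA's actual tag set or API.

```python
import re

def fill_tags(text, generate):
    """Replace each <tag>...</tag> span with text produced by `generate` (hypothetical helper)."""
    return re.sub(
        r"<(\w+)>(.*?)</\1>",
        lambda m: generate(intent=m.group(1), context=m.group(2)),
        text,
    )

draft = "The storm rolled in. <describe>the harbor at night</describe> Then the bell rang."
# A dummy generator stands in for the finetuned language model.
print(fill_tags(draft, generate=lambda intent, context: f"[{intent}: {context}]"))
```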

Compressing Transformer-Based Semantic Parsing Models using Compositional Code Embeddings

Oct 10, 2020
Prafull Prakash, Saurabh Kumar Shashidhar, Wenlong Zhao, Subendhu Rongali, Haidar Khan, Michael Kayser

The current state-of-the-art task-oriented semantic parsing models use BERT or RoBERTa as pretrained encoders; these models have huge memory footprints. This poses a challenge to their deployment for voice assistants such as Amazon Alexa and Google Assistant on edge devices with limited memory budgets. We propose to learn compositional code embeddings to greatly reduce the sizes of BERT-base and RoBERTa-base. We also apply the technique to DistilBERT, ALBERT-base, and ALBERT-large, three already-compressed BERT variants that attain similar state-of-the-art performance on semantic parsing with much smaller model sizes. We observe embedding compression rates of 95.15%–98.46% and encoder compression rates of 20.47%–34.22% while preserving over 97.5% of semantic parsing performance. We provide the recipe for training and analyze the trade-off between code embedding sizes and downstream performance.

* Accepted at EMNLP 2020 (Findings); 7 pages
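As a rough sketch of what a compositional code embedding layer looks like, the module below stores a handful of small integer codes per vocabulary item and reconstructs each embedding as the sum of the selected codewords. The sizes and random initialization are illustrative; in practice the codes are learned rather than assigned at random.

```python
import torch
import torch.nn as nn

class CodeEmbedding(nn.Module):
    """Embeds token ids as sums of codewords selected by discrete per-token codes."""
    def __init__(self, vocab_size, num_codebooks=8, codes_per_book=64, dim=768):
        super().__init__()
        # One small integer code per (token, codebook); learned in practice, random here.
        self.register_buffer(
            "codes", torch.randint(0, codes_per_book, (vocab_size, num_codebooks)))
        self.codebooks = nn.Parameter(torch.randn(num_codebooks, codes_per_book, dim))

    def forward(self, token_ids):
        codes = self.codes[token_ids]                                    # (..., M)
        selected = self.codebooks[torch.arange(codes.shape[-1]), codes]  # (..., M, dim)
        return selected.sum(dim=-2)                                      # (..., dim)

# Rough storage comparison for a 30k-token vocabulary at dim 768 (illustrative numbers).
emb = CodeEmbedding(vocab_size=30000)
dense_params = 30000 * 768
code_params = emb.codebooks.numel() + emb.codes.numel()  # the codes are tiny integers
print(dense_params, code_params, emb(torch.tensor([1, 2, 3])).shape)
```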

Rethinking Exposure Bias In Language Modeling

Oct 13, 2019
Yifan Xu, Kening Zhang, Haoyu Dong, Yuezhou Sun, Wenlong Zhao, Zhuowen Tu

Exposure bias describes the phenomenon that a language model trained with teacher forcing may perform poorly at inference time, when its predictions are conditioned on its own previous predictions rather than on prefixes seen in the training corpus. Recently, several generative adversarial network (GAN) and reinforcement learning (RL) methods have been introduced to alleviate this problem. Nonetheless, a common issue in RL and GAN training is the sparsity of reward signals. In this paper, we adopt two simple strategies, multi-range reinforcing and multi-entropy sampling, to amplify and denoise the reward signal. Our model improves over competing models in terms of BLEU scores and road exam, a new metric we design to measure robustness against exposure bias in language models.
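Because the abstract defines exposure bias as a train/inference mismatch, a tiny sketch contrasting teacher forcing with free-running decoding may make the definition concrete. The model_step function is a dummy next-token predictor invented for this illustration, not any specific architecture.

```python
def model_step(prefix):
    """Dummy next-token predictor; a real model would score the whole vocabulary."""
    return prefix[-1] + "'" if prefix else "a"

gold = ["a", "b", "c", "d"]

# Teacher forcing: every prediction is conditioned on the gold prefix.
teacher_forced = [model_step(gold[:t]) for t in range(1, len(gold))]

# Free-running inference: each prediction is conditioned on prior predictions,
# so early mistakes compound -- the source of exposure bias.
generated = [gold[0]]
for _ in range(len(gold) - 1):
    generated.append(model_step(generated))

print(teacher_forced, generated)
```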
