Jiaming Shen

On What Basis? Predicting Text Preference Via Structured Comparative Reasoning

Nov 14, 2023
Jing Nathan Yan, Tianqi Liu, Justin T Chiu, Jiaming Shen, Zhen Qin, Yue Yu, Yao Zhao, Charu Lakshmanan, Yair Kurzion, Alexander M. Rush, Jialu Liu, Michael Bendersky

Comparative reasoning plays a crucial role in text preference prediction; however, large language models (LLMs) often demonstrate inconsistencies in their reasoning. While approaches like Chain-of-Thought improve accuracy in many other settings, they struggle to consistently distinguish the similarities and differences between complex texts. We introduce SC, a prompting approach that predicts text preferences by generating structured intermediate comparisons. SC begins by proposing aspects of comparison, followed by generating textual comparisons under each aspect. We select consistent comparisons with a pairwise consistency comparator that ensures each aspect's comparisons clearly distinguish differences between the texts, significantly reducing hallucination and improving consistency. Our comprehensive evaluations across various NLP tasks, including summarization, retrieval, and automatic rating, demonstrate that SC equips LLMs to achieve state-of-the-art performance in text preference prediction.
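To make the pipeline concrete, here is a minimal sketch of how the aspect-propose/compare/filter loop could be orchestrated. `call_llm`, the prompt wording, and the "answer A or B first" convention are all illustrative assumptions, not the paper's actual templates.

```python
# Minimal sketch of SC-style structured comparison prompting.
from typing import Callable, List

def predict_preference(text_a: str, text_b: str,
                       call_llm: Callable[[str], str]) -> str:
    # Step 1: propose aspects of comparison.
    aspects = call_llm(
        "List three short aspects for comparing two texts, one per line.\n"
        f"Text A: {text_a}\nText B: {text_b}"
    ).splitlines()

    kept: List[str] = []
    for aspect in aspects:
        # Step 2: generate a textual comparison under each aspect.
        ask = lambda x, y: call_llm(
            f"On the aspect '{aspect}', which text is better, A or B? "
            f"Answer 'A' or 'B' first, then explain.\nA: {x}\nB: {y}")
        fwd, rev = ask(text_a, text_b), ask(text_b, text_a)
        # Step 3: pairwise consistency check: swap the texts and keep the
        # aspect only when both verdicts pick the same underlying text
        # (i.e., the labels differ after swapping).
        if fwd.strip()[:1] != rev.strip()[:1]:
            kept.append(f"{aspect}: {fwd}")

    # Step 4: predict the preference from the consistent comparisons only.
    return call_llm("Based on these aspect-wise comparisons, which text is "
                    "preferred overall, A or B?\n" + "\n".join(kept))
```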

Explanation-aware Soft Ensemble Empowers Large Language Model In-context Learning

Nov 13, 2023
Yue Yu, Jiaming Shen, Tianqi Liu, Zhen Qin, Jing Nathan Yan, Jialu Liu, Chao Zhang, Michael Bendersky

Large language models (LLMs) have shown remarkable capabilities in various natural language understanding tasks. With only a few demonstration examples, these LLMs can quickly adapt to target tasks without expensive gradient updates. Common strategies to boost such 'in-context' learning ability are to ensemble multiple decoded results from the model and to require the model to generate an explanation along with the prediction. However, these approaches often treat different class predictions equally and neglect the potential discrepancy between the explanations and the predictions. To fully unleash the power of explanations, we propose EASE, an Explanation-Aware Soft Ensemble framework to empower in-context learning with LLMs. We design two techniques, explanation-guided ensemble and soft probability aggregation, to mitigate the effect of unreliable explanations and improve the consistency between explanations and final predictions. Experiments on seven natural language understanding tasks and four LLMs of varying sizes demonstrate the effectiveness of our proposed framework.
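The soft aggregation step can be sketched as follows; the data layout (per-class probabilities plus a 0-1 explanation-reliability weight for each sampled output) and the weighting rule are simplifying assumptions, not EASE's exact recipe.

```python
# Hedged sketch: average per-class probabilities over sampled outputs
# (soft voting) instead of majority-voting over hard labels, down-weighting
# samples whose explanation looks unreliable.
from collections import defaultdict
from typing import Dict, List, Tuple

def soft_ensemble(samples: List[Tuple[Dict[str, float], float]]) -> str:
    """samples: list of (class -> probability, explanation_reliability)."""
    totals: Dict[str, float] = defaultdict(float)
    for probs, reliability in samples:
        for label, p in probs.items():
            totals[label] += reliability * p  # soft vote, not argmax
    return max(totals, key=totals.get)

# Three sampled (probabilities, reliability) pairs; the middle one has an
# explanation inconsistent with its prediction, so it gets a lower weight.
samples = [({"pos": 0.6, "neg": 0.4}, 1.0),
           ({"pos": 0.4, "neg": 0.6}, 0.5),
           ({"pos": 0.7, "neg": 0.3}, 1.0)]
print(soft_ensemble(samples))  # -> "pos"
```

A hard majority vote happens to give the same answer here, but soft aggregation preserves each sample's uncertainty when the per-class distributions are close.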

Bridging the Gap: Fine-to-Coarse Sketch Interpolation Network for High-Quality Animation Sketch Inbetweening

Aug 25, 2023
Jiaming Shen, Kun Hu, Wei Bao, Chang Wen Chen, Zhiyong Wang

The 2D animation workflow is typically initiated with the creation of keyframes using sketch-based drawing. Subsequent inbetweens (i.e., intermediate sketch frames) are then crafted through manual interpolation for smooth animations, which is a labor-intensive process. Thus, the prospect of automatic animation sketch interpolation has become highly appealing. However, existing video interpolation methods are generally hindered by two key issues for sketch inbetweening: 1) limited texture and colour details in sketches, and 2) exaggerated alterations between two sketch keyframes. To overcome these issues, we propose a novel deep learning method, namely the Fine-to-Coarse Sketch Interpolation Network (FC-SIN). This approach incorporates multi-level guidance that formulates region-level correspondence, sketch-level correspondence and pixel-level dynamics. A multi-stream U-Transformer is then devised to characterize sketch inbetweening patterns using these multi-level guides through the integration of both self-attention and cross-attention mechanisms. Additionally, to facilitate future research on animation sketch inbetweening, we constructed a large-scale dataset, STD-12K, comprising 30 sketch animation series in diverse artistic styles. Comprehensive experiments on this dataset convincingly show that our proposed FC-SIN surpasses state-of-the-art interpolation methods. Our code and dataset will be made publicly available.

* 7 pages, 6 figures 
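As a very rough illustration of the attention wiring described above, the block below mixes self-attention within one guidance stream with cross-attention to another; the stream count, dimensions, and residual layout are assumptions for illustration, not FC-SIN's actual design.

```python
# Toy multi-stream block: self-attention within one feature stream,
# cross-attention to a second guidance stream.
import torch
import torch.nn as nn

class MultiStreamBlock(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, stream: torch.Tensor, guide: torch.Tensor) -> torch.Tensor:
        # Self-attention within one stream (e.g., sketch-level features).
        x, _ = self.self_attn(stream, stream, stream)
        stream = self.norm1(stream + x)
        # Cross-attention to another stream (e.g., region-level guidance).
        x, _ = self.cross_attn(stream, guide, guide)
        return self.norm2(stream + x)

frames = torch.randn(2, 196, 64)   # toy tokenized keyframe features
regions = torch.randn(2, 32, 64)   # toy region-level guidance tokens
print(MultiStreamBlock()(frames, regions).shape)  # torch.Size([2, 196, 64])
```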
Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting

Jun 30, 2023
Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, Michael Bendersky

Ranking documents using Large Language Models (LLMs) by directly feeding the query and candidate documents into the prompt is an interesting and practical problem. However, there has been limited success so far, as researchers have found it difficult to outperform fine-tuned baseline rankers on benchmark datasets. We analyze pointwise and listwise ranking prompts used by existing methods and argue that off-the-shelf LLMs do not fully understand these ranking formulations, possibly due to the nature of how LLMs are trained. In this paper, we propose to significantly reduce the burden on LLMs with a new technique called Pairwise Ranking Prompting (PRP). Our results are the first in the literature to achieve state-of-the-art ranking performance on standard benchmarks using moderate-sized, open-sourced LLMs. On TREC-DL2020, PRP based on the Flan-UL2 model with 20B parameters outperforms the previous best approach in the literature, which is based on the black-box commercial GPT-4 with an estimated 50x larger model size, by over 5% at NDCG@1. On TREC-DL2019, PRP is only inferior to the GPT-4 solution on the NDCG@5 and NDCG@10 metrics, while outperforming other existing solutions, such as InstructGPT with its 175B parameters, by over 10% on nearly all ranking metrics. Furthermore, we propose several variants of PRP to improve efficiency and show that it is possible to achieve competitive results even with linear complexity. We also discuss other benefits of PRP, such as supporting both generation and scoring LLM APIs, as well as being insensitive to input ordering.

* 12 pages, 3 figures 
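A minimal sketch of the simplest PRP variant (all-pairs win counting) is below; `ask_llm` is a hypothetical hook for a generation API, and the prompt text is illustrative rather than the paper's template.

```python
# Pairwise Ranking Prompting sketch: score every passage pair in both input
# orders and rank by pairwise win counts. This is the simple O(n^2) variant;
# the paper also proposes more efficient variants, down to linear complexity.
from typing import Callable, List

def prp_rank(query: str, passages: List[str],
             ask_llm: Callable[[str], str]) -> List[str]:
    wins = [0.0] * len(passages)
    for i in range(len(passages)):
        for j in range(i + 1, len(passages)):
            # Prompt each pair in both orders to neutralize position bias.
            for a, b in ((i, j), (j, i)):
                prompt = (f"Query: {query}\n"
                          f"Passage A: {passages[a]}\n"
                          f"Passage B: {passages[b]}\n"
                          "Which passage is more relevant to the query? "
                          "Answer 'Passage A' or 'Passage B'.")
                winner = a if ask_llm(prompt).strip().endswith("A") else b
                wins[winner] += 0.5
    # Sort passages by descending win count.
    order = sorted(range(len(passages)), key=lambda k: -wins[k])
    return [passages[k] for k in order]
```

Because each pair is queried in both orders and the results aggregated, the final ranking does not depend on how the passages were presented, which is the ordering-insensitivity benefit noted above.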
Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias

Jun 28, 2023
Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander Ratner, Ranjay Krishna, Jiaming Shen, Chao Zhang

Large language models (LLMs) have recently been leveraged as training data generators for various natural language processing (NLP) tasks. While previous research has explored different approaches to training models using generated data, these approaches generally rely on simple class-conditional prompts, which may limit the diversity of the generated data and inherit the systematic biases of the LLM. Thus, we investigate training data generation with diversely attributed prompts (e.g., specifying attributes like length and style), which have the potential to yield diverse and attributed generated data. Our investigation focuses on datasets with high cardinality and diverse domains, wherein we demonstrate that attributed prompts outperform simple class-conditional prompts in terms of the resulting model's performance. Additionally, we present a comprehensive empirical study on data generation encompassing vital aspects like bias, diversity, and efficiency, and highlight three key observations: first, synthetic datasets generated by simple prompts exhibit significant biases, such as regional bias; second, attribute diversity plays a pivotal role in enhancing model performance; lastly, attributed prompts match the performance of simple class-conditional prompts while using only 5% of the ChatGPT querying cost of the latter. We release the generated dataset and the prompts used to facilitate future research. The data and code will be available at https://github.com/yueyu1030/AttrPrompt.

* Work in progress. A shorter version is accepted to the ICML DMLR workshop 
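The contrast with a plain class-conditional prompt is easy to see in a sketch; the attribute dimensions and values below are invented examples, not the paper's actual attribute sets.

```python
# Illustrative attributed-prompt construction for data generation: each call
# varies along several attribute axes, whereas a class-conditional prompt
# ("Write a positive review.") repeats the same request verbatim.
import itertools
import random

ATTRIBUTES = {
    "length": ["one sentence", "about three sentences"],
    "style": ["formal", "casual"],
    "subtopic": ["pricing", "customer service", "shipping"],
}

def attributed_prompts(label: str, n: int) -> list:
    combos = list(itertools.product(*ATTRIBUTES.values()))
    random.shuffle(combos)
    return [
        f"Write a {style} product review about {subtopic}, "
        f"{length} long, expressing {label} sentiment."
        for length, style, subtopic in combos[:n]
    ]

for p in attributed_prompts("positive", 3):
    print(p)
```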
Local Boosting for Weakly-Supervised Learning

Jun 05, 2023
Rongzhi Zhang, Yue Yu, Jiaming Shen, Xiquan Cui, Chao Zhang

Boosting is a commonly used technique to enhance the performance of a set of base models by combining them into a strong ensemble model. Though widely adopted, boosting is typically used in supervised learning where the data is labeled accurately. However, in weakly supervised learning, where most of the data is labeled through weak and noisy sources, it remains nontrivial to design effective boosting approaches. In this work, we show that the standard implementation of the convex combination of base learners can hardly work due to the presence of noisy labels. Instead, we propose LocalBoost, a novel framework for weakly-supervised boosting. LocalBoost iteratively boosts the ensemble model along two dimensions, i.e., intra-source and inter-source. The intra-source boosting introduces locality to the base learners and enables each base learner to focus on a particular feature regime by training new base learners on granularity-varying error regions. For the inter-source boosting, we leverage a conditional function to indicate the weak source in which a sample is more likely to appear. To account for the weak labels, we further design an estimate-then-modify approach to compute the model weights. Experiments on seven datasets show that our method significantly outperforms vanilla boosting methods and other weakly-supervised methods.

* Accepted by KDD 2023 Research Track 
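A toy sketch of the intra-source locality idea: fit each new base learner on a k-nearest-neighbor "error region" around a currently misclassified point. The uniform averaging of learners (instead of estimate-then-modify weights) and the single weak source are deliberate simplifications, not LocalBoost itself.

```python
# Toy locality-based boosting: instead of reweighting all points as in
# classic AdaBoost, each new base learner is fit only on the neighborhood
# of a point the current ensemble gets wrong.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def local_boost(X, y_weak, rounds=5, k=50):
    """X: (n, d) features; y_weak: (n,) noisy 0/1 labels from weak sources."""
    learners = []
    pred = np.zeros(len(X))
    for _ in range(rounds):
        errors = np.flatnonzero(np.round(pred) != y_weak)
        if errors.size == 0:
            break
        # Train the next base learner only on the k nearest neighbors of one
        # misclassified point: a granularity-varying "error region".
        center = X[errors[0]]
        region = np.argsort(((X - center) ** 2).sum(axis=1))[:k]
        learners.append(DecisionTreeClassifier(max_depth=3)
                        .fit(X[region], y_weak[region]))
        pred = np.mean([m.predict(X) for m in learners], axis=0)
    return learners

X = np.random.randn(200, 2)
y_weak = (X[:, 0] > 0).astype(int)  # toy weak labels
ensemble = local_boost(X, y_weak)
```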
ReGen: Zero-Shot Text Classification via Training Data Generation with Progressive Dense Retrieval

May 18, 2023
Yue Yu, Yuchen Zhuang, Rongzhi Zhang, Yu Meng, Jiaming Shen, Chao Zhang

With the development of large language models (LLMs), zero-shot learning has attracted much attention for various NLP tasks. Different from prior works that generate training data with billion-scale natural language generation (NLG) models, we propose a retrieval-enhanced framework to create training data from a general-domain unlabeled corpus. To realize this, we first conduct contrastive pretraining to learn an unsupervised dense retriever for extracting the most relevant documents using class-descriptive verbalizers. We then propose two simple strategies, namely Verbalizer Augmentation with Demonstrations and Self-consistency Guided Filtering, to improve the topic coverage of the dataset while removing noisy examples. Experiments on nine datasets demonstrate that REGEN achieves a 4.3% gain over the strongest baselines and saves around 70% of the time compared to baselines using large NLG models. Moreover, REGEN can be naturally integrated with recently proposed large language models to boost performance.

* ACL 2023 Findings (Code: https://github.com/yueyu1030/ReGen) 
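The retrieval step can be sketched with an off-the-shelf sentence encoder standing in for the paper's contrastively pretrained retriever; the verbalizers, corpus, and model choice below are illustrative assumptions.

```python
# Sketch of verbalizer-based retrieval: pull the documents closest to each
# class-descriptive verbalizer from an unlabeled corpus and treat them as
# pseudo-labeled training data for that class.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus = ["the game went into overtime after a late equalizer",
          "shares fell sharply after the earnings call",
          "the new phone ships with a faster chip"]
verbalizers = {"sports": "This document is about sports.",
               "business": "This document is about business and finance."}

doc_emb = encoder.encode(corpus, normalize_embeddings=True)
for label, verbalizer in verbalizers.items():
    q = encoder.encode([verbalizer], normalize_embeddings=True)[0]
    top = np.argsort(doc_emb @ q)[::-1][:1]       # cosine similarity ranking
    print(label, "->", [corpus[i] for i in top])  # pseudo-labeled example
```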
Do Not Blindly Imitate the Teacher: Using Perturbed Loss for Knowledge Distillation

May 08, 2023
Rongzhi Zhang, Jiaming Shen, Tianqi Liu, Jialu Liu, Michael Bendersky, Marc Najork, Chao Zhang

Knowledge distillation is a popular technique for transferring knowledge from a large teacher model to a small student model. Typically, the student learns to imitate the teacher by minimizing the KL divergence between its output distribution and the teacher's. In this work, we argue that such a learning objective is sub-optimal because there exists a discrepancy between the teacher's output distribution and the ground-truth label distribution; forcing the student to blindly imitate the unreliable teacher output distribution therefore leads to inferior performance. To this end, we propose a novel knowledge distillation objective, PTLoss, obtained by first representing the vanilla KL-based distillation loss function via a Maclaurin series and then perturbing the leading-order terms in this series. This perturbed loss implicitly transforms the original teacher into a proxy teacher whose distribution is closer to the ground-truth distribution. We establish a theoretical connection between this "distribution closeness" and the student model's generalizability, which enables us to select PTLoss's perturbation coefficients in a principled way. Extensive experiments on five datasets demonstrate that PTLoss can significantly improve distillation effectiveness for teachers of various scales.

* 16 pages 
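A hedged sketch of the series-then-perturb construction: the cross-entropy term -sum_i p_T[i] * log p_S[i] is rewritten with the Maclaurin series log x = -sum_k (1-x)^k / k, and the leading coefficients are then perturbed. The truncation order and coefficient values below are placeholders, not the principled choices derived in the paper.

```python
# Perturbed series-form distillation loss sketch (PyTorch).
import torch

def perturbed_distill_loss(student_logits, teacher_logits,
                           coeffs=(1.1, 0.9, 1.0, 1.0, 1.0)):
    p_s = torch.softmax(student_logits, dim=-1)
    p_t = torch.softmax(teacher_logits, dim=-1)
    loss = torch.zeros(())
    for k, c_k in enumerate(coeffs, start=1):
        # k-th Maclaurin term of -log p_S, scaled by perturbation coeff c_k.
        loss = loss + c_k * (p_t * (1 - p_s) ** k / k).sum(-1).mean()
    return loss

s = torch.randn(8, 10, requires_grad=True)  # student logits
t = torch.randn(8, 10)                      # teacher logits
perturbed_distill_loss(s, t).backward()     # differentiable drop-in term
```

Setting every coefficient to 1 recovers a truncated version of the vanilla distillation cross-entropy term.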
HiPrompt: Few-Shot Biomedical Knowledge Fusion via Hierarchy-Oriented Prompting

Apr 12, 2023
Jiaying Lu, Jiaming Shen, Bo Xiong, Wenjing Ma, Steffen Staab, Carl Yang

Medical decision-making processes can be enhanced by comprehensive biomedical knowledge bases, which require fusing knowledge graphs constructed from different sources via a uniform index system. The index system often organizes biomedical terms in a hierarchy to align entities at a fine granularity. To address the challenge of scarce supervision in the biomedical knowledge fusion (BKF) task, researchers have proposed various unsupervised methods. However, these methods heavily rely on ad-hoc lexical and structural matching algorithms, which fail to capture the rich semantics conveyed by biomedical entities and terms. Recently, neural embedding models have proved effective on semantics-rich tasks, but they rely on sufficient labeled data to be adequately trained. To bridge the gap between the label-scarce BKF task and neural embedding models, we propose HiPrompt, a supervision-efficient knowledge fusion framework that elicits the few-shot reasoning ability of large language models through hierarchy-oriented prompts. Empirical results on the collected KG-Hi-BKF benchmark datasets demonstrate the effectiveness of HiPrompt.

* In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (Short-Paper Track), 2023  
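The core prompt construction can be illustrated in a few lines: the candidate term's ancestor path in the hierarchy is serialized into the prompt so the LLM can reason over hierarchical context. The terms, path, and wording below are invented examples, not the paper's templates.

```python
# Illustrative hierarchy-oriented prompt builder.
def hierarchy_prompt(query_term: str, candidate: str, ancestors: list) -> str:
    # Serialize the candidate's ancestor chain into the prompt.
    path = " -> ".join(ancestors + [candidate])
    return (f"In a biomedical hierarchy, the term '{candidate}' sits under: "
            f"{path}.\nShould the source term '{query_term}' be placed under "
            f"'{candidate}'? Answer yes or no, then explain briefly.")

print(hierarchy_prompt(
    "myocardial infarction",
    "Heart Diseases",
    ["Diseases", "Cardiovascular Diseases"],
))
```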
"Why is this misleading?": Detecting News Headline Hallucinations with Explanations

Feb 12, 2023
Jiaming Shen, Jialu Liu, Dan Finnie, Negar Rahmati, Michael Bendersky, Marc Najork

Figure 1 for "Why is this misleading?": Detecting News Headline Hallucinations with Explanations
Figure 2 for "Why is this misleading?": Detecting News Headline Hallucinations with Explanations
Figure 3 for "Why is this misleading?": Detecting News Headline Hallucinations with Explanations
Figure 4 for "Why is this misleading?": Detecting News Headline Hallucinations with Explanations

Automatic headline generation enables users to comprehend ongoing news events promptly and has recently become an important task in web mining and natural language processing. With the growing need for news headline generation, we argue that the hallucination issue, namely that generated headlines are not supported by the original news stories, is a critical challenge for deploying this feature in web-scale systems. Meanwhile, due to the infrequency of hallucination cases and the careful reading required for raters to reach a correct consensus, it is difficult to acquire a large dataset for training a hallucination detection model through human curation. In this work, we present a new framework named ExHalder to address this challenge for headline hallucination detection. ExHalder adapts knowledge from public natural language inference datasets into the news domain and learns to generate natural language sentences that explain the hallucination detection results. To evaluate model performance, we carefully collect a dataset with more than six thousand labeled <article, headline> pairs. Extensive experiments on this dataset and six other public ones demonstrate that ExHalder can identify hallucinated headlines accurately and justify its predictions with human-readable natural language explanations.

* WWW 2023, 12 pages 
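The NLI backbone of this setup can be sketched with a public MNLI model standing in for ExHalder's adapted components; the model choice and the 0.5 decision threshold are assumptions, and the explanation-generation half is omitted.

```python
# Sketch of NLI-based hallucination scoring: does the article (premise)
# entail the headline (hypothesis)? Low entailment suggests hallucination.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def entailment_prob(article: str, headline: str) -> float:
    inputs = tok(article, headline, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(-1)[0]
    return probs[model.config.label2id["ENTAILMENT"]].item()

article = "The company reported a 3% revenue increase in Q2."
headline = "Company revenue doubles in record quarter."
print("likely hallucinated:", entailment_prob(article, headline) < 0.5)
```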