Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Bin Yu

Towards Consistent Natural-Language Explanations via Explanation-Consistency Finetuning

Jan 25, 2024

Yanda Chen, Chandan Singh, Xiaodong Liu, Simiao Zuo, Bin Yu, He He, Jianfeng Gao

Figure 1 for Towards Consistent Natural-Language Explanations via Explanation-Consistency Finetuning

Figure 2 for Towards Consistent Natural-Language Explanations via Explanation-Consistency Finetuning

Figure 3 for Towards Consistent Natural-Language Explanations via Explanation-Consistency Finetuning

Figure 4 for Towards Consistent Natural-Language Explanations via Explanation-Consistency Finetuning

Abstract:Large language models (LLMs) often generate convincing, fluent explanations. However, different from humans, they often generate inconsistent explanations on different inputs. For example, an LLM may generate the explanation "all birds can fly" when answering the question "Can sparrows fly?" but meanwhile answer "no" to the related question "Can penguins fly?". Explanations should be consistent across related examples so that they allow a human to simulate the LLM's decision process on multiple examples. We propose explanation-consistency finetuning (EC-finetuning), a method that adapts LLMs to generate more consistent natural-language explanations on related examples. EC-finetuning involves finetuning LLMs on synthetic data that is carefully constructed to contain consistent explanations. Across a variety of question-answering datasets in various domains, EC-finetuning yields a 10.0% relative explanation consistency improvement on four finetuning datasets, and generalizes to seven out-of-distribution datasets not seen during finetuning (+4.5% relative). Code is available at https://github.com/yandachen/explanation-consistency-finetuning .

* arXiv admin note: text overlap with arXiv:2307.08678

Via

Access Paper or Ask Questions

Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs

Nov 03, 2023

Qingru Zhang, Chandan Singh, Liyuan Liu, Xiaodong Liu, Bin Yu, Jianfeng Gao, Tuo Zhao

Abstract:In human-written articles, we often leverage the subtleties of text style, such as bold and italics, to guide the attention of readers. These textual emphases are vital for the readers to grasp the conveyed information. When interacting with large language models (LLMs), we have a similar need - steering the model to pay closer attention to user-specified information, e.g., an instruction. Existing methods, however, are constrained to process plain text and do not support such a mechanism. This motivates us to introduce PASTA - Post-hoc Attention STeering Approach, a method that allows LLMs to read text with user-specified emphasis marks. To this end, PASTA identifies a small subset of attention heads and applies precise attention reweighting on them, directing the model attention to user-specified parts. Like prompting, PASTA is applied at inference time and does not require changing any model parameters. Experiments demonstrate that PASTA can substantially enhance an LLM's ability to follow user instructions or integrate new knowledge from user inputs, leading to a significant performance improvement on a variety of tasks, e.g., an average accuracy improvement of 22% for LLAMA-7B. Our code is publicly available at https://github.com/QingruZhang/PASTA .

* 16 pages

Via

Access Paper or Ask Questions

Using Experience Classification for Training Non-Markovian Tasks

Oct 18, 2023

Ruixuan Miao, Xu Lu, Cong Tian, Bin Yu, Zhenhua Duan

Figure 1 for Using Experience Classification for Training Non-Markovian Tasks

Figure 2 for Using Experience Classification for Training Non-Markovian Tasks

Figure 3 for Using Experience Classification for Training Non-Markovian Tasks

Figure 4 for Using Experience Classification for Training Non-Markovian Tasks

Abstract:Unlike the standard Reinforcement Learning (RL) model, many real-world tasks are non-Markovian, whose rewards are predicated on state history rather than solely on the current state. Solving a non-Markovian task, frequently applied in practical applications such as autonomous driving, financial trading, and medical diagnosis, can be quite challenging. We propose a novel RL approach to achieve non-Markovian rewards expressed in temporal logic LTL$_f$ (Linear Temporal Logic over Finite Traces). To this end, an encoding of linear complexity from LTL$_f$ into MDPs (Markov Decision Processes) is introduced to take advantage of advanced RL algorithms. Then, a prioritized experience replay technique based on the automata structure (semantics equivalent to LTL$_f$ specification) is utilized to improve the training process. We empirically evaluate several benchmark problems augmented with non-Markovian tasks to demonstrate the feasibility and effectiveness of our approach.

Via

Access Paper or Ask Questions

Prominent Roles of Conditionally Invariant Components in Domain Adaptation: Theory and Algorithms

Sep 19, 2023

Keru Wu, Yuansi Chen, Wooseok Ha, Bin Yu

Abstract:Domain adaptation (DA) is a statistical learning problem that arises when the distribution of the source data used to train a model differs from that of the target data used to evaluate the model. While many DA algorithms have demonstrated considerable empirical success, blindly applying these algorithms can often lead to worse performance on new datasets. To address this, it is crucial to clarify the assumptions under which a DA algorithm has good target performance. In this work, we focus on the assumption of the presence of conditionally invariant components (CICs), which are relevant for prediction and remain conditionally invariant across the source and target data. We demonstrate that CICs, which can be estimated through conditional invariant penalty (CIP), play three prominent roles in providing target risk guarantees in DA. First, we propose a new algorithm based on CICs, importance-weighted conditional invariant penalty (IW-CIP), which has target risk guarantees beyond simple settings such as covariate shift and label shift. Second, we show that CICs help identify large discrepancies between source and target risks of other DA algorithms. Finally, we demonstrate that incorporating CICs into the domain invariant projection (DIP) algorithm can address its failure scenario caused by label-flipping features. We support our new algorithms and theoretical findings via numerical experiments on synthetic data, MNIST, CelebA, and Camelyon17 datasets.

Via

Access Paper or Ask Questions

TrafficGPT: Viewing, Processing and Interacting with Traffic Foundation Models

Sep 13, 2023

Siyao Zhang, Daocheng Fu, Zhao Zhang, Bin Yu, Pinlong Cai

Figure 1 for TrafficGPT: Viewing, Processing and Interacting with Traffic Foundation Models

Figure 2 for TrafficGPT: Viewing, Processing and Interacting with Traffic Foundation Models

Figure 3 for TrafficGPT: Viewing, Processing and Interacting with Traffic Foundation Models

Figure 4 for TrafficGPT: Viewing, Processing and Interacting with Traffic Foundation Models

Abstract:With the promotion of chatgpt to the public, Large language models indeed showcase remarkable common sense, reasoning, and planning skills, frequently providing insightful guidance. These capabilities hold significant promise for their application in urban traffic management and control. However, LLMs struggle with addressing traffic issues, especially processing numerical data and interacting with simulations, limiting their potential in solving traffic-related challenges. In parallel, specialized traffic foundation models exist but are typically designed for specific tasks with limited input-output interactions. Combining these models with LLMs presents an opportunity to enhance their capacity for tackling complex traffic-related problems and providing insightful suggestions. To bridge this gap, we present TrafficGPT, a fusion of ChatGPT and traffic foundation models. This integration yields the following key enhancements: 1) empowering ChatGPT with the capacity to view, analyze, process traffic data, and provide insightful decision support for urban transportation system management; 2) facilitating the intelligent deconstruction of broad and complex tasks and sequential utilization of traffic foundation models for their gradual completion; 3) aiding human decision-making in traffic control through natural language dialogues; and 4) enabling interactive feedback and solicitation of revised outcomes. By seamlessly intertwining large language model and traffic expertise, TrafficGPT not only advances traffic management but also offers a novel approach to leveraging AI capabilities in this domain. The TrafficGPT demo can be found in https://github.com/lijlansg/TrafficGPT.git.

Via

Access Paper or Ask Questions

The Effect of SGD Batch Size on Autoencoder Learning: Sparsity, Sharpness, and Feature Learning

Aug 06, 2023

Nikhil Ghosh, Spencer Frei, Wooseok Ha, Bin Yu

Figure 1 for The Effect of SGD Batch Size on Autoencoder Learning: Sparsity, Sharpness, and Feature Learning

Figure 2 for The Effect of SGD Batch Size on Autoencoder Learning: Sparsity, Sharpness, and Feature Learning

Abstract:In this work, we investigate the dynamics of stochastic gradient descent (SGD) when training a single-neuron autoencoder with linear or ReLU activation on orthogonal data. We show that for this non-convex problem, randomly initialized SGD with a constant step size successfully finds a global minimum for any batch size choice. However, the particular global minimum found depends upon the batch size. In the full-batch setting, we show that the solution is dense (i.e., not sparse) and is highly aligned with its initialized direction, showing that relatively little feature learning occurs. On the other hand, for any batch size strictly smaller than the number of samples, SGD finds a global minimum which is sparse and nearly orthogonal to its initialization, showing that the randomness of stochastic gradients induces a qualitatively different type of "feature selection" in this setting. Moreover, if we measure the sharpness of the minimum by the trace of the Hessian, the minima found with full batch gradient descent are flatter than those found with strictly smaller batch sizes, in contrast to previous works which suggest that large batches lead to sharper minima. To prove convergence of SGD with a constant step size, we introduce a powerful tool from the theory of non-homogeneous random walks which may be of independent interest.

Via

Access Paper or Ask Questions

Improving Prototypical Part Networks with Reward Reweighing, Reselection, and Retraining

Jul 08, 2023

Robin Netzorg, Jiaxun Li, Bin Yu

Figure 1 for Improving Prototypical Part Networks with Reward Reweighing, Reselection, and Retraining

Figure 2 for Improving Prototypical Part Networks with Reward Reweighing, Reselection, and Retraining

Figure 3 for Improving Prototypical Part Networks with Reward Reweighing, Reselection, and Retraining

Figure 4 for Improving Prototypical Part Networks with Reward Reweighing, Reselection, and Retraining

Abstract:In recent years, work has gone into developing deep interpretable methods for image classification that clearly attributes a model's output to specific features of the data. One such of these methods is the prototypical part network (ProtoPNet), which attempts to classify images based on meaningful parts of the input. While this method results in interpretable classifications, this method often learns to classify from spurious or inconsistent parts of the image. Hoping to remedy this, we take inspiration from the recent developments in Reinforcement Learning with Human Feedback (RLHF) to fine-tune these prototypes. By collecting human annotations of prototypes quality via a 1-5 scale on the CUB-200-2011 dataset, we construct a reward model that learns to identify non-spurious prototypes. In place of a full RL update, we propose the reweighted, reselected, and retrained prototypical part network (R3-ProtoPNet), which adds an additional three steps to the ProtoPNet training loop. The first two steps are reward-based reweighting and reselection, which align prototypes with human feedback. The final step is retraining to realign the model's features with the updated prototypes. We find that R3-ProtoPNet improves the overall consistency and meaningfulness of the prototypes, but lower the test predictive accuracy when used independently. When multiple R3-ProtoPNets are incorporated into an ensemble, we find an increase in test predictive performance while maintaining interpretability.

Via

Access Paper or Ask Questions

SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference

Jul 05, 2023

Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, Subhabrata Mukherjee

Abstract:Autoregressive large language models (LLMs) have made remarkable progress in various natural language generation tasks. However, they incur high computation cost and latency resulting from the autoregressive token-by-token generation. To address this issue, several approaches have been proposed to reduce computational cost using early-exit strategies. These strategies enable faster text generation using reduced computation without applying the full computation graph to each token. While existing token-level early exit methods show promising results for online inference, they cannot be readily applied for batch inferencing and Key-Value caching. This is because they have to wait until the last token in a batch exits before they can stop computing. This severely limits the practical application of such techniques. In this paper, we propose a simple and effective token-level early exit method, SkipDecode, designed to work seamlessly with batch inferencing and KV caching. It overcomes prior constraints by setting up a singular exit point for every token in a batch at each sequence position. It also guarantees a monotonic decrease in exit points, thereby eliminating the need to recompute KV Caches for preceding tokens. Rather than terminating computation prematurely as in prior works, our approach bypasses lower to middle layers, devoting most of the computational resources to upper layers, allowing later tokens to benefit from the compute expenditure by earlier tokens. Our experimental results show that SkipDecode can obtain 2x to 5x inference speedups with negligible regression across a variety of tasks. This is achieved using OPT models of 1.3 billion and 6.7 billion parameters, all the while being directly compatible with batching and KV caching optimization techniques.

Via

Access Paper or Ask Questions

MDI+: A Flexible Random Forest-Based Feature Importance Framework

Jul 04, 2023

Abhineet Agarwal, Ana M. Kenney, Yan Shuo Tan, Tiffany M. Tang, Bin Yu

Figure 1 for MDI+: A Flexible Random Forest-Based Feature Importance Framework

Figure 2 for MDI+: A Flexible Random Forest-Based Feature Importance Framework

Figure 3 for MDI+: A Flexible Random Forest-Based Feature Importance Framework

Figure 4 for MDI+: A Flexible Random Forest-Based Feature Importance Framework

Abstract:Mean decrease in impurity (MDI) is a popular feature importance measure for random forests (RFs). We show that the MDI for a feature $X_k$ in each tree in an RF is equivalent to the unnormalized $R^2$ value in a linear regression of the response on the collection of decision stumps that split on $X_k$. We use this interpretation to propose a flexible feature importance framework called MDI+. Specifically, MDI+ generalizes MDI by allowing the analyst to replace the linear regression model and $R^2$ metric with regularized generalized linear models (GLMs) and metrics better suited for the given data structure. Moreover, MDI+ incorporates additional features to mitigate known biases of decision trees against additive or smooth models. We further provide guidance on how practitioners can choose an appropriate GLM and metric based upon the Predictability, Computability, Stability framework for veridical data science. Extensive data-inspired simulations show that MDI+ significantly outperforms popular feature importance measures in identifying signal features. We also apply MDI+ to two real-world case studies on drug response prediction and breast cancer subtype classification. We show that MDI+ extracts well-established predictive genes with significantly greater stability compared to existing feature importance measures. All code and models are released in a full-fledged python package on Github.

Via

Access Paper or Ask Questions

An Investigation into the Effects of Pre-training Data Distributions for Pathology Report Classification

May 27, 2023

Aliyah R. Hsu, Yeshwanth Cherapanamjeri, Briton Park, Tristan Naumann, Anobel Y. Odisho, Bin Yu

Figure 1 for An Investigation into the Effects of Pre-training Data Distributions for Pathology Report Classification

Figure 2 for An Investigation into the Effects of Pre-training Data Distributions for Pathology Report Classification

Figure 3 for An Investigation into the Effects of Pre-training Data Distributions for Pathology Report Classification

Figure 4 for An Investigation into the Effects of Pre-training Data Distributions for Pathology Report Classification

Abstract:Pre-trained transformer models have demonstrated success across many natural language processing (NLP) tasks. In applying these models to the clinical domain, a prevailing assumption is that pre-training language models from scratch on large-scale biomedical data results in substantial improvements. We test this assumption with 4 pathology classification tasks on a corpus of 2907 prostate cancer pathology reports. We evaluate 5 transformer pre-trained models that are the same size but differ in pre-training corpora. Specifically, we analyze 3 categories of models: 1)General-domain: BERT and Turing Natural Language Representation (TNLR) models, which use general corpora for pre-training, 2)Mixed-domain: BioBERT which is obtained from BERT by including PubMed abstracts in pre-training and Clinical BioBERT which additionally includes MIMIC-III clinical notes and 3)Domain-specific: PubMedBERT which is pre-trained from scratch on PubMed abstracts. We find the mixed-domain and domain-specific models exhibit faster feature disambiguation during fine-tuning. However, the domain-specific model, PubMedBERT, can overfit to minority classes when presented with class imbalance, a common scenario in pathology report data. At the same time, the mixed-domain models are more resistant to overfitting. Our findings indicate that the use of general natural language and domain-specific corpora in pre-training serve complementary purposes for pathology report classification. The first enables resistance to overfitting when fine-tuning on an imbalanced dataset while the second allows for more accurate modelling of the fine-tuning domain. An expert evaluation is also conducted to reveal common outlier modes of each model. Our results could inform better fine-tuning practices in the clinical domain, to possibly leverage the benefits of mixed-domain models for imbalanced downstream datasets.

Via

Access Paper or Ask Questions