Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Bin Yu

Veridical Data Science for Medical Foundation Models

Sep 15, 2024

Ahmed Alaa, Bin Yu

Figure 1 for Veridical Data Science for Medical Foundation Models

Abstract:The advent of foundation models (FMs) such as large language models (LLMs) has led to a cultural shift in data science, both in medicine and beyond. This shift involves moving away from specialized predictive models trained for specific, well-defined domain questions to generalist FMs pre-trained on vast amounts of unstructured data, which can then be adapted to various clinical tasks and questions. As a result, the standard data science workflow in medicine has been fundamentally altered; the foundation model lifecycle (FMLC) now includes distinct upstream and downstream processes, in which computational resources, model and data access, and decision-making power are distributed among multiple stakeholders. At their core, FMs are fundamentally statistical models, and this new workflow challenges the principles of Veridical Data Science (VDS), hindering the rigorous statistical analysis expected in transparent and scientifically reproducible data science practices. We critically examine the medical FMLC in light of the core principles of VDS: predictability, computability, and stability (PCS), and explain how it deviates from the standard data science workflow. Finally, we propose recommendations for a reimagined medical FMLC that expands and refines the PCS principles for VDS including considering the computational and accessibility constraints inherent to FMs.

Via

Access Paper or Ask Questions

Mechanistic Interpretation through Contextual Decomposition in Transformers

Jul 01, 2024

Aliyah R. Hsu, Yeshwanth Cherapanamjeri, Anobel Y. Odisho, Peter R. Carroll, Bin Yu

Figure 1 for Mechanistic Interpretation through Contextual Decomposition in Transformers

Figure 2 for Mechanistic Interpretation through Contextual Decomposition in Transformers

Abstract:Transformers exhibit impressive capabilities but are often regarded as black boxes due to challenges in understanding the complex nonlinear relationships between features. Interpreting machine learning models is of paramount importance to mitigate risks, and mechanistic interpretability is in particular of current interest as it opens up a window for guiding manual modifications and reverse-engineering solutions. In this work, we introduce contextual decomposition for transformers (CD-T), extending a prior work on CD for RNNs and CNNs, to address mechanistic interpretation computationally efficiently. CD-T is a flexible interpretation method for transformers. It can capture contributions of combinations of input features or source internal components (e.g. attention heads, feed-forward networks) to (1) final predictions or (2) the output of any target internal component. Using CD-T, we propose a novel algorithm for circuit discovery. On a real-world pathology report classification task: we show CD-T distills a more faithful circuit of attention heads with improved computational efficiency (speed up 2x) than a prior benchmark, path patching. As a versatile interpretation method, CD-T also exhibits exceptional capabilities for local interpretations. CD-T is shown to reliably find words and phrases of contrasting sentiment/topic on SST-2 and AGNews datasets. Through human experiments, we demonstrate CD-T enables users to identify the more accurate of two models and to better trust a model's outputs compared to alternative interpretation methods such as SHAP and LIME.

Via

Access Paper or Ask Questions

The Computational Curse of Big Data for Bayesian Additive Regression Trees: A Hitting Time Analysis

Jun 28, 2024

Yan Shuo Tan, Omer Ronen, Theo Saarinen, Bin Yu

Figure 1 for The Computational Curse of Big Data for Bayesian Additive Regression Trees: A Hitting Time Analysis

Figure 2 for The Computational Curse of Big Data for Bayesian Additive Regression Trees: A Hitting Time Analysis

Figure 3 for The Computational Curse of Big Data for Bayesian Additive Regression Trees: A Hitting Time Analysis

Figure 4 for The Computational Curse of Big Data for Bayesian Additive Regression Trees: A Hitting Time Analysis

Abstract:Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression model that is commonly used in causal inference and beyond. Its strong predictive performance is supported by theoretical guarantees that its posterior distribution concentrates around the true regression function at optimal rates under various data generative settings and for appropriate prior choices. In this paper, we show that the BART sampler often converges slowly, confirming empirical observations by other researchers. Assuming discrete covariates, we show that, while the BART posterior concentrates on a set comprising all optimal tree structures (smallest bias and complexity), the Markov chain's hitting time for this set increases with $n$ (training sample size), under several common data generative settings. As $n$ increases, the approximate BART posterior thus becomes increasingly different from the exact posterior (for the same number of MCMC samples), contrasting with earlier concentration results on the exact posterior. This contrast is highlighted by our simulations showing worsening frequentist undercoverage for approximate posterior intervals and a growing ratio between the MSE of the approximate posterior and that obtainable by artificially improving convergence via averaging multiple sampler chains. Finally, based on our theoretical insights, possibilities are discussed to improve the BART sampler convergence performance.

Via

Access Paper or Ask Questions

ScaLES: Scalable Latent Exploration Score for Pre-Trained Generative Networks

Jun 14, 2024

Omer Ronen, Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk, Bin Yu

Figure 1 for ScaLES: Scalable Latent Exploration Score for Pre-Trained Generative Networks

Figure 2 for ScaLES: Scalable Latent Exploration Score for Pre-Trained Generative Networks

Figure 3 for ScaLES: Scalable Latent Exploration Score for Pre-Trained Generative Networks

Figure 4 for ScaLES: Scalable Latent Exploration Score for Pre-Trained Generative Networks

Abstract:We develop Scalable Latent Exploration Score (ScaLES) to mitigate over-exploration in Latent Space Optimization (LSO), a popular method for solving black-box discrete optimization problems. LSO utilizes continuous optimization within the latent space of a Variational Autoencoder (VAE) and is known to be susceptible to over-exploration, which manifests in unrealistic solutions that reduce its practicality. ScaLES is an exact and theoretically motivated method leveraging the trained decoder's approximation of the data distribution. ScaLES can be calculated with any existing decoder, e.g. from a VAE, without additional training, architectural changes, or access to the training data. Our evaluation across five LSO benchmark tasks and three VAE architectures demonstrates that ScaLES enhances the quality of the solutions while maintaining high objective values, leading to improvements over existing solutions. We believe that new avenues to LSO will be opened by ScaLES ability to identify out of distribution areas, differentiability, and computational tractability. Open source code for ScaLES is available at https://github.com/OmerRonen/scales.

Via

Access Paper or Ask Questions

The Impact of Initialization on LoRA Finetuning Dynamics

Jun 12, 2024

Soufiane Hayou, Nikhil Ghosh, Bin Yu

Figure 1 for The Impact of Initialization on LoRA Finetuning Dynamics

Figure 2 for The Impact of Initialization on LoRA Finetuning Dynamics

Figure 3 for The Impact of Initialization on LoRA Finetuning Dynamics

Figure 4 for The Impact of Initialization on LoRA Finetuning Dynamics

Abstract:In this paper, we study the role of initialization in Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021). Essentially, to start from the pretrained model as initialization for finetuning, one can either initialize B to zero and A to random (default initialization in PEFT package), or vice-versa. In both cases, the product BA is equal to zero at initialization, which makes finetuning starts from the pretrained model. These two initialization schemes are seemingly similar. They should in-principle yield the same performance and share the same optimal learning rate. We demonstrate that this is an incorrect intuition and that the first scheme (initializing B to zero and A to random) on average yields better performance compared to the other scheme. Our theoretical analysis shows that the reason behind this might be that the first initialization allows the use of larger learning rates (without causing output instability) compared to the second initialization, resulting in more efficient learning of the first scheme. We validate our results with extensive experiments on LLMs.

* TDLR: Different Initializations lead to completely different finetuning dynamics. One initialization (set A random and B zero) is generally better than the natural opposite initialization. arXiv admin note: text overlap with arXiv:2402.12354

Via

Access Paper or Ask Questions

Minimum-Norm Interpolation Under Covariate Shift

Mar 31, 2024

Neil Mallinar, Austin Zane, Spencer Frei, Bin Yu

Figure 1 for Minimum-Norm Interpolation Under Covariate Shift

Figure 2 for Minimum-Norm Interpolation Under Covariate Shift

Figure 3 for Minimum-Norm Interpolation Under Covariate Shift

Figure 4 for Minimum-Norm Interpolation Under Covariate Shift

Abstract:Transfer learning is a critical part of real-world machine learning deployments and has been extensively studied in experimental works with overparameterized neural networks. However, even in the simplest setting of linear regression a notable gap still exists in the theoretical understanding of transfer learning. In-distribution research on high-dimensional linear regression has led to the identification of a phenomenon known as \textit{benign overfitting}, in which linear interpolators overfit to noisy training labels and yet still generalize well. This behavior occurs under specific conditions on the source covariance matrix and input data dimension. Therefore, it is natural to wonder how such high-dimensional linear models behave under transfer learning. We prove the first non-asymptotic excess risk bounds for benignly-overfit linear interpolators in the transfer learning setting. From our analysis, we propose a taxonomy of \textit{beneficial} and \textit{malignant} covariate shifts based on the degree of overparameterization. We follow our analysis with empirical studies that show these beneficial and malignant covariate shifts for linear interpolators on real image data, and for fully-connected neural networks in settings where the input data dimension is larger than the training sample size.

Via

Access Paper or Ask Questions

A Signature Based Approach Towards Global Channel Charting with Ultra Low Complexity

Mar 29, 2024

Longhai Zhao, Yunchuan Yang, Qi Xiong, He Wang, Bin Yu, Feifei Sun, Chengjun Sun

Figure 1 for A Signature Based Approach Towards Global Channel Charting with Ultra Low Complexity

Figure 2 for A Signature Based Approach Towards Global Channel Charting with Ultra Low Complexity

Figure 3 for A Signature Based Approach Towards Global Channel Charting with Ultra Low Complexity

Figure 4 for A Signature Based Approach Towards Global Channel Charting with Ultra Low Complexity

Abstract:Channel charting, an unsupervised learning method that learns a low-dimensional representation from channel information to preserve geometrical property of physical space of user equipments (UEs), has drawn many attentions from both academic and industrial communities, because it can facilitate many downstream tasks, such as indoor localization, UE handover, beam management, and so on. However, many previous works mainly focus on charting that only preserves local geometry and use raw channel information to learn the chart, which do not consider the global geometry and are often computationally intensive and very time-consuming. Therefore, in this paper, a novel signature based approach for global channel charting with ultra low complexity is proposed. By using an iterated-integral based method called signature transform, a compact feature map and a novel distance metric are proposed, which enable channel charting with ultra low complexity and preserving both local and global geometry. We demonstrate the efficacy of our method using synthetic and open-source real-field datasets.

* accepted by IEEE ICC 2024 Workshops

Via

Access Paper or Ask Questions

Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency

Feb 24, 2024

Jingfeng Wu, Peter L. Bartlett, Matus Telgarsky, Bin Yu

Figure 1 for Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency

Figure 2 for Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency

Figure 3 for Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency

Abstract:We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $\eta$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1 / (\eta t) )$ convergence rate after $t$ additional steps. Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize $\eta:= \Theta( T)$, without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the $\tilde{\mathcal{O}}(1/T^2)$ acceleration), nonlinear predictors in the neural tangent kernel regime, and online stochastic gradient descent (SGD) with a large stepsize, under suitable separability conditions.

Via

Access Paper or Ask Questions

ED-Copilot: Reduce Emergency Department Wait Time with Language Model Diagnostic Assistance

Feb 21, 2024

Liwen Sun, Abhineet Agarwal, Aaron Kornblith, Bin Yu, Chenyan Xiong

Abstract:In the emergency department (ED), patients undergo triage and multiple laboratory tests before diagnosis. This process is time-consuming, and causes ED crowding which significantly impacts patient mortality, medical errors, staff burnout, etc. This work proposes (time) cost-effective diagnostic assistance that explores the potential of artificial intelligence (AI) systems in assisting ED clinicians to make time-efficient and accurate diagnoses. Using publicly available patient data, we collaborate with ED clinicians to curate MIMIC-ED-Assist, a benchmark that measures the ability of AI systems in suggesting laboratory tests that minimize ED wait times, while correctly predicting critical outcomes such as death. We develop ED-Copilot which sequentially suggests patient-specific laboratory tests and makes diagnostic predictions. ED-Copilot uses a pre-trained bio-medical language model to encode patient information and reinforcement learning to minimize ED wait time and maximize prediction accuracy of critical outcomes. On MIMIC-ED-Assist, ED-Copilot improves prediction accuracy over baselines while halving average wait time from four hours to two hours. Ablation studies demonstrate the importance of model scale and use of a bio-medical language model. Further analyses reveal the necessity of personalized laboratory test suggestions for diagnosing patients with severe cases, as well as the potential of ED-Copilot in providing ED clinicians with informative laboratory test recommendations. Our code is available at https://github.com/cxcscmu/ED-Copilot.

Via

Access Paper or Ask Questions

LoRA+: Efficient Low Rank Adaptation of Large Models

Feb 19, 2024

Soufiane Hayou, Nikhil Ghosh, Bin Yu

Figure 1 for LoRA+: Efficient Low Rank Adaptation of Large Models

Figure 2 for LoRA+: Efficient Low Rank Adaptation of Large Models

Figure 3 for LoRA+: Efficient Low Rank Adaptation of Large Models

Figure 4 for LoRA+: Efficient Low Rank Adaptation of Large Models

Abstract:In this paper, we show that Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021) leads to suboptimal finetuning of models with large width (embedding dimension). This is due to the fact that adapter matrices A and B in LoRA are updated with the same learning rate. Using scaling arguments for large width networks, we demonstrate that using the same learning rate for A and B does not allow efficient feature learning. We then show that this suboptimality of LoRA can be corrected simply by setting different learning rates for the LoRA adapter matrices A and B with a well-chosen ratio. We call this proposed algorithm LoRA$+$. In our extensive experiments, LoRA$+$ improves performance (1-2 $\%$ improvements) and finetuning speed (up to $\sim$ 2X SpeedUp), at the same computational cost as LoRA.

* 27 pages

Via

Access Paper or Ask Questions