Cho-Jui Hsieh

UCLA

PEFA: Parameter-Free Adapters for Large-scale Embedding-based Retrieval Models

Dec 06, 2023

Wei-Cheng Chang, Jyun-Yu Jiang, Jiong Zhang, Mutasem Al-Darabsah, Choon Hui Teo, Cho-Jui Hsieh, Hsiang-Fu Yu, S. V. N. Vishwanathan

Abstract: Embedding-based Retrieval Models (ERMs) have emerged as a promising framework for large-scale text retrieval problems due to powerful large language models. Nevertheless, fine-tuning ERMs to reach state-of-the-art results can be expensive due to the extreme scale of data as well as the complexity of multi-stage pipelines (e.g., pre-training, fine-tuning, distillation). In this work, we propose the PEFA framework, namely ParamEter-Free Adapters, for fast tuning of ERMs without any backward pass in the optimization. At the index-building stage, PEFA equips the ERM with a non-parametric k-nearest neighbor (kNN) component. At the inference stage, PEFA performs a convex combination of two scoring functions, one from the ERM and the other from the kNN. Based on the neighborhood definition, the PEFA framework induces two realizations, namely PEFA-XL (i.e., extra large) using double ANN indices and PEFA-XS (i.e., extra small) using a single ANN index. Empirically, PEFA achieves significant improvement on two retrieval applications. For document retrieval, in terms of Recall@100, PEFA improves not only pre-trained ERMs on Trivia-QA by an average of 13.2%, but also fine-tuned ERMs on NQ-320K by an average of 5.5%. For product search, PEFA improves the Recall@100 of the fine-tuned ERMs by an average of 5.3% and 14.5%, for PEFA-XS and PEFA-XL, respectively. Our code is available at https://github.com/amzn/pecos/tree/mainline/examples/pefa-wsdm24.

* Accepted by WSDM 2024
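The inference-time scoring described in the PEFA abstract, a convex combination of an ERM score and a kNN score, can be sketched roughly as follows. This is only an illustration under assumptions, not the authors' implementation: exact search stands in for the ANN indices, and the function name `pefa_scores`, the weight `lam`, and the similarity-weighted voting used for the kNN score are all hypothetical choices.

```python
import numpy as np

def pefa_scores(q, doc_emb, train_q_emb, train_labels, lam=0.5, k=2):
    """Convex combination of an ERM score and a kNN score (illustrative only).

    q            : (d,) query embedding
    doc_emb      : (n_docs, d) document embeddings
    train_q_emb  : (n_train, d) embeddings of training queries
    train_labels : list of lists; train_labels[i] holds the indices of the
                   documents relevant to training query i (assumed format)
    """
    # ERM score: inner product between the query and every document.
    erm = doc_emb @ q
    # kNN score: find the k training queries most similar to q ...
    sims = train_q_emb @ q
    neighbors = np.argsort(-sims)[:k]
    # ... and let each neighbor vote for its relevant documents,
    # weighted by its similarity to q (a hypothetical aggregation rule).
    knn = np.zeros(doc_emb.shape[0])
    for i in neighbors:
        knn[train_labels[i]] += sims[i]
    # Convex combination: lam = 1 recovers the ERM, lam = 0 the kNN component.
    return lam * erm + (1.0 - lam) * knn

docs = np.eye(3)
train_q = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
labels = [[2], [1]]
query = np.array([1.0, 0.0, 0.0])
scores = pefa_scores(query, docs, train_q, labels, lam=0.5, k=1)
```

No gradient is computed anywhere in this sketch, which is the sense in which such an adapter involves no backward pass.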

Improving the Generation Quality of Watermarked Large Language Models via Word Importance Scoring

Nov 16, 2023

Yuhang Li, Yihan Wang, Zhouxing Shi, Cho-Jui Hsieh

Abstract: The strong general capabilities of Large Language Models (LLMs) bring potential ethical risks if they are unrestrictedly accessible to malicious users. Token-level watermarking inserts watermarks in the generated texts by altering the token probability distributions with a private random number generator seeded by its prefix tokens. However, this watermarking algorithm alters the logits during generation, which can degrade text quality if it chooses to promote tokens that are less relevant given the input. In this work, we propose to improve the quality of texts generated by a watermarked language model by Watermarking with Importance Scoring (WIS). At each generation step, we estimate the importance of the token to generate, and prevent it from being impacted by watermarking if it is important for the semantic correctness of the output. We further propose three methods to predict importance scores, including a perturbation-based method and two model-based methods. Empirical experiments show that our method can generate texts of better quality at a comparable detection rate.

* Work in progress
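The idea in the abstract, biasing token logits with a generator seeded by the prefix while skipping steps judged important, can be sketched as below. This is a simplified illustration under assumptions: the green-list scheme, the parameters `key`, `gamma`, `delta`, and `threshold`, and the scalar `importance` input are hypothetical, and the paper's actual importance predictors are not modeled.

```python
import numpy as np

def watermark_logits(logits, prev_token, importance, key=42,
                     gamma=0.5, delta=2.0, threshold=0.5):
    """Promote 'green' tokens unless the current step is judged important."""
    if importance >= threshold:
        # Important step: leave the distribution untouched to protect semantics.
        return logits.copy()
    # Seed a private generator with a secret key and the preceding token.
    rng = np.random.default_rng([key, prev_token])
    vocab = logits.shape[0]
    # A pseudo-random fraction gamma of the vocabulary forms the green list.
    green = rng.permutation(vocab)[: int(gamma * vocab)]
    out = logits.copy()
    out[green] += delta  # bias generation toward green-list tokens
    return out

base = np.zeros(10)
marked = watermark_logits(base, prev_token=3, importance=0.1)
skipped = watermark_logits(base, prev_token=3, importance=0.9)
```

Because the green list depends only on the key and the prefix token, a detector holding the key can recompute it; steps that were skipped simply contribute no watermark signal.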

A Computationally Efficient Sparsified Online Newton Method

Nov 16, 2023

Fnu Devvrit, Sai Surya Duvvuri, Rohan Anil, Vineet Gupta, Cho-Jui Hsieh, Inderjit Dhillon

Abstract:Second-order methods hold significant promise for enhancing the convergence of deep neural network training; however, their large memory and computational demands have limited their practicality. Thus there is a need for scalable second-order methods that can efficiently train large models. In this paper, we introduce the Sparsified Online Newton (SONew) method, a memory-efficient second-order algorithm that yields a sparsified yet effective preconditioner. The algorithm emerges from a novel use of the LogDet matrix divergence measure; we combine it with sparsity constraints to minimize regret in the online convex optimization framework. Empirically, we test our method on large scale benchmarks of up to 1B parameters. We achieve up to 30% faster convergence, 3.4% relative improvement in validation performance, and 80% relative improvement in training loss, in comparison to memory efficient optimizers including first order methods. Powering the method is a surprising fact -- imposing structured sparsity patterns, like tridiagonal and banded structure, requires little to no overhead, making it as efficient and parallelizable as first-order methods. In wall-clock time, tridiagonal SONew is only about 3% slower per step than first-order methods but gives overall gains due to much faster convergence. In contrast, one of the state-of-the-art (SOTA) memory-intensive second-order methods, Shampoo, is unable to scale to large benchmarks. Additionally, while Shampoo necessitates significant engineering efforts to scale to large benchmarks, SONew offers a more straightforward implementation, increasing its practical appeal. SONew code is available at: https://github.com/devvrit/SONew

* 30 pages. First two authors contributed equally. Accepted at NeurIPS 2023
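For reference, the LogDet matrix divergence named in the abstract has the following standard definition for $n \times n$ positive definite matrices $X$ and $Y$. How SONew combines it with sparsity constraints is not spelled out in the abstract and is not reproduced here.

$$
D_{\ell d}(X, Y) = \operatorname{tr}\left(X Y^{-1}\right) - \log\det\left(X Y^{-1}\right) - n
$$

The divergence is zero exactly when $X = Y$ and is positive otherwise.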

Automatic Engineering of Long Prompts

Nov 16, 2023

Cho-Jui Hsieh, Si Si, Felix X. Yu, Inderjit S. Dhillon

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in solving complex open-domain tasks, guided by comprehensive instructions and demonstrations provided in the form of prompts. However, these prompts can be lengthy, often comprising hundreds of lines and thousands of tokens, and their design often requires considerable human effort. Recent research has explored automatic prompt engineering for short prompts, typically consisting of one or a few sentences. However, the automatic design of long prompts remains a challenging problem due to its immense search space. In this paper, we investigate the performance of greedy algorithms and genetic algorithms for automatic long prompt engineering. We demonstrate that a simple greedy approach with beam search outperforms other methods in terms of search efficiency. Moreover, we introduce two novel techniques that utilize search history to enhance the effectiveness of LLM-based mutation in our search algorithm. Our results show that the proposed automatic long prompt engineering algorithm achieves an average of 9.2% accuracy gain on eight tasks in Big Bench Hard, highlighting the significance of automating prompt designs to fully harness the capabilities of LLMs.
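A greedy search with a beam over long prompts, as described in the abstract, might look roughly like the sketch below. It is a toy illustration under assumptions: `mutate` stands in for an LLM-based rewrite of one sentence, `score` stands in for accuracy on held-out examples, and the function name and parameters are hypothetical. The history-guided techniques mentioned in the abstract are not modeled.

```python
import random

def beam_search_prompt(sentences, score, mutate, beam_size=2, steps=5, seed=0):
    """Greedy prompt search that keeps the `beam_size` best prompts per step."""
    rng = random.Random(seed)
    beam = [list(sentences)]
    for _ in range(steps):
        # Current prompts stay in the pool, so the best score never decreases.
        candidates = list(beam)
        for prompt in beam:
            i = rng.randrange(len(prompt))   # pick one sentence to rewrite
            new_prompt = list(prompt)
            new_prompt[i] = mutate(prompt[i])
            candidates.append(new_prompt)
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_size]
    return beam[0]

# Toy stand-ins: "better" means more upper-case sentences.
initial = ["a", "b", "c"]
toy_score = lambda prompt: sum(s.isupper() for s in prompt)
best = beam_search_prompt(initial, toy_score, str.upper, steps=10)
```

Rewriting one sentence at a time keeps each step cheap even when the prompt has hundreds of sentences, which is one plausible reason a simple greedy scheme can be efficient in such a large search space.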

Stochastic Optimization for Non-convex Problem with Inexact Hessian Matrix, Gradient, and Function

Oct 18, 2023

Liu Liu, Xuanqing Liu, Cho-Jui Hsieh, Dacheng Tao

Abstract: Trust-region (TR) and adaptive regularization using cubics (ARC) have proven to have some very appealing theoretical properties for non-convex optimization by concurrently computing function value, gradient, and Hessian matrix to obtain the next search direction and the adjusted parameters. Although stochastic approximations greatly reduce the computational cost, it is challenging to theoretically guarantee the convergence rate. In this paper, we explore a family of stochastic TR and ARC methods that can simultaneously provide inexact computations of the Hessian matrix, gradient, and function values. Our algorithms incur much less propagation overhead per iteration than TR and ARC. We prove that the iteration complexity to achieve $\epsilon$-approximate second-order optimality is of the same order as the exact computations demonstrated in previous studies. Additionally, the mild conditions on inexactness can be met by leveraging a random sampling technique in the finite-sum minimization problem. Numerical experiments with a non-convex problem support these findings and demonstrate that, with the same or a similar number of iterations, our algorithms require less computational overhead per iteration than current second-order methods.

* arXiv admin note: text overlap with arXiv:1809.09853

Randomized Benchmarking of Local Zeroth-Order Optimizers for Variational Quantum Systems

Oct 14, 2023

Lucas Tecot, Cho-Jui Hsieh

Abstract:In the field of quantum information, classical optimizers play an important role. From experimentalists optimizing their physical devices to theorists exploring variational quantum algorithms, many aspects of quantum information require the use of a classical optimizer. For this reason, there are many papers that benchmark the effectiveness of different optimizers for specific quantum optimization tasks and choices of parameterized algorithms. However, for researchers exploring new algorithms or physical devices, the insights from these studies don't necessarily translate. To address this concern, we compare the performance of classical optimizers across a series of partially-randomized tasks to more broadly sample the space of quantum optimization problems. We focus on local zeroth-order optimizers due to their generally favorable performance and query-efficiency on quantum systems. We discuss insights from these experiments that can help motivate future works to improve these optimizers for use on quantum systems.

Why Does Sharpness-Aware Minimization Generalize Better Than SGD?

Oct 11, 2023

Zixiang Chen, Junkai Zhang, Yiwen Kou, Xiangning Chen, Cho-Jui Hsieh, Quanquan Gu

Abstract:The challenge of overfitting, in which the model memorizes the training data and fails to generalize to test data, has become increasingly significant in the training of large neural networks. To tackle this challenge, Sharpness-Aware Minimization (SAM) has emerged as a promising training method, which can improve the generalization of neural networks even in the presence of label noise. However, a deep understanding of how SAM works, especially in the setting of nonlinear neural networks and classification tasks, remains largely missing. This paper fills this gap by demonstrating why SAM generalizes better than Stochastic Gradient Descent (SGD) for a certain data model and two-layer convolutional ReLU networks. The loss landscape of our studied problem is nonsmooth, thus current explanations for the success of SAM based on the Hessian information are insufficient. Our result explains the benefits of SAM, particularly its ability to prevent noise learning in the early stages, thereby facilitating more effective learning of features. Experiments on both synthetic and real data corroborate our theory.

* 52 pages, 4 figures, 2 tables. In NeurIPS 2023
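For readers unfamiliar with SAM, one step of its commonly used update rule can be sketched as follows. This shows the standard SAM formulation rather than the specific data model or two-layer convolutional network analyzed in the paper; the function name `sam_step` and the defaults for `lr` and `rho` are illustrative assumptions.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM step: take the gradient at a nearby worst-case point."""
    g = grad_fn(w)
    # Move a distance rho along the normalized gradient (toward higher loss).
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Use the gradient at the perturbed point for the actual descent step.
    g_perturbed = grad_fn(w + eps)
    return w - lr * g_perturbed

# Toy quadratic loss 0.5 * ||w||^2, whose gradient is w itself.
w0 = np.array([1.0, 0.0])
w1 = sam_step(w0, lambda w: w)
```

Plain SGD would instead use `grad_fn(w)` directly in the final line; the extra gradient evaluation at the perturbed point is the only difference.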

MinPrompt: Graph-based Minimal Prompt Data Augmentation for Few-shot Question Answering

Oct 08, 2023

Xiusi Chen, Jyun-Yu Jiang, Wei-Cheng Chang, Cho-Jui Hsieh, Hsiang-Fu Yu, Wei Wang

Abstract: Few-shot question answering (QA) aims at achieving satisfactory results on machine question answering when only a few training samples are available. Recent advances mostly rely on the power of pre-trained large language models (LLMs) and fine-tuning in specific settings. Although the pre-training stage has already equipped LLMs with powerful reasoning capabilities, LLMs still need to be fine-tuned to adapt to specific domains to achieve the best results. In this paper, we propose to select the most informative data for fine-tuning, thereby improving the efficiency of the fine-tuning process with comparable or even better accuracy on the open-domain QA task. We present MinPrompt, a minimal data augmentation framework for open-domain QA based on an approximate graph algorithm and unsupervised question generation. We transform the raw text into a graph structure to build connections between different factual sentences, then apply graph algorithms to identify the minimal set of sentences needed to cover the most information in the raw text. We then generate QA pairs based on the identified sentence subset and train the model on the selected sentences to obtain the final model. Empirical results on several benchmark datasets and theoretical analysis show that MinPrompt is able to achieve comparable or better results than baselines with a high degree of efficiency, bringing improvements in F-1 scores by up to 27.5%.
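The sentence-selection step, finding a small set of sentences that covers most of the information in the text, can be illustrated with a greedy cover heuristic. This is an assumption-laden sketch: the abstract does not name the exact graph algorithm, so greedy set cover is swapped in here, and the function name and the sentence-to-entities input format are hypothetical.

```python
def greedy_sentence_cover(sentence_entities):
    """Pick sentences greedily until every entity is mentioned at least once.

    sentence_entities: dict mapping a sentence id to the set of entities
    it mentions (an assumed, simplified view of the sentence graph).
    """
    uncovered = set().union(*sentence_entities.values())
    chosen = []
    while uncovered:
        # Take the sentence that covers the most still-uncovered entities.
        best = max(sentence_entities,
                   key=lambda s: len(sentence_entities[s] & uncovered))
        chosen.append(best)
        uncovered -= sentence_entities[best]
    return chosen

example = {"s1": {"A", "B"}, "s2": {"B"}, "s3": {"C"}}
selected = greedy_sentence_cover(example)
```

In this toy input, sentence `s2` is redundant because `s1` already mentions its only entity, so it is never selected.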

Red Teaming Language Model Detectors with Language Models

May 31, 2023

Zhouxing Shi, Yihan Wang, Fan Yin, Xiangning Chen, Kai-Wei Chang, Cho-Jui Hsieh

Abstract:The prevalence and high capacity of large language models (LLMs) present significant safety and ethical risks when malicious users exploit them for automated content generation. To prevent the potentially deceptive usage of LLMs, recent works have proposed several algorithms to detect machine-generated text. In this paper, we systematically test the reliability of the existing detectors, by designing two types of attack strategies to fool the detectors: 1) replacing words with their synonyms based on the context; 2) altering the writing style of generated text. These strategies are implemented by instructing LLMs to generate synonymous word substitutions or writing directives that modify the style without human involvement, and the LLMs leveraged in the attack can also be protected by detectors. Our research reveals that our attacks effectively compromise the performance of all tested detectors, thereby underscoring the urgent need for the development of more robust machine-generated text detection systems.

* Work in progress. Zhouxing Shi, Yihan Wang and Fan Yin are ordered alphabetically

Representer Point Selection for Explaining Regularized High-dimensional Models

May 31, 2023

Che-Ping Tsai, Jiong Zhang, Eli Chien, Hsiang-Fu Yu, Cho-Jui Hsieh, Pradeep Ravikumar

Abstract: We introduce a novel class of sample-based explanations, which we term high-dimensional representers, that can be used to explain the predictions of a regularized high-dimensional model in terms of importance weights for each of the training samples. Our workhorse is a novel representer theorem for general regularized high-dimensional models, which decomposes the model prediction in terms of contributions from each of the training samples, with positive (negative) values corresponding to training samples that have a positive (negative) impact on the model's prediction. We derive consequences for the canonical instances of $\ell_1$ regularized sparse models, and nuclear norm regularized low-rank models. As a case study, we further investigate the application of low-rank models in the context of collaborative filtering, where we instantiate high-dimensional representers for specific popular classes of models. Finally, we study the empirical performance of our proposed methods on three real-world binary classification datasets and two recommender system datasets. We also showcase the utility of high-dimensional representers in explaining model recommendations.

* Accepted by ICML 2023
