Lingjiao Chen


How is ChatGPT's behavior changing over time?

Aug 01, 2023
Lingjiao Chen, Matei Zaharia, James Zou

GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on several diverse tasks: 1) math problems, 2) sensitive/dangerous questions, 3) opinion surveys, 4) multi-hop knowledge-intensive questions, 5) generating code, 6) US Medical License tests, and 7) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was reasonable at identifying prime vs. composite numbers (84% accuracy), but GPT-4 (June 2023) was poor on these same questions (51% accuracy). This is partly explained by a drop in GPT-4's amenability to chain-of-thought prompting. Interestingly, GPT-3.5 was much better in June than in March on this task. GPT-4 became less willing to answer sensitive questions and opinion survey questions in June than in March. GPT-4 performed better on multi-hop questions in June than in March, while GPT-3.5's performance dropped on this task. Both GPT-4 and GPT-3.5 made more formatting mistakes in code generation in June than in March. Overall, our findings show that the behavior of the "same" LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLMs.

* add more evaluations 
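To make the kind of longitudinal check described above concrete, here is a minimal sketch of re-running the same prime-vs-composite prompts against two dated snapshots of an LLM service and comparing accuracy. The `query_llm` helper is a hypothetical placeholder for whatever chat-completion client is used, and the snapshot names are only illustrative; this is not the paper's evaluation code.

```python
import random

def is_prime(n: int) -> bool:
    """Ground-truth label via trial division (fine for small n)."""
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

def query_llm(model: str, prompt: str) -> str:
    """Hypothetical API call; replace with a real chat-completion client."""
    raise NotImplementedError

def evaluate_snapshot(model: str, numbers: list[int]) -> float:
    """Ask for a chain-of-thought answer, then extract the final Yes/No."""
    correct = 0
    for n in numbers:
        prompt = (
            f"Is {n} a prime number? Think step by step, "
            "then answer [Yes] or [No] on the last line."
        )
        answer = query_llm(model, prompt)
        predicted_prime = "[yes]" in answer.lower()
        correct += predicted_prime == is_prime(n)
    return correct / len(numbers)

if __name__ == "__main__":
    random.seed(0)
    numbers = [random.randrange(1_000, 20_000) for _ in range(50)]
    for snapshot in ("gpt-4-0314", "gpt-4-0613"):  # March vs. June snapshots
        try:
            acc = evaluate_snapshot(snapshot, numbers)
            print(f"{snapshot}: accuracy {acc:.2%}")
        except NotImplementedError:
            print(f"{snapshot}: plug in a real client to run this sketch")
```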

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

May 09, 2023
Lingjiao Chen, Matei Zaharia, James Zou

There is a rapidly growing number of large language models (LLMs) that users can query for a fee. We review the cost associated with querying popular LLM APIs (e.g., GPT-4, ChatGPT, J1-Jumbo) and find that these models have heterogeneous pricing structures, with fees that can differ by two orders of magnitude. In particular, using LLMs on large collections of queries and text can be expensive. Motivated by this, we outline and discuss three types of strategies that users can exploit to reduce the inference cost associated with using LLMs: 1) prompt adaptation, 2) LLM approximation, and 3) LLM cascade. As an example, we propose FrugalGPT, a simple yet flexible instantiation of LLM cascade that learns which combinations of LLMs to use for different queries in order to reduce cost and improve accuracy. Our experiments show that FrugalGPT can match the performance of the best individual LLM (e.g., GPT-4) with up to 98% cost reduction, or improve accuracy over GPT-4 by 4% at the same cost. The ideas and findings presented here lay a foundation for using LLMs sustainably and efficiently.
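As a rough illustration of the LLM-cascade strategy (not the paper's released implementation), the sketch below tries cheaper models first and only escalates when a scorer is not confident in the answer. `call_model`, `score_answer`, and all tier names, costs, and thresholds are assumed placeholders.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tier:
    name: str
    cost_per_call: float     # assumed relative cost
    accept_threshold: float  # escalate if the score falls below this

def cascade(query: str,
            tiers: list[Tier],
            call_model: Callable[[str, str], str],
            score_answer: Callable[[str, str], float]) -> tuple[str, float]:
    """Return (answer, total_cost) for one query."""
    total_cost = 0.0
    answer = ""
    for i, tier in enumerate(tiers):
        answer = call_model(tier.name, query)
        total_cost += tier.cost_per_call
        # Last tier: accept unconditionally; otherwise accept only if confident.
        if i == len(tiers) - 1 or score_answer(query, answer) >= tier.accept_threshold:
            break
    return answer, total_cost

# Example wiring (costs and thresholds are illustrative, not from the paper):
tiers = [
    Tier("cheap-llm", cost_per_call=0.1, accept_threshold=0.9),
    Tier("mid-llm", cost_per_call=1.0, accept_threshold=0.8),
    Tier("expensive-llm", cost_per_call=10.0, accept_threshold=0.0),
]
```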


SEAL: Interactive Tool for Systematic Error Analysis and Labeling

Oct 11, 2022
Nazneen Rajani, Weixin Liang, Lingjiao Chen, Meg Mitchell, James Zou

With the advent of Transformers, large language models (LLMs) have saturated well-known NLP benchmarks and leaderboards with high aggregate performance. However, these models often fail systematically on tail data or rare groups that are not obvious in aggregate evaluation. Identifying such problematic data groups is even more challenging when there are no explicit labels (e.g., ethnicity, gender, etc.), and the problem is further compounded for NLP datasets by the lack of visual features to characterize failure modes (e.g., Asian males, animals indoors, waterbirds on land, etc.). This paper introduces an interactive Systematic Error Analysis and Labeling (SEAL) tool that uses a two-step approach: it first identifies high-error slices of data and then, in the second step, applies methods that give human-understandable semantics to those underperforming slices. We explore a variety of methods for deriving coherent semantics for the error groups, using language models for semantic labeling and a text-to-image model for generating visual features. The SEAL toolkit and a demo screencast are available at https://huggingface.co/spaces/nazneen/seal.

* Accepted at EMNLP 2022 demo track 
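A rough sketch of the two-step recipe summarized above, assuming an embedding function and an LM-based describer that are not part of SEAL's actual code: step one clusters example embeddings and ranks the clusters by error rate, and step two would pass the worst slices to a language model for a human-readable label.

```python
import numpy as np
from sklearn.cluster import KMeans

def find_error_slices(texts, labels, preds, embed, n_clusters=10, top_k=3):
    """Cluster example embeddings, then rank clusters by error rate."""
    X = np.vstack([embed(t) for t in texts])  # `embed` is a hypothetical encoder
    cluster_ids = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit_predict(X)
    errors = np.array(labels) != np.array(preds)
    slices = []
    for c in range(n_clusters):
        mask = cluster_ids == c
        if mask.sum() == 0:
            continue
        members = [t for t, m in zip(texts, mask) if m]
        slices.append((float(errors[mask].mean()), members))
    slices.sort(key=lambda s: s[0], reverse=True)
    return slices[:top_k]  # the highest-error slices

def describe_slice(examples):
    """Hypothetical LM call that returns a short human-readable slice label."""
    raise NotImplementedError
```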

HAPI: A Large-scale Longitudinal Dataset of Commercial ML API Predictions

Sep 18, 2022
Lingjiao Chen, Zhihua Jin, Sabri Eyuboglu, Christopher Ré, Matei Zaharia, James Zou

Commercial ML APIs offered by providers such as Google, Amazon and Microsoft have dramatically simplified ML adoption in many applications. Numerous companies and academics pay to use ML APIs for tasks such as object detection, OCR and sentiment analysis. Different ML APIs tackling the same task can have very heterogeneous performance. Moreover, the ML models underlying the APIs also evolve over time. As ML APIs rapidly become a valuable marketplace and a widespread way to consume machine learning, it is critical to systematically study and compare different APIs with each other and to characterize how APIs change over time. However, this topic is currently underexplored due to the lack of data. In this paper, we present HAPI (History of APIs), a longitudinal dataset of 1,761,417 instances of commercial ML API applications (involving APIs from Amazon, Google, IBM, Microsoft and other providers) across diverse tasks including image tagging, speech recognition and text mining from 2020 to 2022. Each instance consists of a query input for an API (e.g., an image or text) along with the API's output prediction/annotation and confidence scores. HAPI is the first large-scale dataset of ML API usage and is a unique resource for studying ML-as-a-service (MLaaS). As examples of the types of analyses that HAPI enables, we show that ML APIs' performance changes substantially over time: several APIs' accuracies dropped on specific benchmark datasets. Even when an API's aggregate performance stays steady, its error modes can shift across different subtypes of data between 2020 and 2022. Such changes can substantially impact entire analytics pipelines that use an ML API as a component. We further use HAPI to study commercial APIs' performance disparities across demographic subgroups over time. HAPI can stimulate more research in the growing field of MLaaS.

* Preprint, to appear in NeurIPS 2022 
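For illustration only, here is a minimal record layout and drift check in the spirit of the analyses HAPI enables; the field names below are assumptions made for this sketch and are not necessarily HAPI's actual schema, which is defined by the dataset release.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class APIPrediction:
    api: str          # e.g. "google_vision" (illustrative name)
    task: str         # e.g. "image_tagging"
    year: int         # 2020, 2021, or 2022
    example_id: str
    prediction: str   # the API's output annotation
    confidence: float
    label: str        # ground-truth annotation

def accuracy_by_year(records: list[APIPrediction]) -> dict[tuple[str, int], float]:
    """Group predictions by (api, year) and compute accuracy, to spot drift."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        key = (r.api, r.year)
        totals[key] += 1
        hits[key] += int(r.prediction == r.label)
    return {k: hits[k] / totals[k] for k in totals}
```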

Estimating and Explaining Model Performance When Both Covariates and Labels Shift

Sep 18, 2022
Lingjiao Chen, Matei Zaharia, James Zou

Deployed machine learning (ML) models often encounter new user data that differs from their training data. Therefore, estimating how well a given model might perform on the new data is an important step toward reliable ML applications. This is very challenging, however, as the data distribution can change in flexible ways, and we may not have any labels on the new data, which is often the case in monitoring settings. In this paper, we propose a new distribution shift model, Sparse Joint Shift (SJS), which considers the joint shift of both labels and a few features. This unifies and generalizes several existing shift models including label shift and sparse covariate shift, where only marginal feature or label distribution shifts are considered. We describe mathematical conditions under which SJS is identifiable. We further propose SEES, an algorithmic framework to characterize the distribution shift under SJS and to estimate a model's performance on new data without any labels. We conduct extensive experiments on several real-world datasets with various ML models. Across different datasets and distribution shifts, SEES achieves significant (up to an order of magnitude) shift estimation error improvements over existing approaches.

* Accepted to NeurIPS 2022 
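SEES handles the general sparse joint shift setting; as a toy point of reference only, the snippet below implements the classic label-shift special case mentioned in the abstract: estimate the target label distribution from unlabeled predictions (black-box shift estimation) and re-weight per-class source accuracy. This is not the SEES algorithm itself.

```python
import numpy as np

def estimate_target_accuracy(y_val, yhat_val, yhat_target, n_classes):
    """Estimate accuracy on an unlabeled target set under pure label shift."""
    # Conditional confusion C[pred, true] = P_source(pred | true)
    C = np.zeros((n_classes, n_classes))
    for t in range(n_classes):
        mask = y_val == t
        if mask.sum():
            C[:, t] = np.bincount(yhat_val[mask], minlength=n_classes) / mask.sum()
    # Target predicted-label distribution
    mu = np.bincount(yhat_target, minlength=n_classes) / len(yhat_target)
    # Solve C q = mu for the target class prior q, then clip and renormalize
    q, *_ = np.linalg.lstsq(C, mu, rcond=None)
    q = np.clip(q, 0, None)
    q = q / q.sum()
    per_class_acc = np.diag(C)        # P_source(pred = y | true = y)
    return float(q @ per_class_acc)   # accuracy if only the label prior shifted
```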

Solon: Communication-efficient Byzantine-resilient Distributed Training via Redundant Gradients

Oct 09, 2021
Lingjiao Chen, Leshang Chen, Hongyi Wang, Susan Davidson, Edgar Dobriban

There has been a growing need to provide Byzantine resilience in distributed model training. Existing robust distributed learning algorithms focus on developing sophisticated robust aggregators at the parameter servers, but pay less attention to balancing communication cost against robustness. In this paper, we propose Solon, an algorithmic framework that exploits gradient redundancy to provide communication efficiency and Byzantine robustness simultaneously. Our theoretical analysis shows a fundamental trade-off among computational load, communication cost, and Byzantine robustness. We also develop a concrete algorithm that achieves the optimal trade-off, borrowing ideas from coding theory and sparse recovery. Empirical experiments on various datasets demonstrate that Solon reaches the same accuracy significantly faster than existing methods: over 10 times faster than Bulyan and 80% faster than Draco. We also show that carefully designed Byzantine attacks break Signum and Bulyan, but do not affect the successful convergence of Solon.
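Solon's actual construction uses coding-theoretic redundancy; the toy sketch below conveys only the underlying intuition with plain replication: each data shard's gradient is computed by several redundant workers, and the server takes a coordinate-wise median across the replicas so that a few Byzantine copies cannot steer the update. This is an illustration, not the paper's algorithm.

```python
import numpy as np

def robust_aggregate(replica_grads: np.ndarray) -> np.ndarray:
    """replica_grads: (r, d) gradients for one shard from r redundant workers."""
    return np.median(replica_grads, axis=0)

def aggregate_all(shard_grads: list[np.ndarray]) -> np.ndarray:
    """Average the robustly aggregated per-shard gradients into one update."""
    return np.mean([robust_aggregate(g) for g in shard_grads], axis=0)

# Toy check: one Byzantine replica per shard does not dominate the median.
rng = np.random.default_rng(0)
true_grad = rng.normal(size=4)
shard = np.vstack([true_grad + rng.normal(scale=0.01, size=4) for _ in range(3)])
shard[0] = 1e6                  # adversarial replica
print(aggregate_all([shard]))   # stays close to true_grad
```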


Did the Model Change? Efficiently Assessing Machine Learning API Shifts

Jul 29, 2021
Lingjiao Chen, Tracy Cai, Matei Zaharia, James Zou

Machine learning (ML) prediction APIs are in increasingly wide use. An ML API can change over time due to model updates or retraining. This presents a key challenge in using these APIs because it is often not clear to the user if and how the ML model has changed. Model shifts can affect downstream application performance and also create oversight issues (e.g., if consistency is desired). In this paper, we initiate a systematic investigation of ML API shifts. We first quantify the performance shifts from 2020 to 2021 of popular ML APIs from Google, Microsoft, Amazon, and others on a variety of datasets. We identify significant model shifts in 12 out of the 36 cases we investigate. Interestingly, we find several datasets where the API's predictions became significantly worse over time. This motivates us to formulate the API shift assessment problem at a more fine-grained level as estimating how the API model's confusion matrix changes over time when the data distribution is constant. Monitoring confusion matrix shifts using standard random sampling can require a large number of samples, which is expensive as each API call costs a fee. We propose a principled adaptive sampling algorithm, MASA, to efficiently estimate confusion matrix shifts. MASA can accurately estimate the confusion matrix shifts in commercial ML APIs using up to 90% fewer samples compared to random sampling. This work establishes ML API shifts as an important problem to study and provides a cost-effective approach to monitor such shifts.
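MASA is a principled adaptive sampler tailored to confusion-matrix shifts; the sketch below shows only the generic pilot-then-allocate idea (a Neyman-style allocation), directing the remaining paid API calls to the classes whose old-vs-new disagreement estimates are noisiest. `call_api_pair` is a hypothetical placeholder that queries the old and new API versions on one item, and the sketch assumes each class has enough items to sample without replacement.

```python
import random
import numpy as np

def adaptive_shift_estimate(items_by_class, call_api_pair, budget, pilot=10, seed=0):
    """Per class, estimate how often the new API's label disagrees with the old one."""
    random.seed(seed)
    classes = list(items_by_class)
    samples = {c: [] for c in classes}
    # Pilot round: a small uniform sample from every class.
    for c in classes:
        for x in random.sample(items_by_class[c], pilot):
            old, new = call_api_pair(x)
            samples[c].append(int(old != new))
    # Neyman-style allocation: spend the rest of the budget where estimates are noisiest.
    stds = np.array([np.std(samples[c]) + 1e-6 for c in classes])
    remaining = max(budget - pilot * len(classes), 0)
    alloc = np.floor(remaining * stds / stds.sum()).astype(int)
    for c, n in zip(classes, alloc):
        for x in random.sample(items_by_class[c], int(n)):
            old, new = call_api_pair(x)
            samples[c].append(int(old != new))
    return {c: float(np.mean(samples[c])) for c in classes}
```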


FrugalMCT: Efficient Online ML API Selection for Multi-Label Classification Tasks

Feb 18, 2021
Lingjiao Chen, Matei Zaharia, James Zou

Multi-label classification tasks such as OCR and multi-object recognition are a major focus of the growing machine learning as a service industry. While many multi-label prediction APIs are available, it is challenging for users to decide which API to use for their own data and budget, due to the heterogeneity in those APIs' price and performance. Recent work shows how to select from single-label prediction APIs. However, the computational complexity of the previous approach is exponential in the number of labels and hence is not suitable for settings like OCR. In this work, we propose FrugalMCT, a principled framework that adaptively selects the APIs to use for different data in an online fashion while respecting the user's budget. The API selection problem is cast as an integer linear program, which we show has a special structure that we leverage to develop an efficient online API selector with strong performance guarantees. We conduct systematic experiments using ML APIs from Google, Microsoft, Amazon, IBM, Tencent and other providers for tasks including multi-label image classification, scene text recognition and named entity recognition. Across diverse tasks, FrugalMCT can achieve over 90% cost reduction while matching the accuracy of the best single API, or up to 8% better accuracy while matching the best API's cost.
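FrugalMCT itself solves an integer linear program with special structure; purely to convey the online-selection interface, the sketch below uses a greedy stand-in: for each incoming item, pick the API with the highest predicted accuracy that keeps average spend within the user's budget. `predict_accuracy`, the API names, and the prices are assumptions, not values from the paper.

```python
from dataclasses import dataclass

@dataclass
class API:
    name: str
    price: float  # assumed cost per call

def select_api(item, apis, predict_accuracy, spent, n_processed, budget_per_item):
    """Greedy online choice: best estimated accuracy among affordable APIs."""
    # Budget remaining if we are to stay within budget_per_item on average.
    remaining = budget_per_item * (n_processed + 1) - spent
    affordable = [a for a in apis if a.price <= remaining]
    if not affordable:
        affordable = [min(apis, key=lambda a: a.price)]  # fall back to the cheapest
    return max(affordable, key=lambda a: predict_accuracy(item, a.name))

# Example wiring (illustrative prices):
apis = [API("cheap-api", 1.0), API("mid-api", 4.0), API("strong-api", 10.0)]
```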


FrugalML: How to Use ML Prediction APIs More Accurately and Cheaply

Jun 12, 2020
Lingjiao Chen, Matei Zaharia, James Zou

Prediction APIs offered for a fee are a fast-growing industry and an important part of machine learning as a service. While many such services are available, the heterogeneity in their price and performance makes it challenging for users to decide which API or combination of APIs to use for their own data and budget. We take a first step towards addressing this challenge by proposing FrugalML, a principled framework that jointly learns the strengths and weaknesses of each API on different data and performs an efficient optimization to automatically identify the best sequential strategy to adaptively use the available APIs within a budget constraint. Our theoretical analysis shows that natural sparsity in the formulation can be leveraged to make FrugalML efficient. We conduct systematic experiments using ML APIs from Google, Microsoft, Amazon, IBM, Baidu and other providers for tasks including facial emotion recognition, sentiment analysis and speech recognition. Across various tasks, FrugalML can achieve up to 90% cost reduction while matching the accuracy of the best single API, or up to 5% better accuracy while matching the best API's cost.
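A simplified two-service version of the sequential strategy FrugalML learns, for intuition only: call a cheap base API first and accept its answer when its reported confidence clears a threshold, otherwise pay for a stronger add-on API. FrugalML itself learns label-dependent thresholds and the best base/add-on combination under a budget; `call_api`, the service names, threshold, and prices below are placeholders.

```python
from typing import Callable, Tuple

def frugal_predict(x,
                   call_api: Callable[[str, object], Tuple[str, float]],
                   base: str = "cheap-api",
                   addon: str = "strong-api",
                   threshold: float = 0.85) -> Tuple[str, float]:
    """Return (label, cost_spent) for one input using a base/add-on cascade."""
    base_cost, addon_cost = 1.0, 10.0  # illustrative relative prices
    label, confidence = call_api(base, x)
    if confidence >= threshold:        # accept the cheap answer when confident
        return label, base_cost
    label, _ = call_api(addon, x)      # otherwise escalate to the stronger API
    return label, base_cost + addon_cost
```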
