Alexandre Drouin

Evaluating Interventional Reasoning Capabilities of Large Language Models

Apr 08, 2024

Tejas Kasetty, Divyat Mahajan, Gintare Karolina Dziugaite, Alexandre Drouin, Dhanya Sridhar

Abstract: Numerous decision-making tasks require estimating causal effects under interventions on different parts of a system. As practitioners consider using large language models (LLMs) to automate decisions, studying their causal reasoning capabilities becomes crucial. A recent line of work evaluates LLMs' ability to retrieve commonsense causal facts, but these evaluations do not sufficiently assess how LLMs reason about interventions. Motivated by the role that interventions play in causal inference, in this paper, we conduct empirical analyses to evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention. We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types, and enable a study of intervention-based reasoning. These benchmarks allow us to separate the ability of LLMs to accurately predict changes caused by an intervention from their ability to memorize facts or find other shortcuts. Our analysis on four LLMs highlights that while GPT-4 models show promising accuracy at predicting the intervention effects, they remain sensitive to distracting factors in the prompts.

* 17 pages
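
To make "updating a data-generating process in response to an intervention" concrete, here is a minimal sketch of graph surgery on a causal graph, where intervening on a variable removes its incoming edges. This is an illustration only, not the benchmark or code from the paper; the function names (`intervene`, `ancestors`, `causes`) and the toy graphs are hypothetical.

```python
from typing import Dict, Set

# Hypothetical representation: each variable maps to the set of its direct causes.
Graph = Dict[str, Set[str]]


def intervene(graph: Graph, target: str) -> Graph:
    """Return a copy of the graph after do(target): the target loses its parents."""
    return {
        node: (set() if node == target else set(parents))
        for node, parents in graph.items()
    }


def ancestors(graph: Graph, node: str) -> Set[str]:
    """Collect every variable that has a directed path into `node`."""
    found: Set[str] = set()
    stack = list(graph[node])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(graph[current])
    return found


def causes(graph: Graph, cause: str, effect: str) -> bool:
    """True if `cause` is a direct or indirect cause of `effect`."""
    return cause in ancestors(graph, effect)


# Toy confounding graph (assumed for illustration): Z -> X, Z -> Y, X -> Y
confounding: Graph = {"Z": set(), "X": {"Z"}, "Y": {"Z", "X"}}
# Toy mediation graph (assumed for illustration): X -> M -> Y
mediation: Graph = {"X": set(), "M": {"X"}, "Y": {"M"}}

after_do_x = intervene(confounding, "X")
after_do_m = intervene(mediation, "M")

print(causes(confounding, "Z", "X"), causes(after_do_x, "Z", "X"))
print(causes(mediation, "X", "Y"), causes(after_do_m, "X", "Y"))
```

In the confounding graph, Z stops being a cause of X once X is set by intervention, while X still causes Y. In the mediation graph, intervening on the mediator M cuts the only path from X to Y.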

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

Mar 12, 2024

Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez (+2 more)

Abstract: We study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents' ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 29 tasks based on the widely used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.

* 27 pages, 10 figures, preprint

Capture the Flag: Uncovering Data Insights with Large Language Models

Dec 21, 2023

Issam Laradji, Perouz Taslakian, Sai Rajeswar, Valentina Zantedeschi, Alexandre Lacoste, Nicolas Chapados, David Vazquez, Christopher Pal, Alexandre Drouin

Abstract: The extraction of a small number of relevant insights from vast amounts of data is a crucial component of data-driven decision-making. However, accomplishing this task requires considerable technical skills, domain expertise, and human labor. This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data, leveraging recent advances in reasoning and code generation techniques. We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset. We further propose two proof-of-concept agents, with different inner workings, and compare their ability to capture such flags in a real-world sales dataset. While the work reported here is preliminary, our results are sufficiently interesting to mandate future exploration by the community.

* 14 pages, 1 figure, Foundation Models for Decision Making Workshop at NeurIPS 2023

Lag-Llama: Towards Foundation Models for Time Series Forecasting

Oct 12, 2023

Kashif Rasul, Arjun Ashok, Andrew Robert Williams, Arian Khorasani, George Adamopoulos, Rishika Bhagwatkar, Marin Biloš, Hena Ghonia, Nadhir Vincent Hassen, Anderson Schneider (+5 more)

Abstract: Aiming to build foundation models for time-series forecasting and study their scaling behavior, we present here our work-in-progress on Lag-Llama, a general-purpose univariate probabilistic time-series forecasting model trained on a large collection of time-series data. The model shows good zero-shot prediction capabilities on unseen "out-of-distribution" time-series datasets, outperforming supervised baselines. We use smoothly broken power-laws to fit and predict model scaling behavior. The open source code is made available at https://github.com/kashif/pytorch-transformer-ts.

TACTiS-2: Better, Faster, Simpler Attentional Copulas for Multivariate Time Series

Oct 02, 2023

Arjun Ashok, Étienne Marcotte, Valentina Zantedeschi, Nicolas Chapados, Alexandre Drouin

Abstract: We introduce a new model for multivariate probabilistic time series prediction, designed to flexibly address a range of tasks including forecasting, interpolation, and their combinations. Building on copula theory, we propose a simplified objective for the recently introduced transformer-based attentional copulas (TACTiS), wherein the number of distributional parameters now scales linearly with the number of variables instead of factorially. The new objective requires the introduction of a training curriculum, which goes hand-in-hand with necessary changes to the original architecture. We show that the resulting model has significantly better training dynamics and achieves state-of-the-art performance across diverse real-world forecasting tasks, while maintaining the flexibility of prior work, such as seamless handling of unaligned and unevenly sampled time series.

Benchmarking Bayesian Causal Discovery Methods for Downstream Treatment Effect Estimation

Jul 30, 2023

Chris Chinenye Emezue, Alexandre Drouin, Tristan Deleu, Stefan Bauer, Yoshua Bengio

Abstract: The practical utility of causality in decision-making is widespread and brought about by the intertwining of causal discovery and causal inference. Nevertheless, a notable gap exists in the evaluation of causal discovery methods, where insufficient emphasis is placed on downstream inference. To address this gap, we evaluate seven established baseline causal discovery methods, including a newly proposed method based on GFlowNets, on the downstream task of treatment effect estimation. Through the implementation of a distribution-level evaluation, we offer valuable and unique insights into the efficacy of these causal discovery methods for treatment effect estimation, considering both synthetic and real-world scenarios, as well as low-data scenarios. The results of our study demonstrate that some of the algorithms studied are able to effectively capture a wide range of useful and diverse ATE modes, while some tend to learn many low-probability modes, which impacts the (unrelaxed) recall and precision.

* Peer-reviewed and accepted to the ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling

Causal Discovery with Language Models as Imperfect Experts

Jul 05, 2023

Stephanie Long, Alexandre Piché, Valentina Zantedeschi, Tibor Schuster, Alexandre Drouin

Abstract: Understanding the causal relationships that underlie a system is a fundamental prerequisite to accurate decision-making. In this work, we explore how expert knowledge can be used to improve the data-driven identification of causal graphs, beyond Markov equivalence classes. In doing so, we consider a setting where we can query an expert about the orientation of causal relationships between variables, but where the expert may provide erroneous information. We propose strategies for amending such expert knowledge based on consistency properties, e.g., acyclicity and conditional independencies in the equivalence class. We then report a case study, on real data, where a large language model is used as an imperfect expert.

* Peer reviewed and accepted for presentation at the Structured Probabilistic Inference & Generative Modeling (SPIGM) workshop at ICML 2023, Hawaii, USA

Invariant Causal Set Covering Machines

Jun 07, 2023

Thibaud Godon, Baptiste Bauvin, Pascal Germain, Jacques Corbeil, Alexandre Drouin

Abstract: Rule-based models, such as decision trees, appeal to practitioners due to their interpretable nature. However, the learning algorithms that produce such models are often vulnerable to spurious associations and thus are not guaranteed to extract causally relevant insights. In this work, we build on ideas from the invariant causal prediction literature to propose Invariant Causal Set Covering Machines, an extension of the classical Set Covering Machine algorithm for conjunctions/disjunctions of binary-valued rules that provably avoids spurious associations. We demonstrate both theoretically and empirically that our method can identify the causal parents of a variable of interest in polynomial time.

GEO-Bench: Toward Foundation Models for Earth Monitoring

Jun 06, 2023

Alexandre Lacoste, Nils Lehmann, Pau Rodriguez, Evan David Sherwin, Hannah Kerner, Björn Lütjens, Jeremy Andrew Irvin, David Dao, Hamed Alemohammad, Alexandre Drouin (+7 more)

Abstract: Recent progress in self-supervision has shown that pre-training large neural networks on vast amounts of unsupervised data can lead to substantial increases in generalization to downstream tasks. Such models, recently coined foundation models, have been transformational to the field of natural language processing. Variants have also been proposed for image data, but their applicability to remote sensing tasks is limited. To stimulate the development of foundation models for Earth monitoring, we propose a benchmark comprised of six classification and six segmentation tasks, which were carefully curated and adapted to be both relevant to the field and well-suited for model evaluation. We accompany this benchmark with a robust methodology for evaluating models and reporting aggregated results to enable a reliable assessment of progress. Finally, we report results for 20 baselines to gain information about the performance of existing models. We believe that this benchmark will be a driver of progress across a variety of Earth monitoring tasks.

* arXiv admin note: text overlap with arXiv:2112.00570

Regions of Reliability in the Evaluation of Multivariate Probabilistic Forecasts

Apr 19, 2023

Étienne Marcotte, Valentina Zantedeschi, Alexandre Drouin, Nicolas Chapados

Abstract: Multivariate probabilistic time series forecasts are commonly evaluated via proper scoring rules, i.e., functions that are minimal in expectation for the ground-truth distribution. However, this property is not sufficient to guarantee good discrimination in the non-asymptotic regime. In this paper, we provide the first systematic finite-sample study of proper scoring rules for time-series forecasting evaluation. Through a power analysis, we identify the "region of reliability" of a scoring rule, i.e., the set of practical conditions where it can be relied on to identify forecasting errors. We carry out our analysis on a comprehensive synthetic benchmark, specifically designed to test several key discrepancies between ground-truth and forecast distributions, and we gauge the generalizability of our findings to real-world tasks with an application to an electricity production problem. Our results reveal critical shortcomings in the evaluation of multivariate probabilistic forecasts as commonly performed in the literature.

* 37 pages, 28 figures
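
The gap between "minimal in expectation" and finite-sample discrimination can be illustrated with a toy univariate example. The sketch below is not the paper's benchmark or methodology: it assumes a standard normal ground truth, a forecast whose mean is shifted by a hypothetical amount (`shift`), and the Gaussian negative log-likelihood as the proper scoring rule. It estimates how often the ground-truth distribution obtains the strictly better average score for a given sample size.

```python
import math
import random


def gaussian_nll(y: float, mean: float, std: float) -> float:
    """Negative log-likelihood of y under N(mean, std^2); lower is better."""
    return 0.5 * math.log(2 * math.pi * std ** 2) + (y - mean) ** 2 / (2 * std ** 2)


def mean_score(samples, mean: float, std: float) -> float:
    """Average score of a Gaussian forecast over a finite set of observations."""
    return sum(gaussian_nll(y, mean, std) for y in samples) / len(samples)


def detection_rate(n: int, trials: int = 200, shift: float = 0.2) -> float:
    """Fraction of trials in which the true distribution N(0, 1) scores strictly
    better than the shifted forecast N(shift, 1), using n observations per trial."""
    wins = 0
    for trial in range(trials):
        rng = random.Random(trial)  # fixed seed per trial for reproducibility
        samples = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if mean_score(samples, 0.0, 1.0) < mean_score(samples, shift, 1.0):
            wins += 1
    return wins / trials


rate_small = detection_rate(n=5)
rate_large = detection_rate(n=2000)
print(f"n=5: {rate_small:.2f}, n=2000: {rate_large:.2f}")
```

With many observations the scoring rule almost always favors the ground truth, while with only a handful it frequently prefers the wrong forecast, even though the rule is proper.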
