Audrey Durand

Nested-ReFT: Efficient Reinforcement Learning for Large Language Model Fine-Tuning via Off-Policy Rollouts

Aug 13, 2025

Maxime Heuillet, Yufei Cui, Boxing Chen, Audrey Durand, Prasanna Parthasarathi

Abstract: Advanced reasoning in LLMs on challenging domains like mathematical reasoning can be tackled using reinforced fine-tuning (ReFT) based on verifiable rewards. In standard ReFT frameworks, a behavior model generates multiple completions with answers per problem, and each answer is then scored by a reward function. While such RL post-training methods demonstrate significant performance improvements across challenging reasoning domains, the computational cost of generating completions during training with multiple inference steps makes the training cost non-trivial. To address this, we draw inspiration from off-policy RL and speculative decoding to introduce a novel ReFT framework, dubbed Nested-ReFT, where a subset of layers of the target model acts as the behavior model to generate off-policy completions during training. The behavior model, configured with dynamic layer skipping per batch during training, decreases the inference cost compared to standard ReFT frameworks. Our theoretical analysis shows that Nested-ReFT yields unbiased gradient estimates with controlled variance. Our empirical analysis demonstrates improved computational efficiency, measured as tokens/sec, across multiple math reasoning benchmarks and model sizes. Additionally, we explore three variants of bias mitigation to minimize the off-policyness in the gradient updates, which allows performance to match the baseline ReFT performance.

WAPTS: A Weighted Allocation Probability Adjusted Thompson Sampling Algorithm for High-Dimensional and Sparse Experiment Settings

Jan 07, 2025

Haochen Song, Ilya Musabirov, Ananya Bhattacharjee, Audrey Durand, Meredith Franklin, Anna Rafferty, Joseph Jay Williams

Abstract: Experiment design scenarios such as video content advertising, where different content options compete for user engagement, can be modeled as multi-armed bandit problems. When experiments are costly to conduct, recommenders are constrained by the small number of user interactions available. In addition, there is a trade-off between selecting the best treatment and the ability to personalize and contextualize based on individual factors. A popular solution to this dilemma is the Contextual Bandit framework. It aims to maximize outcomes while incorporating personalization (contextual) factors, such as a user's profile, to customize treatments to individual preferences. Despite their advantages, Contextual Bandit algorithms face challenges like measurement bias and the 'curse of dimensionality.' These issues complicate the management of numerous interventions and often lead to data sparsity through participant segmentation. To address these problems, we introduce the Weighted Allocation Probability Adjusted Thompson Sampling (WAPTS) algorithm. WAPTS builds on the contextual Thompson Sampling method by using a dynamic weighting parameter. This improves the allocation process for interventions and enables rapid optimization in data-sparse environments. We demonstrate the performance of our approach on different numbers of arms and effect sizes.
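For readers unfamiliar with the Thompson Sampling method that WAPTS builds on, here is a minimal sketch of the standard, non-contextual Beta-Bernoulli version. It is not the WAPTS algorithm: it has no context and no dynamic weighting parameter. The function name, the uniform Beta(1, 1) prior, and the simulated reward probabilities are all assumptions made for illustration.

```python
import random


def thompson_sampling(true_probs, horizon, seed=0):
    """Standard Beta-Bernoulli Thompson Sampling on simulated arms.

    true_probs: hypothetical success probability of each arm (simulation only).
    Returns (pulls, successes) as per-arm lists.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    successes = [0] * n_arms
    failures = [0] * n_arms

    for _ in range(horizon):
        # Draw one sample per arm from its Beta posterior (Beta(1, 1) prior).
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        # Play the arm whose sampled success rate is highest.
        arm = max(range(n_arms), key=lambda a: samples[a])
        # Simulate a Bernoulli reward for the chosen arm.
        if rng.random() < true_probs[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

    pulls = [s + f for s, f in zip(successes, failures)]
    return pulls, successes


if __name__ == "__main__":
    pulls, successes = thompson_sampling([0.2, 0.5, 0.7], horizon=1000)
    print("pulls per arm:", pulls)
    print("successes per arm:", successes)
```

Because each arm is played with the probability that it is the best one under the current posterior, allocation shifts toward better arms as evidence accumulates, which is the allocation behavior the abstract describes adjusting.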

On shallow planning under partial observability

Jul 22, 2024

Randy Lefebvre, Audrey Durand

Abstract: Formulating a real-world problem under the Reinforcement Learning framework involves non-trivial design choices, such as selecting a discount factor for the learning objective (discounted cumulative rewards), which articulates the planning horizon of the agent. This work investigates the impact of the discount factor on the bias-variance trade-off given structural parameters of the underlying Markov Decision Process. Our results support the idea that a shorter planning horizon might be beneficial, especially under partial observability.

* Presented at deployable RL (RLC conference 2024)
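To make the link between the discount factor and the planning horizon concrete, here is a small sketch. It computes a discounted cumulative reward and the common rule-of-thumb effective horizon 1 / (1 - gamma). The function names are illustrative assumptions, and nothing here reproduces the paper's analysis.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], accumulated backwards from the last step."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def effective_horizon(gamma):
    """Rule-of-thumb planning horizon 1 / (1 - gamma), for 0 <= gamma < 1."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return 1.0 / (1.0 - gamma)


if __name__ == "__main__":
    rewards = [1.0] * 50
    for gamma in (0.5, 0.9, 0.99):
        print(
            f"gamma={gamma}: horizon ~ {effective_horizon(gamma):.1f}, "
            f"return = {discounted_return(rewards, gamma):.3f}"
        )
```

A smaller gamma shrinks the weight of distant rewards geometrically, so the agent effectively plans over fewer steps.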

Neural Active Learning Meets the Partial Monitoring Framework

May 14, 2024

Maxime Heuillet, Ola Ahmad, Audrey Durand

Abstract: We focus on the online-based active learning (OAL) setting where an agent operates over a stream of observations and trades off between the costly acquisition of information (labelled observations) and the cost of prediction errors. We propose a novel foundation for OAL tasks based on partial monitoring (PM), a theoretical framework specialized in online learning from partially informative actions. We show that previously studied binary and multi-class OAL tasks are instances of partial monitoring. We expand the real-world potential of OAL by introducing a new class of cost-sensitive OAL tasks. We propose NeuralCBP, the first PM strategy that accounts for predictive uncertainty with deep neural networks. Our extensive empirical evaluation on open source datasets shows that NeuralCBP has favorable performance against state-of-the-art baselines on multiple binary, multi-class and cost-sensitive OAL tasks.

Randomized Confidence Bounds for Stochastic Partial Monitoring

Feb 07, 2024

Maxime Heuillet, Ola Ahmad, Audrey Durand

Abstract: The partial monitoring (PM) framework provides a theoretical formulation of sequential learning problems with incomplete feedback. On each round, a learning agent plays an action while the environment simultaneously chooses an outcome. The agent then observes a feedback signal that is only partially informative about the (unobserved) outcome. The agent leverages the received feedback signals to select actions that minimize the (unobserved) cumulative loss. In contextual PM, the outcomes depend on some side information that is observable by the agent before selecting the action on each round. In this paper, we consider the contextual and non-contextual PM settings with stochastic outcomes. We introduce a new class of strategies based on the randomization of deterministic confidence bounds that extend regret guarantees to settings where existing stochastic strategies are not applicable. Our experiments show that the proposed RandCBP and RandCBPside* strategies improve over state-of-the-art baselines in PM games. To encourage the adoption of the PM framework, we design a use case on the real-world problem of monitoring the error rate of any deployed classification system.

Association Rules Mining with Auto-Encoders

Apr 26, 2023

Théophile Berteloot, Richard Khoury, Audrey Durand

Abstract: Association rule mining is one of the most studied research fields of data mining, with applications ranging from grocery basket problems to explainable classification systems. Classical association rule mining algorithms have several limitations, especially with regard to their high execution times and number of rules produced. Over the past decade, neural network solutions have been used to solve various optimization problems, such as classification, regression or clustering. However, there is still no efficient way to mine association rules using neural networks. In this paper, we present an auto-encoder solution to mine association rules, called ARM-AE. We compare our algorithm to FP-Growth and NSGAII on three categorical datasets, and show that our algorithm discovers rule sets with high support and confidence and has a better execution time than classical methods while preserving the quality of the rule set produced.
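The support and confidence measures mentioned in the abstract can be illustrated with a short sketch using their standard definitions. This is not the ARM-AE algorithm; the function names and the toy grocery transactions are assumptions made for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        return 0.0  # rule never applies; report zero rather than divide by zero
    joint = set(antecedent) | set(consequent)
    return support(transactions, joint) / antecedent_support


if __name__ == "__main__":
    # Hypothetical grocery baskets.
    baskets = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]
    print("support(bread, milk):", support(baskets, {"bread", "milk"}))
    print("confidence(bread -> milk):", confidence(baskets, {"bread"}, {"milk"}))
```

A rule with high support applies to many transactions, and a rule with high confidence is rarely wrong when its antecedent is present; rule miners typically look for rules that score well on both.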

Interpret Your Care: Predicting the Evolution of Symptoms for Cancer Patients

Feb 19, 2023

Rupali Bhati, Jennifer Jones, Audrey Durand

Abstract: Cancer treatment is an arduous process for patients and causes many side effects during and after treatment. The treatment can affect almost all body systems and result in pain, fatigue, sleep disturbances, cognitive impairments, etc. These conditions are often under-diagnosed or under-treated. In this paper, we use patient data to predict the evolution of their symptoms such that treatment-related impairments can be prevented or effects meaningfully ameliorated. The focus of this study is on predicting the pain and tiredness level of a patient after their diagnosis. We implement an interpretable decision-tree-based model called LightGBM on real-world patient data consisting of 20163 patients. There exists a class imbalance problem in the dataset, which we resolve using the oversampling technique of SMOTE. Our empirical results show that the value of the previous level of a symptom is a key indicator for prediction and the weighted average deviation in prediction of pain level is 3.52 and of tiredness level is 2.27.

Neural Bandits for Data Mining: Searching for Dangerous Polypharmacy

Dec 10, 2022

Alexandre Larouche, Audrey Durand, Richard Khoury, Caroline Sirois

Abstract: Polypharmacy, most often defined as the simultaneous consumption of five or more drugs, is a prevalent phenomenon in the older population. Some of these polypharmacies, deemed inappropriate, may be associated with adverse health outcomes such as death or hospitalization. Considering the combinatorial nature of the problem as well as the size of claims databases and the cost to compute an exact association measure for a given drug combination, it is impossible to investigate every possible combination of drugs. Therefore, we propose to optimize the search for potentially inappropriate polypharmacies (PIPs). To this end, we propose the OptimNeuralTS strategy, based on Neural Thompson Sampling and differential evolution, to efficiently mine claims datasets and build a predictive model of the association between drug combinations and health outcomes. We benchmark our method using two datasets generated by an internally developed simulator of polypharmacy data containing 500 drugs and 100 000 distinct combinations. Empirically, our method can detect up to 33% of PIPs while maintaining an average precision score of 99% using 10 000 time steps.

Cambrian Explosion Algorithm for Multi-Objective Association Rules Mining

Nov 23, 2022

Théophile Berteloot, Richard Khoury, Audrey Durand

Abstract: Association rule mining is one of the most studied research fields of data mining, with applications ranging from grocery basket problems to highly explainable classification systems. Classical association rule mining algorithms have several flaws, especially with regard to their execution times, memory usage and number of rules produced. An alternative is the use of meta-heuristics, which have been used on several optimisation problems. This paper has two objectives. First, we provide a comparison of the performances of state-of-the-art meta-heuristics on the association rule mining problem. We use the multi-objective versions of those algorithms using support, confidence and cosine. Second, we propose a new algorithm designed to mine rules efficiently from massive datasets by exploring a large variety of solutions, akin to the explosion of species diversity of the Cambrian Explosion. We compare our algorithm to 20 benchmark algorithms on 22 real-world datasets, and show that our algorithm presents good results and outperforms several state-of-the-art algorithms.

GrowSpace: Learning How to Shape Plants

Oct 15, 2021

Yasmeen Hitti, Ionelia Buzatu, Manuel Del Verme, Mark Lefsrud, Florian Golemo, Audrey Durand

Abstract: Plants are dynamic systems that are integral to our existence and survival. Plants face environment changes and adapt over time to their surrounding conditions. We argue that plant responses to an environmental stimulus are a good example of a real-world problem that can be approached within a reinforcement learning (RL) framework. With the objective of controlling a plant by moving the light source, we propose GrowSpace as a new RL benchmark. The back-end of the simulator is implemented using the Space Colonisation Algorithm, a plant growing model based on competition for space. Compared to video game RL environments, this simulator addresses a real-world problem and serves as a test bed to visualize plant growth and movement in a faster way than physical experiments. GrowSpace is composed of a suite of challenges that tackle several problems such as control, multi-stage learning, fairness and multi-objective learning. We provide agent baselines alongside case studies to demonstrate the difficulty of the proposed benchmark.
