Ning Miao

Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey

Oct 02, 2025

Qiyuan Liu, Hao Xu, Xuhong Chen, Wei Chen, Yee Whye Teh, Ning Miao

Abstract:Reward models (RMs) play a critical role in enhancing the reasoning performance of LLMs. For example, they can provide training signals to finetune LLMs during reinforcement learning (RL) and help select the best answer from multiple candidates during inference. In this paper, we provide a systematic introduction to RMs, along with a comprehensive survey of their applications in LLM reasoning. We first review fundamental concepts of RMs, including their architectures, training methodologies, and evaluation techniques. Then, we explore their key applications: (1) guiding generation and selecting optimal outputs during LLM inference, (2) facilitating data synthesis and iterative self-improvement for LLMs, and (3) providing training signals in RL-based finetuning. Finally, we address critical open questions regarding the selection, generalization, evaluation, and enhancement of RMs, based on existing research and our own empirical findings. Our analysis aims to provide actionable insights for the effective deployment and advancement of RMs for LLM reasoning.
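
The inference-time use mentioned in the abstract, selecting the best answer from multiple candidates, can be sketched as best-of-N selection. This is a minimal illustration only: `best_of_n` and `toy_reward` are hypothetical names, and the toy scoring function stands in for a learned reward model rather than reproducing anything from the survey.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], reward_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward (the first one wins ties)."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=reward_fn)


def toy_reward(answer: str) -> float:
    """Hypothetical stand-in for a reward model: prefers answers containing '42'."""
    return 1.0 if "42" in answer else 0.0


if __name__ == "__main__":
    answers = ["41", "The answer is 42", "43"]
    print(best_of_n(answers, toy_reward))
```

In practice `reward_fn` would wrap a trained reward model's scoring call; the selection logic itself stays the same.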

BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design

Aug 28, 2025

Deepro Choudhury, Sinead Williamson, Adam Goliński, Ning Miao, Freddie Bickford Smith, Michael Kirchhof, Yizhe Zhang, Tom Rainforth

Abstract:We propose a general-purpose approach for improving the ability of Large Language Models (LLMs) to intelligently and adaptively gather information from a user or other external source using the framework of sequential Bayesian experimental design (BED). This enables LLMs to act as effective multi-turn conversational agents and interactively interface with external environments. Our approach, which we call BED-LLM (Bayesian Experimental Design with Large Language Models), is based on iteratively choosing questions or queries that maximize the expected information gain (EIG) about the task of interest given the responses gathered previously. We show how this EIG can be formulated in a principled way using a probabilistic model derived from the LLM's belief distribution and provide detailed insights into key decisions in its construction. Further key to the success of BED-LLM are a number of specific innovations, such as a carefully designed estimator for the EIG, not solely relying on in-context updates for conditioning on previous responses, and a targeted strategy for proposing candidate queries. We find that BED-LLM achieves substantial gains in performance across a wide range of tests based on the 20-questions game and using the LLM to actively infer user preferences, compared to direct prompting of the LLM and other adaptive design strategies.
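
For reference, the expected information gain used in sequential Bayesian experimental design is commonly written as below. The notation is an assumption made here for illustration ($\theta$ for the quantity of interest, $q$ for a candidate question, $y$ for its answer, $h$ for the history of earlier question-answer pairs, $H$ for Shannon entropy); it is the textbook definition, not the specific estimator the paper constructs.

```latex
\mathrm{EIG}(q)
  = \mathbb{E}_{p(y \mid q, h)}
    \Big[ H\big[p(\theta \mid h)\big] - H\big[p(\theta \mid h, q, y)\big] \Big],
\qquad
q^{*} = \arg\max_{q} \mathrm{EIG}(q)
```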

SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning

Aug 02, 2023

Ning Miao, Yee Whye Teh, Tom Rainforth

Abstract:The recent progress in large language models (LLMs), especially the invention of chain-of-thought (CoT) prompting, has made it possible to solve reasoning problems. However, even the strongest LLMs still struggle with more complicated problems that require non-linear thinking and multi-step reasoning. In this work, we explore whether LLMs can recognize their own errors without resorting to external resources. In particular, we investigate whether they can be used to identify individual errors within step-by-step reasoning. To this end, we propose a zero-shot verification scheme to recognize such errors. We then use this verification scheme to improve question-answering performance by using it to perform weighted voting on different generated answers. We test the method on three math datasets (GSM8K, MathQA, and MATH) and find that it successfully recognizes errors and, in turn, increases final predictive performance.
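
The weighted-voting step named in the abstract can be sketched as follows. This is an illustrative sketch under stated assumptions: each sampled solution is reduced to a final answer plus a confidence weight in [0, 1], and `weighted_vote` is a hypothetical helper, not the authors' code. How the confidence is derived from step-level checks is not shown.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def weighted_vote(samples: Iterable[Tuple[str, float]]) -> str:
    """Pick the answer whose samples carry the largest total confidence weight."""
    totals = defaultdict(float)
    for answer, confidence in samples:
        totals[answer] += confidence
    if not totals:
        raise ValueError("need at least one sample")
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Two low-confidence votes for "7" lose to one high-confidence vote for "9".
    print(weighted_vote([("7", 0.1), ("7", 0.1), ("9", 0.9)]))
```

Unlike plain majority voting, a single solution that passes verification with high confidence can outweigh several solutions that were flagged as likely wrong.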

Side Channel-Assisted Inference Leakage from Machine Learning-based ECG Classification

Apr 04, 2023

Jialin Liu, Ning Miao, Chongzhou Fang, Houman Homayoun, Han Wang

Abstract:The electrocardiogram (ECG) measures the electrical activity of the heart and is used to detect abnormal heartbeats and heart attacks. However, because abnormalities occur irregularly, heartbeats must be monitored continuously. Machine learning techniques are leveraged to automate this task and reduce the labor needed during monitoring. In recent years, many companies have launched products with ECG monitoring and irregular heartbeat alerts. Among classification algorithms, the time series-based algorithm dynamic time warping (DTW) is widely adopted for ECG classification. Despite this progress, DTW-based ECG classification also introduces a new attack vector that can leak patients' diagnosis results. This paper shows that the labels of ECG input samples can be stolen via a side-channel attack, Flush+Reload. In particular, we first identify the vulnerability of DTW for ECG classification, i.e., the correlation between warping path choice and prediction results. We then implement an attack that leverages Flush+Reload to monitor the warping path selection with known ECG data, and build a predictor that relates warping path selection to the labels of input ECG samples. Based on experiments, we find that the Flush+Reload-based inference leakage achieves an 84.0% attack success rate in identifying the labels of the two samples in DTW.
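
To make the phrase "warping path choice" concrete, here is a minimal textbook DTW with backtracking. It is a sketch for illustration only: `dtw_path` is a hypothetical name, the absolute-difference local cost is an assumption, and nothing here reproduces the paper's classifier or the attack. The point is that the branch taken at each backtracking step depends on the input data.

```python
from typing import List, Sequence, Tuple


def dtw_path(a: Sequence[float], b: Sequence[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the DTW distance between a and b and one optimal warping path."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end; which predecessor is chosen depends on the data.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # diagonal
            (cost[i - 1][j], i - 1, j),          # advance in a only
            (cost[i][j - 1], i, j - 1),          # advance in b only
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return cost[n][m], path


if __name__ == "__main__":
    print(dtw_path([1, 1, 2], [1, 2]))
```

Because different inputs drive different sequences of branch decisions, the control flow of such a routine is input-dependent, which is the kind of behavior a cache side channel can observe.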

Learning Instance-Specific Data Augmentations

May 31, 2022

Ning Miao, Emile Mathieu, Yann Dubois, Tom Rainforth, Yee Whye Teh, Adam Foster, Hyunjik Kim

Abstract:Existing data augmentation methods typically assume independence between transformations and inputs: they use the same transformation distribution for all input instances. We explain why this can be problematic and propose InstaAug, a method for automatically learning input-specific augmentations from data. This is achieved by introducing an augmentation module that maps an input to a distribution over transformations. This is simultaneously trained alongside the base model in a fully end-to-end manner using only the training data. We empirically demonstrate that InstaAug learns meaningful augmentations for a wide range of transformation classes, which in turn provides better performance on supervised and self-supervised tasks compared with augmentations that assume input-transformation independence.

InteL-VAEs: Adding Inductive Biases to Variational Auto-Encoders via Intermediary Latents

Jun 25, 2021

Ning Miao, Emile Mathieu, N. Siddharth, Yee Whye Teh, Tom Rainforth

Abstract:We introduce a simple and effective method for learning VAEs with controllable inductive biases by using an intermediary set of latent variables. This allows us to overcome the limitations of the standard Gaussian prior assumption. In particular, it allows us to impose desired properties like sparsity or clustering on learned representations, and incorporate prior information into the learned model. Our approach, which we refer to as the Intermediary Latent Space VAE (InteL-VAE), is based around controlling the stochasticity of the encoding process with the intermediary latent variables, before deterministically mapping them forward to our target latent representation, from which reconstruction is performed. This allows us to maintain all the advantages of the traditional VAE framework, while incorporating desired prior information, inductive biases, and even topological information through the latent mapping. We show that this, in turn, allows InteL-VAEs to learn both better generative models and representations.

Generating Fluent Adversarial Examples for Natural Languages

Jul 13, 2020

Huangzhao Zhang, Hao Zhou, Ning Miao, Lei Li

Abstract:Efficiently building an adversarial attacker for natural language processing (NLP) tasks is a real challenge. Firstly, as the sentence space is discrete, it is difficult to make small perturbations along the direction of gradients. Secondly, the fluency of the generated examples cannot be guaranteed. In this paper, we propose MHA, which addresses both problems by performing Metropolis-Hastings sampling, whose proposal is designed with the guidance of gradients. Experiments on IMDB and SNLI show that our proposed MHA outperforms the baseline model on attacking capability. Adversarial training with MHA also leads to better robustness and performance.

* Accepted by ACL 2019
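
The abstract above builds on Metropolis-Hastings sampling. As background, the generic acceptance rule can be sketched as below. This is the standard textbook rule, not the paper's target distribution or gradient-guided proposal; `mh_acceptance` and its parameter names are hypothetical.

```python
def mh_acceptance(p_current: float, p_proposed: float,
                  q_forward: float, q_backward: float) -> float:
    """Generic Metropolis-Hastings acceptance probability.

    p_current  : unnormalized target density at the current state x
    p_proposed : unnormalized target density at the proposed state x'
    q_forward  : proposal probability q(x' | x)
    q_backward : proposal probability q(x | x')
    Returns min(1, p(x') q(x | x') / (p(x) q(x' | x))).
    """
    if p_current <= 0 or q_forward <= 0:
        raise ValueError("current density and forward proposal must be positive")
    return min(1.0, (p_proposed * q_backward) / (p_current * q_forward))


if __name__ == "__main__":
    # Symmetric proposal, proposed state half as likely: accept with probability 0.5.
    print(mh_acceptance(0.2, 0.1, 0.5, 0.5))
```

A proposed edit is kept with this probability and otherwise rejected, so the chain favors states with higher target density while still exploring.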

Do You Have the Right Scissors? Tailoring Pre-trained Language Models via Monte-Carlo Methods

Jul 13, 2020

Ning Miao, Yuxuan Song, Hao Zhou, Lei Li

Abstract:It has been a common approach to pre-train a language model on a large corpus and fine-tune it on task-specific data. In practice, we observe that fine-tuning a pre-trained model on a small dataset may lead to over- and/or under-estimation problems. In this paper, we propose MC-Tailor, a novel method to alleviate this issue in text generation tasks by truncating and transferring the probability mass from over-estimated regions to under-estimated ones. Experiments on a variety of text generation datasets show that MC-Tailor consistently and significantly outperforms the fine-tuning approach. Our code is available at this url.

* Accepted by ACL 2020

Improving Maximum Likelihood Training for Text Generation with Density Ratio Estimation

Jul 12, 2020

Yuxuan Song, Ning Miao, Hao Zhou, Lantao Yu, Mingxuan Wang, Lei Li

Abstract:Auto-regressive sequence generative models trained by Maximum Likelihood Estimation suffer from the exposure bias problem in practical finite sample scenarios. The crux is that the number of training samples for Maximum Likelihood Estimation is usually limited and the input data distributions are different at training and inference stages. Many methods have been proposed to solve the above problem (Yu et al., 2017; Lu et al., 2018), which rely on sampling from the non-stationary model distribution and suffer from high variance or biased estimations. In this paper, we propose $\psi$-MLE, a new training scheme for auto-regressive sequence generative models, which is effective and stable when operating in the large sample space encountered in text generation. We derive our algorithm from a new perspective of self-augmentation and introduce bias correction with density ratio estimation. Extensive experimental results on synthetic data and real-world text generation tasks demonstrate that our method stably outperforms Maximum Likelihood Estimation and other state-of-the-art sequence generative models in terms of both quality and diversity.

* Accepted to International Conference on Artificial Intelligence and Statistics 2020

Kernelized Bayesian Softmax for Text Generation

Nov 01, 2019

Ning Miao, Hao Zhou, Chengqi Zhao, Wenxian Shi, Lei Li

Abstract:Neural models for text generation require a softmax layer with proper token embeddings during the decoding phase. Most existing approaches adopt a single point embedding for each token. However, a word may have multiple senses depending on context, some of which might be distinct. In this paper, we propose KerBS, a novel approach for learning better embeddings for text generation. KerBS embodies two advantages: (a) it employs a Bayesian composition of embeddings for words with multiple senses; (b) it is adaptive to semantic variances of words and robust to rare sentence contexts by imposing learned kernels to capture the closeness of words (senses) in the embedding space. Empirical studies show that KerBS significantly boosts the performance of several text generation tasks.
