
Xiang Ji


Sample Complexity of Preference-Based Nonparametric Off-Policy Evaluation with Deep Networks

Oct 16, 2023
Zihao Li, Xiang Ji, Minshuo Chen, Mengdi Wang

A recently popular approach to solving reinforcement learning problems is to learn from human preference data. In fact, human preference data are now used with classic reinforcement learning algorithms such as actor-critic methods, which involve evaluating an intermediate policy over a reward learned from human preference data under distribution shift, a problem known as off-policy evaluation (OPE). Such an algorithm includes (i) learning a reward function from a human preference dataset, and (ii) learning the expected cumulative reward of a target policy. Despite the huge empirical success, existing OPE methods with preference data often lack theoretical understanding and rely heavily on heuristics. In this paper, we study the sample efficiency of OPE with human preference data and establish a statistical guarantee for it. Specifically, we approach OPE by learning the value function via fitted-Q-evaluation with a deep neural network. By appropriately selecting the size of a ReLU network, we show that one can leverage any low-dimensional manifold structure in the Markov decision process and obtain a sample-efficient estimator without suffering from the curse of the data's high ambient dimensionality. Under the assumption of high reward smoothness, our results \textit{almost align with the classical OPE results with observable reward data}. To the best of our knowledge, this is the first result that establishes a \textit{provably efficient} guarantee for off-policy evaluation with RLHF.
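A minimal sketch of stage (i) of the pipeline described above: fitting a reward model on preference pairs with a Bradley-Terry likelihood, after which stage (ii) runs fitted-Q-evaluation with the learned reward in place of observed rewards. The network size, optimizer, and the Bradley-Terry preference model below are illustrative assumptions, not the paper's exact construction.

```python
import torch
import torch.nn as nn

def make_relu_net(in_dim, width=64):
    # Small ReLU network; the paper's analysis ties its size to the data's intrinsic dimension.
    return nn.Sequential(nn.Linear(in_dim, width), nn.ReLU(),
                         nn.Linear(width, width), nn.ReLU(),
                         nn.Linear(width, 1))

def fit_reward_from_preferences(sa_preferred, sa_rejected, epochs=200, lr=1e-3):
    """sa_preferred / sa_rejected: (N, d) tensors of preferred / rejected state-action features."""
    r_net = make_relu_net(sa_preferred.shape[1])
    opt = torch.optim.Adam(r_net.parameters(), lr=lr)
    for _ in range(epochs):
        # Bradley-Terry: maximize log sigmoid(r(preferred) - r(rejected)) over the pairs.
        margin = r_net(sa_preferred) - r_net(sa_rejected)
        loss = -nn.functional.logsigmoid(margin).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return r_net
```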

Towards Deep Learning Models Resistant to Transfer-based Adversarial Attacks via Data-centric Robust Learning

Oct 15, 2023
Yulong Yang, Chenhao Lin, Xiang Ji, Qiwei Tian, Qian Li, Hongshan Yang, Zhibo Wang, Chao Shen

Transfer-based adversarial attacks pose a severe threat to real-world deep learning systems since they do not require access to target models. Adversarial training (AT), which is recognized as the strongest defense against white-box attacks, also provides high robustness against (black-box) transfer-based attacks. However, AT suffers from heavy computational overhead since it optimizes the adversarial examples throughout the whole training process. In this paper, we demonstrate that such heavy optimization is not necessary for defending against transfer-based attacks. Instead, a one-shot adversarial augmentation prior to training is sufficient, and we name this new defense paradigm Data-centric Robust Learning (DRL). Our experimental results show that DRL outperforms widely-used AT techniques (e.g., PGD-AT, TRADES, EAT, and FAT) in terms of black-box robustness and even surpasses the top-1 defense on RobustBench when combined with diverse data augmentations and loss regularizations. We also identify other benefits of DRL, such as improved model generalization and robust fairness.

* 9 pages 
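A minimal sketch of the one-shot augmentation idea described above, contrasted with per-step adversarial training: adversarial examples are crafted once before training, and ordinary training then runs on the fixed augmented data. The PGD settings are illustrative, and whether the examples are crafted with a pretrained surrogate model is an assumption left open here.

```python
import torch
import torch.nn as nn

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    # Standard L_inf PGD, used here only once to build the augmented dataset.
    x_adv = x.clone().detach() + torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = nn.functional.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def data_centric_robust_learning(model, loader, epochs=10):
    # Step 1 (one-shot): craft adversarial examples ONCE, before training starts.
    augmented = [(pgd_attack(model, x, y), y) for x, y in loader]
    # Step 2: ordinary training on the fixed augmented data -- no inner attack loop per step,
    # which is where the computational saving over adversarial training comes from.
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    for _ in range(epochs):
        for x_adv, y in augmented:
            loss = nn.functional.cross_entropy(model(x_adv), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return model
```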

Sample Complexity of Neural Policy Mirror Descent for Policy Optimization on Low-Dimensional Manifolds

Sep 25, 2023
Zhenghao Xu, Xiang Ji, Minshuo Chen, Mengdi Wang, Tuo Zhao

Policy-based algorithms equipped with deep neural networks have achieved great success in solving high-dimensional policy optimization problems in reinforcement learning. However, current analyses cannot explain why they are resistant to the curse of dimensionality. In this work, we study the sample complexity of the neural policy mirror descent (NPMD) algorithm with convolutional neural networks (CNNs) as function approximators. Motivated by the empirical observation that many high-dimensional environments have state spaces possessing low-dimensional structures, such as those taking images as states, we consider the state space to be a $d$-dimensional manifold embedded in $D$-dimensional Euclidean space with intrinsic dimension $d\ll D$. We show that in each iteration of NPMD, both the value function and the policy can be well approximated by CNNs. The approximation errors are controlled by the size of the networks, and the smoothness of the networks from previous iterations can be inherited. As a result, by properly choosing the network size and hyperparameters, NPMD can find an $\epsilon$-optimal policy with $\widetilde{O}(\epsilon^{-\frac{d}{\alpha}-2})$ samples in expectation, where $\alpha\in(0,1]$ indicates the smoothness of the environment. Compared to previous work, our result shows that NPMD can leverage the low-dimensional structure of the state space to escape from the curse of dimensionality, providing an explanation for the efficacy of deep policy-based algorithms.
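A schematic view of a single policy mirror descent iteration in the softmax geometry, where the new policy satisfies pi_{k+1}(a|s) proportional to pi_k(a|s) * exp(eta * Q_k(s, a)). The paper fits the policy and value function with CNNs on manifold-structured states; the tabular arrays below are a simplifying assumption used only to show the update rule itself.

```python
import numpy as np

def npmd_iteration(pi, Q, eta=1.0):
    """pi: (S, A) current policy, Q: (S, A) estimated action values; returns pi_{k+1}."""
    logits = np.log(pi + 1e-12) + eta * Q            # mirror descent step in the softmax geometry
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability before exponentiating
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

# Toy usage: 3 states, 2 actions.
pi0 = np.full((3, 2), 0.5)
Q0 = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
pi1 = npmd_iteration(pi0, Q0, eta=2.0)
print(pi1)  # probability mass moves toward higher-Q actions
```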

Hard Adversarial Example Mining for Improving Robust Fairness

Aug 03, 2023
Chenhao Lin, Xiang Ji, Yulong Yang, Qian Li, Chao Shen, Run Wang, Liming Fang

Adversarial training (AT) is widely considered the state-of-the-art technique for improving the robustness of deep neural networks (DNNs) against adversarial examples (AEs). Nevertheless, recent studies have revealed that adversarially trained models are prone to unfairness problems, restricting their applicability. In this paper, we empirically observe that this limitation may be attributed to serious adversarial confidence overfitting, i.e., certain adversarial examples being predicted with overconfidence. To alleviate this problem, we propose HAM, a straightforward yet effective framework based on adaptive Hard Adversarial example Mining. HAM concentrates on mining hard adversarial examples while discarding the easy ones in an adaptive fashion. Specifically, HAM identifies hard AEs in terms of the step sizes needed to cross the decision boundary when calculating the loss value. Besides, an early-dropping mechanism is incorporated to discard the easy examples at the initial stages of AE generation, resulting in efficient AT. Extensive experimental results on CIFAR-10, SVHN, and Imagenette demonstrate that HAM achieves significant improvement in robust fairness while reducing computational cost compared to several state-of-the-art adversarial training methods. The code will be made publicly available.
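A minimal sketch of the early-dropping idea during adversarial example generation: examples that already cross the decision boundary within the first few PGD steps are treated as "easy" and dropped, so later attack steps and the training loss concentrate on hard examples. The threshold and drop criterion below are illustrative assumptions, not the paper's exact rule.

```python
import torch
import torch.nn as nn

def pgd_with_early_dropping(model, x, y, eps=8/255, alpha=2/255, steps=10, drop_after=3):
    x_adv = x.clone().detach()
    keep = torch.ones(x.shape[0], dtype=torch.bool, device=x.device)
    for step in range(steps):
        x_adv.requires_grad_(True)
        logits = model(x_adv)
        loss = nn.functional.cross_entropy(logits, y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
        if step == drop_after:
            # "Easy" examples are already misclassified after only a few attack steps.
            crossed = logits.detach().argmax(dim=1) != y
            keep &= ~crossed
    # Return only the hard adversarial examples (and their labels) for the training step.
    return x_adv.detach()[keep], y[keep]
```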

Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems

Jul 24, 2023
Xiang Ji, Huazheng Wang, Minshuo Chen, Tuo Zhao, Mengdi Wang

A crucial task in decision-making problems is reward engineering. It is common in practice that no obvious choice of reward function exists. Thus, a popular approach is to introduce human feedback during training and leverage such feedback to learn a reward function. Among all policy learning methods that use human feedback, preference-based methods have demonstrated substantial success in recent empirical applications such as InstructGPT. In this work, we develop a theory that provably shows the benefits of preference-based methods in offline contextual bandits. In particular, we improve the modeling and suboptimality analysis for policy learning methods that run directly on human-scored samples. We then compare this guarantee with those of preference-based methods and show that preference-based methods enjoy lower suboptimality.

Cross-modal Place Recognition in Image Databases using Event-based Sensors

Jul 03, 2023
Xiang Ji, Jiaxin Wei, Yifu Wang, Huiliang Shang, Laurent Kneip

Visual place recognition (VPR) is an important problem for global localization in many robotics tasks. One of the biggest challenges is that it may suffer from illumination or appearance changes in the surrounding environment. Event cameras are interesting alternatives to frame-based sensors, as their high dynamic range enables robust perception under difficult illumination conditions. However, current event-based place recognition methods rely only on event information, which restricts downstream applications of VPR. In this paper, we present the first cross-modal visual place recognition framework that is capable of retrieving regular images from a database given an event query. Our method demonstrates promising results with respect to state-of-the-art frame-based and event-based methods on the Brisbane-Event-VPR dataset under different scenarios. We also verify the effectiveness of combining retrieval and classification, which can boost performance by a large margin.
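A schematic of the cross-modal retrieval step: an event-stream query and the database frames are mapped into a shared embedding space, and database places are ranked by cosine similarity. The encoders are placeholders (assumed, not the paper's architecture); only the retrieval logic is shown.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def retrieve(query_embedding, database_embeddings, top_k=5):
    """query_embedding: (d,), database_embeddings: (N, d); returns indices of top-k matches."""
    q = l2_normalize(query_embedding)
    db = l2_normalize(database_embeddings)
    scores = db @ q                       # cosine similarity, since both sides are normalized
    return np.argsort(-scores)[:top_k], scores

# Toy usage with random embeddings standing in for the event/frame encoders.
rng = np.random.default_rng(0)
db = rng.normal(size=(100, 128))              # embeddings of database frames
query = db[42] + 0.1 * rng.normal(size=128)   # an event query taken near place #42
idx, _ = retrieve(query, db)
print(idx)  # place 42 should appear among the top candidates
```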

Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-In Time

May 24, 2023
Xiang Ji, Gen Li

A crucial problem in reinforcement learning is learning the optimal policy. We study this in tabular infinite-horizon discounted Markov decision processes in the online setting. Existing algorithms either fail to achieve regret optimality or have to incur high memory and computational costs. In addition, existing optimal algorithms all require a long burn-in time to achieve optimal sample efficiency, i.e., their optimality is not guaranteed unless the sample size surpasses a high threshold. We address both open problems by introducing a model-free algorithm that employs variance reduction and a novel technique that switches the execution policy in a slow-yet-adaptive manner. This is the first regret-optimal model-free algorithm in the discounted setting, with the additional benefit of a low burn-in time.
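For context, a bare-bones model-free Q-learning loop for an online discounted tabular MDP. This is only the baseline template the abstract improves on: the paper's variance-reduction estimator and slow-yet-adaptive policy switching are not implemented here, and the environment interface is an assumption.

```python
import numpy as np

def q_learning(env, num_steps, gamma=0.99, epsilon=0.1, seed=0):
    """env: object with reset() -> s, step(a) -> (s_next, r), and sizes env.nS, env.nA (assumed)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.nS, env.nA))
    counts = np.zeros((env.nS, env.nA))
    s = env.reset()
    for _ in range(num_steps):
        # Epsilon-greedy execution policy; the paper's algorithm switches its policy far less often.
        a = rng.integers(env.nA) if rng.random() < epsilon else int(Q[s].argmax())
        s_next, r = env.step(a)
        counts[s, a] += 1
        lr = 1.0 / counts[s, a]                    # simple 1/n step size
        target = r + gamma * Q[s_next].max()
        Q[s, a] += lr * (target - Q[s, a])         # model-free one-sample Bellman update
        s = s_next
    return Q
```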

Blur Interpolation Transformer for Real-World Motion from Blur

Nov 21, 2022
Zhihang Zhong, Mingdeng Cao, Xiang Ji, Yinqiang Zheng, Imari Sato

This paper studies the challenging problem of recovering motion from blur, also known as joint deblurring and interpolation or blur temporal super-resolution. The remaining challenges are twofold: 1) current methods still leave considerable room for improvement in visual quality, even on synthetic data, and 2) they generalize poorly to real-world data. To this end, we propose a blur interpolation transformer (BiT) to effectively unravel the underlying temporal correlation encoded in blur. Based on multi-scale residual Swin transformer blocks, we introduce dual-end temporal supervision and temporally symmetric ensembling strategies to generate effective features for time-varying motion rendering. In addition, we design a hybrid camera system to collect the first real-world dataset of one-to-many blur-sharp video pairs. Experimental results show that BiT achieves a significant gain over state-of-the-art methods on the public dataset Adobe240. Moreover, the proposed real-world dataset effectively helps the model generalize to real blurry scenarios.

Speeding Up Question Answering Task of Language Models via Inverted Index

Oct 24, 2022
Xiang Ji, Yesim Sungu-Eryilmaz, Elaheh Momeni, Reza Rawassizadeh

Natural language processing applications, such as conversational agents and their question-answering capabilities, are widely used in the real world. Despite the wide popularity of large language models (LLMs), few real-world conversational agents take advantage of them. The extensive resources consumed by LLMs prevent developers from integrating them into end-user applications. In this study, we leverage an inverted indexing mechanism combined with LLMs to improve the efficiency of question-answering models for closed-domain questions. Our experiments show that using the index improves the average response time by 97.44%. In addition, due to the reduced search scope, the average BLEU score improves by 0.23 when using the inverted index.
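A minimal sketch of the inverted-index idea: map tokens to the documents containing them, so a question is matched only against the few candidate documents it shares tokens with before the slower language model answers over them. The tokenization and scoring below are simplistic assumptions, not the paper's pipeline.

```python
from collections import defaultdict

def build_inverted_index(documents):
    # token -> set of document ids containing that token
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def candidate_documents(question, index, top_k=3):
    # Score documents by how many question tokens they contain.
    votes = defaultdict(int)
    for token in question.lower().split():
        for doc_id in index.get(token, ()):
            votes[doc_id] += 1
    return sorted(votes, key=votes.get, reverse=True)[:top_k]

docs = [
    "the library opens at nine in the morning",
    "parking permits are issued by the facilities office",
    "the cafeteria serves lunch from noon to two",
]
idx = build_inverted_index(docs)
print(candidate_documents("when does the library open", idx))
# -> e.g. [0, ...]; only these candidates would be passed on to the QA model.
```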

Sample Complexity of Nonparametric Off-Policy Evaluation on Low-Dimensional Manifolds using Deep Networks

Jun 06, 2022
Xiang Ji, Minshuo Chen, Mengdi Wang, Tuo Zhao

We consider the off-policy evaluation problem of reinforcement learning using deep neural networks. We analyze the deep fitted Q-evaluation method for estimating the expected cumulative reward of a target policy, when the data are generated from an unknown behavior policy. We show that, by choosing network size appropriately, one can leverage the low-dimensional manifold structure in the Markov decision process and obtain a sample-efficient estimator without suffering from the curse of high representation dimensionality. Specifically, we establish a sharp error bound for the fitted Q-evaluation that depends on the intrinsic low dimension, the smoothness of the state-action space, and a function class-restricted $\chi^2$-divergence. It is noteworthy that the restricted $\chi^2$-divergence measures the behavior and target policies' {\it mismatch in the function space}, which can be small even if the two policies are not close to each other in their tabular forms. Numerical experiments are provided to support our theoretical analysis.
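A minimal sketch of fitted Q-evaluation with a neural network, as analyzed above: repeatedly regress a Q-network onto frozen Bellman targets built from the logged transitions and the target policy's next actions. The network size, optimizer, and iteration counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

def fitted_q_evaluation(s, a, r, s_next, a_next_target, gamma=0.99,
                        outer_iters=50, inner_steps=100, lr=1e-3):
    """Inputs are tensors from the behavior-policy dataset; actions are assumed 2-D (N, action_dim)
    and a_next_target is drawn from the target policy at s_next. Returns a Q-network; the policy
    value is then estimated as the average of Q(s0, pi(s0)) over initial states."""
    r = r.reshape(-1, 1)
    sa = torch.cat([s, a], dim=1)
    sa_next = torch.cat([s_next, a_next_target], dim=1)
    q_net = nn.Sequential(nn.Linear(sa.shape[1], 64), nn.ReLU(),
                          nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    for _ in range(outer_iters):
        with torch.no_grad():
            target = r + gamma * q_net(sa_next)        # Bellman target under the target policy
        for _ in range(inner_steps):                   # least-squares regression onto the frozen target
            loss = ((q_net(sa) - target) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
    return q_net
```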
