Get our free extension to see links to code for papers anywhere online!Free extension: code links for papers anywhere!Free add-on: See code for papers anywhere!

Jiawei Ge, Shange Tang, Jianqing Fan, Cong Ma, Chi Jin

A key challenge of modern machine learning systems is to achieve Out-of-Distribution (OOD) generalization -- generalizing to target data whose distribution differs from that of source data. Despite its significant importance, the fundamental question of ``what are the most effective algorithms for OOD generalization'' remains open even under the standard setting of covariate shift. This paper addresses this fundamental question by proving that, surprisingly, classical Maximum Likelihood Estimation (MLE) purely using source data (without any modification) achieves the minimax optimality for covariate shift under the well-specified setting. That is, no algorithm performs better than MLE in this setting (up to a constant factor), justifying MLE is all you need. Our result holds for a very rich class of parametric models, and does not require any boundedness condition on the density ratio. We illustrate the wide applicability of our framework by instantiating it to three concrete examples -- linear regression, logistic regression, and phase retrieval. This paper further complement the study by proving that, under the misspecified setting, MLE is no longer the optimal choice, whereas Maximum Weighted Likelihood Estimator (MWLE) emerges as minimax optimal in certain scenarios.

Via

Jianqing Fan, Zhaoran Wang, Zhuoran Yang, Chenlu Ye

We study high-dimensional multi-armed contextual bandits with batched feedback where the $T$ steps of online interactions are divided into $L$ batches. In specific, each batch collects data according to a policy that depends on previous batches and the rewards are revealed only at the end of the batch. Such a feedback structure is popular in applications such as personalized medicine and online advertisement, where the online data often do not arrive in a fully serial manner. We consider high-dimensional and linear settings where the reward function of the bandit model admits either a sparse or low-rank structure and ask how small a number of batches are needed for a comparable performance with fully dynamic data in which $L = T$. For these settings, we design a provably sample-efficient algorithm which achieves a $ \mathcal{\tilde O}(s_0^2 \log^2 T)$ regret in the sparse case and $ \mathcal{\tilde O} ( r ^2 \log^2 T)$ regret in the low-rank case, using only $L = \mathcal{O}( \log T)$ batches. Here $s_0$ and $r$ are the sparsity and rank of the reward parameter in sparse and low-rank cases, respectively, and $ \mathcal{\tilde O}(\cdot)$ omits logarithmic factors involving the feature dimensions. In other words, our algorithm achieves regret bounds comparable to those in fully sequential setting with only $\mathcal{O}( \log T)$ batches. Our algorithm features a novel batch allocation method that adjusts the batch sizes according to the estimation accuracy within each batch and cumulative regret. Furthermore, we also conduct experiments with synthetic and real-world data to validate our theory.

Via

Jianqing Fan, Cheng Gao, Jason M. Klusowski

This paper addresses challenges in robust transfer learning stemming from ambiguity in Bayes classifiers and weak transferable signals between the target and source distribution. We introduce a novel quantity called the ''ambiguity level'' that measures the discrepancy between the target and source regression functions, propose a simple transfer learning procedure, and establish a general theorem that shows how this new quantity is related to the transferability of learning in terms of risk improvements. Our proposed ''Transfer Around Boundary'' (TAB) model, with a threshold balancing the performance of target and source data, is shown to be both efficient and robust, improving classification while avoiding negative transfer. Moreover, we demonstrate the effectiveness of the TAB model on non-parametric classification and logistic regression tasks, achieving upper bounds which are optimal up to logarithmic factors. Simulation studies lend further support to the effectiveness of TAB. We also provide simple approaches to bound the excess misclassification error without the need for specialized knowledge in transfer learning.

Via

Sohom Bhattacharya, Jianqing Fan, Jikai Hou

Network data is prevalent in numerous big data applications including economics and health networks where it is of prime importance to understand the latent structure of network. In this paper, we model the network using the Degree-Corrected Mixed Membership (DCMM) model. In DCMM model, for each node $i$, there exists a membership vector $\boldsymbol{\pi}_ i = (\boldsymbol{\pi}_i(1), \boldsymbol{\pi}_i(2),\ldots, \boldsymbol{\pi}_i(K))$, where $\boldsymbol{\pi}_i(k)$ denotes the weight that node $i$ puts in community $k$. We derive novel finite-sample expansion for the $\boldsymbol{\pi}_i(k)$s which allows us to obtain asymptotic distributions and confidence interval of the membership mixing probabilities and other related population quantities. This fills an important gap on uncertainty quantification on the membership profile. We further develop a ranking scheme of the vertices based on the membership mixing probabilities on certain communities and perform relevant statistical inferences. A multiplier bootstrap method is proposed for ranking inference of individual member's profile with respect to a given community. The validity of our theoretical results is further demonstrated by via numerical experiments in both real and synthetic data examples.

Via

Jianqing Fan, Zhipeng Lou, Weichen Wang, Mengxin Yu

This paper studies the performance of the spectral method in the estimation and uncertainty quantification of the unobserved preference scores of compared entities in a very general and more realistic setup in which the comparison graph consists of hyper-edges of possible heterogeneous sizes and the number of comparisons can be as low as one for a given hyper-edge. Such a setting is pervasive in real applications, circumventing the need to specify the graph randomness and the restrictive homogeneous sampling assumption imposed in the commonly-used Bradley-Terry-Luce (BTL) or Plackett-Luce (PL) models. Furthermore, in the scenarios when the BTL or PL models are appropriate, we unravel the relationship between the spectral estimator and the Maximum Likelihood Estimator (MLE). We discover that a two-step spectral method, where we apply the optimal weighting estimated from the equal weighting vanilla spectral method, can achieve the same asymptotic efficiency as the MLE. Given the asymptotic distributions of the estimated preference scores, we also introduce a comprehensive framework to carry out both one-sample and two-sample ranking inferences, applicable to both fixed and random graph settings. It is noteworthy that it is the first time effective two-sample rank testing methods are proposed. Finally, we substantiate our findings via comprehensive numerical simulations and subsequently apply our developed methodologies to perform statistical inferences on statistics journals and movie rankings.

Via

Jianqing Fan, Jiawei Ge, Debarghya Mukherjee

Uncertainty quantification for prediction is an intriguing problem with significant applications in various fields, such as biomedical science, economic studies, and weather forecasts. Numerous methods are available for constructing prediction intervals, such as quantile regression and conformal predictions, among others. Nevertheless, model misspecification (especially in high-dimension) or sub-optimal constructions can frequently result in biased or unnecessarily-wide prediction intervals. In this paper, we propose a novel and widely applicable technique for aggregating multiple prediction intervals to minimize the average width of the prediction band along with coverage guarantee, called Universally Trainable Optimal Predictive Intervals Aggregation (UTOPIA). The method also allows us to directly construct predictive bands based on elementary basis functions. Our approach is based on linear or convex programming which is easy to implement. All of our proposed methodologies are supported by theoretical guarantees on the coverage probability and optimal average length, which are detailed in this paper. The effectiveness of our approach is convincingly demonstrated by applying it to synthetic data and two real datasets on finance and macroeconomics.

Via

Yuling Yan, Weijie J. Su, Jianqing Fan

In 2023, the International Conference on Machine Learning (ICML) required authors with multiple submissions to rank their submissions based on perceived quality. In this paper, we aim to employ these author-specified rankings to enhance peer review in machine learning and artificial intelligence conferences by extending the Isotonic Mechanism (Su, 2021, 2022) to exponential family distributions. This mechanism generates adjusted scores closely align with the original scores while adhering to author-specified rankings. Despite its applicability to a broad spectrum of exponential family distributions, this mechanism's implementation does not necessitate knowledge of the specific distribution form. We demonstrate that an author is incentivized to provide accurate rankings when her utility takes the form of a convex additive function of the adjusted review scores. For a certain subclass of exponential family distributions, we prove that the author reports truthfully only if the question involves only pairwise comparisons between her submissions, thus indicating the optimality of ranking in truthful information elicitation. Lastly, we show that the adjusted scores improve dramatically the accuracy of the original scores and achieve nearly minimax optimality for estimating the true scores with statistical consistecy when true scores have bounded total variation.

Via

Gen Li, Yuling Yan, Yuxin Chen, Jianqing Fan

This paper studies reward-agnostic exploration in reinforcement learning (RL) -- a scenario where the learner is unware of the reward functions during the exploration stage -- and designs an algorithm that improves over the state of the art. More precisely, consider a finite-horizon non-stationary Markov decision process with $S$ states, $A$ actions, and horizon length $H$, and suppose that there are no more than a polynomial number of given reward functions of interest. By collecting an order of \begin{align*} \frac{SAH^3}{\varepsilon^2} \text{ sample episodes (up to log factor)} \end{align*} without guidance of the reward information, our algorithm is able to find $\varepsilon$-optimal policies for all these reward functions, provided that $\varepsilon$ is sufficiently small. This forms the first reward-agnostic exploration scheme in this context that achieves provable minimax optimality. Furthermore, once the sample size exceeds $\frac{S^2AH^3}{\varepsilon^2}$ episodes (up to log factor), our algorithm is able to yield $\varepsilon$ accuracy for arbitrarily many reward functions (even when they are adversarially designed), a task commonly dubbed as ``reward-free exploration.'' The novelty of our algorithm design draws on insights from offline RL: the exploration scheme attempts to maximize a critical reward-agnostic quantity that dictates the performance of offline RL, while the policy learning paradigm leverages ideas from sample-optimal offline RL paradigms.

Via