Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Jiawei Ge, Shange Tang, Jianqing Fan, Cong Ma, Chi Jin

A key challenge of modern machine learning systems is to achieve Out-of-Distribution (OOD) generalization -- generalizing to target data whose distribution differs from that of source data. Despite its significant importance, the fundamental question of ``what are the most effective algorithms for OOD generalization'' remains open even under the standard setting of covariate shift. This paper addresses this fundamental question by proving that, surprisingly, classical Maximum Likelihood Estimation (MLE) purely using source data (without any modification) achieves the minimax optimality for covariate shift under the well-specified setting. That is, no algorithm performs better than MLE in this setting (up to a constant factor), justifying MLE is all you need. Our result holds for a very rich class of parametric models, and does not require any boundedness condition on the density ratio. We illustrate the wide applicability of our framework by instantiating it to three concrete examples -- linear regression, logistic regression, and phase retrieval. This paper further complement the study by proving that, under the misspecified setting, MLE is no longer the optimal choice, whereas Maximum Weighted Likelihood Estimator (MWLE) emerges as minimax optimal in certain scenarios.

Via

Jiawei Ge, Shange Tang, Jianqing Fan, Chi Jin

Unsupervised pretraining, which learns a useful representation using a large amount of unlabeled data to facilitate the learning of downstream tasks, is a critical component of modern large-scale machine learning systems. Despite its tremendous empirical success, the rigorous theoretical understanding of why unsupervised pretraining generally helps remains rather limited -- most existing results are restricted to particular methods or approaches for unsupervised pretraining with specialized structural assumptions. This paper studies a generic framework, where the unsupervised representation learning task is specified by an abstract class of latent variable models $\Phi$ and the downstream task is specified by a class of prediction functions $\Psi$. We consider a natural approach of using Maximum Likelihood Estimation (MLE) for unsupervised pretraining and Empirical Risk Minimization (ERM) for learning downstream tasks. We prove that, under a mild ''informative'' condition, our algorithm achieves an excess risk of $\tilde{\mathcal{O}}(\sqrt{\mathcal{C}_\Phi/m} + \sqrt{\mathcal{C}_\Psi/n})$ for downstream tasks, where $\mathcal{C}_\Phi, \mathcal{C}_\Psi$ are complexity measures of function classes $\Phi, \Psi$, and $m, n$ are the number of unlabeled and labeled data respectively. Comparing to the baseline of $\tilde{\mathcal{O}}(\sqrt{\mathcal{C}_{\Phi \circ \Psi}/n})$ achieved by performing supervised learning using only the labeled data, our result rigorously shows the benefit of unsupervised pretraining when $m \gg n$ and $\mathcal{C}_{\Phi\circ \Psi} > \mathcal{C}_\Psi$. This paper further shows that our generic framework covers a wide range of approaches for unsupervised pretraining, including factor models, Gaussian mixture models, and contrastive learning.

Via

Yuzheng Hu, Tianle Cai, Jinyong Shan, Shange Tang, Chaochao Cai, Ethan Song, Bo Li, Dawn Song

We consider vertical logistic regression (VLR) trained with mini-batch gradient descent -- a setting which has attracted growing interest among industries and proven to be useful in a wide range of applications including finance and medical research. We provide a comprehensive and rigorous privacy analysis of VLR in a class of open-source Federated Learning frameworks, where the protocols might differ between one another, yet a procedure of obtaining local gradients is implicitly shared. We first consider the honest-but-curious threat model, in which the detailed implementation of protocol is neglected and only the shared procedure is assumed, which we abstract as an oracle. We find that even under this general setting, single-dimension feature and label can still be recovered from the other party under suitable constraints of batch size, thus demonstrating the potential vulnerability of all frameworks following the same philosophy. Then we look into a popular instantiation of the protocol based on Homomorphic Encryption (HE). We propose an active attack that significantly weaken the constraints on batch size in the previous analysis via generating and compressing auxiliary ciphertext. To address the privacy leakage within the HE-based protocol, we develop a simple-yet-effective countermeasure based on Differential Privacy (DP), and provide both utility and privacy guarantees for the updated algorithm. Finally, we empirically verify the effectiveness of our attack and defense on benchmark datasets. Altogether, our findings suggest that all vertical federated learning frameworks that solely depend on HE might contain severe privacy risks, and DP, which has already demonstrated its power in horizontal federated learning, can also play a crucial role in the vertical setting, especially when coupled with HE or secure multi-party computation (MPC) techniques.

Via

Yuzheng Hu, Licong Lin, Shange Tang

In this paper, we try to uncover the second-order essence of several first-order optimization methods. For Nesterov Accelerated Gradient, we rigorously prove that the algorithm makes use of the difference between past and current gradients, thus approximates the Hessian and accelerates the training. For adaptive methods, we related Adam and Adagrad to a powerful technique in computation statistics---Natural Gradient Descent. These adaptive methods can in fact be treated as relaxations of NGD with only a slight difference lying in the square root of the denominator in the update rules. Skeptical about the effect of such difference, we design a new algorithm---AdaSqrt, which removes the square root in the denominator and scales the learning rate by sqrt(T). Surprisingly, our new algorithm is comparable to various first-order methods(such as SGD and Adam) on MNIST and even beats Adam on CIFAR-10! This phenomenon casts doubt on the convention view that the square root is crucial and training without it will lead to terrible performance. As far as we have concerned, so long as the algorithm tries to explore second or even higher information of the loss surface, then proper scaling of the learning rate alone will guarantee fast training and good generalization performance. To the best of our knowledge, this is the first paper that seriously considers the necessity of square root among all adaptive methods. We believe that our work can shed light on the importance of higher-order information and inspire the design of more powerful algorithms in the future.

Via