To cope with high annotation costs, training a classifier only from weakly supervised data has attracted a great deal of attention these days. Among various approaches, strengthening supervision from completely unsupervised classification is a promising direction, which typically employs class priors as the only supervision and trains a binary classifier from unlabeled (U) datasets. While existing risk-consistent methods are theoretically grounded with high flexibility, they can learn only from two U sets. In this paper, we propose a new approach for binary classification from m U-sets for $m\ge2$. Our key idea is to consider an auxiliary classification task called surrogate set classification (SSC), which is aimed at predicting from which U set each observed data is drawn. SSC can be solved by a standard (multi-class) classification method, and we use the SSC solution to obtain the final binary classifier through a certain linear-fractional transformation. We built our method in a flexible and efficient end-to-end deep learning framework and prove it to be classifier-consistent. Through experiments, we demonstrate the superiority of our proposed method over state-of-the-art methods.
Weight decay is a popular regularization technique for training of deep neural networks. Modern deep learning libraries mainly use $L_{2}$ regularization as the default implementation of weight decay. \citet{loshchilov2018decoupled} demonstrated that $L_{2}$ regularization is not identical to weight decay for adaptive gradient methods, such as Adaptive Momentum Estimation (Adam), and proposed Adam with Decoupled Weight Decay (AdamW). However, we found that the popular implementations of weight decay, including $L_{2}$ regularization and decoupled weight decay, in modern deep learning libraries usually damage performance. First, the $L_{2}$ regularization is unstable weight decay for all optimizers that use Momentum, such as stochastic gradient descent (SGD). Second, decoupled weight decay is highly unstable for all adaptive gradient methods. We further propose the Stable Weight Decay (SWD) method to fix the unstable weight decay problem from a dynamical perspective. The proposed SWD method makes significant improvements over $L_{2}$ regularization and decoupled weight decay in our experiments. Simply fixing weight decay in Adam by SWD, with no extra hyperparameter, can usually outperform complex Adam variants, which have more hyperparameters.
Deep learning is often criticized by two serious issues which rarely exist in natural nervous systems: overfitting and catastrophic forgetting. It can even memorize randomly labelled data, which has little knowledge behind the instance-label pairs. When a deep network continually learns over time by accommodating new tasks, it usually quickly overwrites the knowledge learned from previous tasks. Referred to as the neural variability, it is well-known in neuroscience that human brain reactions exhibit substantial variability even in response to the same stimulus. This mechanism balances accuracy and plasticity/flexibility in the motor learning of natural nervous systems. Thus it motivates us to design a similar mechanism named artificial neural variability (ANV), which helps artificial neural networks learn some advantages from "natural" neural networks. We rigorously prove that ANV plays as an implicit regularizer of the mutual information between the training data and the learned model. This result theoretically guarantees ANV a strictly improved generalizability, robustness to label noise, and robustness to catastrophic forgetting. We then devise a neural variable risk minimization (NVRM) framework and neural variable optimizers to achieve ANV for conventional network architectures in practice. The empirical studies demonstrate that NVRM can effectively relieve overfitting, label noise memorization, and catastrophic forgetting at negligible costs.
Labeling data is an unavoidable pre-processing procedure for most machine learning tasks. However, it takes a considerable amount of time, money, and labor to collect accurate \textit{explicit class labels} for a large dataset. A positivity comparison oracle has been adapted to relieve this burden, where two data points are received as input and the oracle answers which one is more likely to be positive. However, when information about the classification threshold is lacking, this oracle alone can at most rank all data points on the basis of their relative positivity; thus, it still needs to access explicit class labels. In order to harness pairwise comparisons in a more effective way, we propose an \textit{ambiguity comparison oracle}. This oracle also receives two data points as input, and it answers which one is more ambiguous, or more difficult to assign a label to. We then propose an efficient adaptive labeling algorithm that can actively query \textit{only pairwise comparison oracles} without accessing the explicit labeling oracle. We also address the situation where the labeling budget is insufficient compared to the dataset size, which can be dealt with by plugging the proposed algorithm into an active learning algorithm. Furthermore, we confirm the feasibility of the proposed oracle and the performance of the proposed labeling algorithms theoretically and empirically.
Adaptive Momentum Estimation (Adam), which combines Adaptive Learning Rate and Momentum, is the most popular stochastic optimizer for accelerating training of deep neural networks. But Adam often generalizes significantly worse than Stochastic Gradient Descent (SGD). It is still mathematically unclear how Adaptive Learning Rate and Momentum affect saddle-point escaping and minima selection. Based on the diffusion theoretical framework, we separate the effects of Adaptive Learning Rate and Momentum on saddle-point escaping and minima selection. We find that SGD escapes saddle points very slowly along the directions of small-magnitude eigenvalues of the Hessian. We prove that Adaptive Learning Rate can make learning dynamics near saddle points approximately Hessian-independent, but cannot select flat minima as SGD does. In contrast, Momentum provides a momentum drift effect to help passing through saddle points, and almost does not affect flat minima selection. This mathematically explains why SGD (with Momentum) generalizes better, while Adam generalizes worse but converges faster. Motivated by the diffusion theoretical analysis, we design a novel adaptive optimizer named Adaptive Inertia Estimation (Adai), which uses parameter-wise adaptive inertia to accelerate training and provably favors flat minima as much as SGD. Our real-world experiments demonstrate that Adai can converge similarly fast to Adam, but generalize significantly better. Adai even generalizes better than SGD, when converging fast to Adam is not required. The source is available to the public: \url{https://github.com/zeke-xie/adaptive-inertia-adai}.
Label disagreement between human experts is a common issue in the medical domain and poses unique challenges in the evaluation and learning of classification models. In this work, we extend metrics for probability prediction, including calibration, i.e., the reliability of predictive probability, to adapt to such a situation. We further formalize the metrics for higher-order statistics, including inter-rater disagreement, in a unified way, which enables us to assess the quality of distributional uncertainty. In addition, we propose a novel post-hoc calibration method that equips trained neural networks with calibrated distributions over class probability estimates. With a large-scale medical imaging application, we show that our approach significantly improves the quality of uncertainty estimates in multiple metrics.