We study online classification in the presence of noisy labels. The noise mechanism is modeled by a general kernel that specifies, for any feature-label pair, a (known) set of distributions over noisy labels. At each time step, an adversary selects an unknown distribution from the distribution set specified by the kernel based on the actual feature-label pair, and generates the noisy label from the selected distribution. The learner then makes a prediction based on the actual features and noisy labels observed thus far, and incurs loss $1$ if the prediction differs from the underlying truth (and $0$ otherwise). The prediction quality is quantified through minimax risk, which computes the cumulative loss over a finite horizon $T$. We show that for a wide range of natural noise kernels, adversarially selected features, and finite class of labeling functions, minimax risk can be upper bounded independent of the time horizon and logarithmic in the size of labeling function class. We then extend these results to inifinite classes and stochastically generated features via the concept of stochastic sequential covering. Our results extend and encompass findings of Ben-David et al. (2009) through substantial generality, and provide intuitive understanding through a novel reduction to online conditional distribution estimation.
We study the problem of online learning and online regret minimization when samples are drawn from a general unknown non-stationary process. We introduce the concept of a dynamic changing process with cost $K$, where the conditional marginals of the process can vary arbitrarily, but that the number of different conditional marginals is bounded by $K$ over $T$ rounds. For such processes we prove a tight (upto $\sqrt{\log T}$ factor) bound $O(\sqrt{KT\cdot\mathsf{VC}(\mathcal{H})\log T})$ for the expected worst case regret of any finite VC-dimensional class $\mathcal{H}$ under absolute loss (i.e., the expected miss-classification loss). We then improve this bound for general mixable losses, by establishing a tight (up to $\log^3 T$ factor) regret bound $O(K\cdot\mathsf{VC}(\mathcal{H})\log^3 T)$. We extend these results to general smooth adversary processes with unknown reference measure by showing a sub-linear regret bound for $1$-dimensional threshold functions under a general bounded convex loss. Our results can be viewed as a first step towards regret analysis with non-stationary samples in the distribution blind (universal) regime. This also brings a new viewpoint that shifts the study of complexity of the hypothesis classes to the study of the complexity of processes generating data.
We study the problem of sequential prediction and online minimax regret with stochastically generated features under a general loss function. We introduce a notion of expected worst case minimax regret that generalizes and encompasses prior known minimax regrets. For such minimax regrets we establish tight upper bounds via a novel concept of stochastic global sequential covering. We show that for a hypothesis class of VC-dimension $\mathsf{VC}$ and $i.i.d.$ generated features of length $T$, the cardinality of the stochastic global sequential covering can be upper bounded with high probability (whp) by $e^{O(\mathsf{VC} \cdot \log^2 T)}$. We then improve this bound by introducing a new complexity measure called the Star-Littlestone dimension, and show that classes with Star-Littlestone dimension $\mathsf{SL}$ admit a stochastic global sequential covering of order $e^{O(\mathsf{SL} \cdot \log T)}$. We further establish upper bounds for real valued classes with finite fat-shattering numbers. Finally, by applying information-theoretic tools of the fixed design minimax regrets, we provide lower bounds for the expected worst case minimax regret. We demonstrate the effectiveness of our approach by establishing tight bounds on the expected worst case minimax regrets for logarithmic loss and general mixable losses.
We study the sequential general online regression, known also as the sequential probability assignments, under logarithmic loss when compared against a broad class of experts. We focus on obtaining tight, often matching, lower and upper bounds for the sequential minimax regret that are defined as the excess loss it incurs over a class of experts. After proving a general upper bound, we consider some specific classes of experts from Lipschitz class to bounded Hessian class and derive matching lower and upper bounds with provably optimal constants. Our bounds work for a wide range of values of the data dimension and the number of rounds. To derive lower bounds, we use tools from information theory (e.g., Shtarkov sum) and for upper bounds, we resort to new "smooth truncated covering" of the class of experts. This allows us to find constructive proofs by applying a simple and novel truncated Bayesian algorithm. Our proofs are substantially simpler than the existing ones and yet provide tighter (and often optimal) bounds.
Structural causal models (SCMs) provide a principled approach to identifying causation from observational and experimental data in disciplines ranging from economics to medicine. SCMs, however, require domain knowledge, which is typically represented as graphical models. A key challenge in this context is the absence of a methodological framework for encoding priors (background knowledge) into causal models in a systematic manner. We propose an abstraction called causal knowledge hierarchy (CKH) for encoding priors into causal models. Our approach is based on the foundation of "levels of evidence" in medicine, with a focus on confidence in causal information. Using CKH, we present a methodological framework for encoding causal priors from various data sources and combining them to derive an SCM. We evaluate our approach on a simulated dataset and demonstrate overall performance compared to the ground truth causal model with sensitivity analysis.
There has been significant recent interest in quantum neural networks (QNNs), along with their applications in diverse domains. Current solutions for QNNs pose significant challenges concerning their scalability, ensuring that the postulates of quantum mechanics are satisfied and that the networks are physically realizable. The exponential state space of QNNs poses challenges for the scalability of training procedures. The no-cloning principle prohibits making multiple copies of training samples, and the measurement postulates lead to non-deterministic loss functions. Consequently, the physical realizability and efficiency of existing approaches that rely on repeated measurement of several copies of each sample for training QNNs are unclear. This paper presents a new model for QNNs that relies on band-limited Fourier expansions of transfer functions of quantum perceptrons (QPs) to design scalable training procedures. This training procedure is augmented with a randomized quantum stochastic gradient descent technique that eliminates the need for sample replication. We show that this training procedure converges to the true minima in expectation, even in the presence of non-determinism due to quantum measurement. Our solution has a number of important benefits: (i) using QPs with concentrated Fourier power spectrum, we show that the training procedure for QNNs can be made scalable; (ii) it eliminates the need for resampling, thus staying consistent with the no-cloning rule; and (iii) enhanced data efficiency for the overall training process since each data sample is processed once per epoch. We present a detailed theoretical foundation for our models and methods' scalability, accuracy, and data efficiency. We also validate the utility of our approach through a series of numerical experiments.
Multiple network alignment is the problem of identifying similar and related regions in a given set of networks. While there are a large number of effective techniques for pairwise problems with two networks that scale in terms of edges, these cannot be readily extended to align multiple networks as the computational complexity will tend to grow exponentially with the number of networks.In this paper we introduce a new multiple network alignment algorithm and framework that is effective at aligning thousands of networks with thousands of nodes. The key enabling technique of our algorithm is identifying an exact and easy to compute low-rank tensor structure inside of a principled heuristic procedure for pairwise network alignment called IsoRank. This can be combined with a new algorithm for $k$-dimensional matching problems on low-rank tensors to produce the alignment. We demonstrate results on synthetic and real-world problems that show our technique (i) is as good or better in terms of quality as existing methods, when they work on small problems, while running considerably faster and (ii) is able to scale to aligning a number of networks unreachable by current methods. We show in this paper that our method is the realistic choice for aligning multiple networks when no prior information is present.
Convex optimization problems arise frequently in diverse machine learning (ML) applications. First-order methods, i.e., those that solely rely on the gradient information, are most commonly used to solve these problems. This choice is motivated by their simplicity and low per-iteration cost. Second-order methods that rely on curvature information through the dense Hessian matrix have, thus far, proven to be prohibitively expensive at scale, both in terms of computational and memory requirements. We present a novel multi-GPU distributed formulation of a second order (Newton-type) solver for convex finite sum minimization problems for multi-class classification. Our distributed formulation relies on the Alternating Direction of Multipliers Method (ADMM), which requires only one round of communication per-iteration -- significantly reducing communication overheads, while incurring minimal convergence overhead. By leveraging the computational capabilities of GPUs, we demonstrate that per-iteration costs of Newton-type methods can be significantly reduced to be on-par with, if not better than, state-of-the-art first-order alternatives. Given their significantly faster convergence rates, we demonstrate that our methods can process large data-sets in much shorter time (orders of magnitude in many cases) compared to existing first and second order methods, while yielding similar test-accuracy results.
Recent neuroimaging studies have shown that functional connectomes are unique to individuals, i.e., two distinct fMRIs taken over different sessions of the same subject are more similar in terms of their connectomes than those from two different subjects. In this study, we present significant new results that identify specific parts of the connectome that code the unique signatures. We show that a very small part of the connectome (under 100 features from among over 64K total features) is responsible for the signatures. A network of these features is shown to achieve excellent training and test accuracy in matching imaging datasets. We show that these features are statistically significant, robust to perturbations, invariant across populations, and are localized to a small number of structural regions of the brain (12 regions from among 180). We develop an innovative matrix sampling technique to derive computationally efficient and accurate methods for identifying the discriminating sub-connectome and support all of our claims using state of the art statistical tests and computational techniques.