Abstract:We study distributionally robust optimization (DRO) with Sinkhorn distance -- a variant of Wasserstein distance based on entropic regularization. We derive convex programming dual reformulations for two cases: when the nominal distribution is an empirical distribution and when it is a general distribution. Compared with Wasserstein DRO, our formulation is computationally tractable for a larger class of loss functions, and its worst-case distribution is more reasonable. To solve the dual reformulation, we propose an efficient batch gradient descent with a bisection search algorithm. Finally, we provide various numerical examples using both synthetic and real data to demonstrate its competitive performance.
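As a rough illustration of the entropic regularization that this abstract refers to, the sketch below computes an entropy-regularized transport cost between two discrete distributions with the standard Sinkhorn iterations. It is not the dual reformulation or the algorithm of the paper; the function name `sinkhorn_cost`, the regularization strength `eps`, and the iteration count are assumptions chosen for illustration.

```python
import numpy as np

def sinkhorn_cost(a, b, C, eps=0.1, n_iter=500):
    """Entropy-regularized transport between histograms a and b.

    a, b : 1-D probability vectors (assumed strictly positive).
    C    : cost matrix of shape (len(a), len(b)).
    eps  : regularization strength (hypothetical default).
    Returns the transport cost <P, C> and the transport plan P.
    """
    K = np.exp(-C / eps)              # Gibbs kernel
    u = np.ones_like(a, dtype=float)
    for _ in range(n_iter):
        v = b / (K.T @ u)             # rescale to match column marginals
        u = a / (K @ v)               # rescale to match row marginals
    P = u[:, None] * K * v[None, :]   # plan of the form diag(u) K diag(v)
    return float(np.sum(P * C)), P
```

A smaller `eps` brings the plan closer to the unregularized optimal transport plan, at the price of slower convergence and possible numerical underflow in `K`.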
Abstract:This paper is concerned with regularized extensions of hierarchical non-stationary temporal Gaussian processes (NSGPs) in which the parameters (e.g., length-scale) are modeled as GPs. In particular, we consider two commonly used NSGP constructions, which are based on explicitly constructed non-stationary covariance functions and stochastic differential equations, respectively. We extend these NSGPs by including $L^1$-regularization on the processes in order to induce sparseness. To solve the resulting regularized NSGP (R-NSGP) regression problem, we develop a method based on the alternating direction method of multipliers (ADMM) and analyze its convergence properties theoretically. We also evaluate the performance of the proposed methods on simulated and real-world datasets.
Abstract:Understanding the generalization capability of learning algorithms is at the heart of statistical learning theory. In this paper, we investigate the generalization gap of stochastic gradient Langevin dynamics (SGLD), a widely used optimizer for training deep neural networks (DNNs). We derive an algorithm-dependent generalization bound by analyzing SGLD through an information-theoretic lens. Our analysis reveals an intricate trade-off between learning and information dissipation: SGLD learns from data by updating parameters at each iteration while dissipating information from early training stages. Our bound also involves the variance of gradients which captures a particular kind of "sharpness" of the loss landscape. The main proof techniques in this paper rely on strong data processing inequalities -- a fundamental concept in information theory -- and Otto-Villani's HWI inequality. Finally, we demonstrate our bound through numerical experiments, showing that it can predict the behavior of the true generalization gap.
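For readers unfamiliar with SGLD, the following minimal sketch shows the commonly used update: a minibatch gradient step plus isotropic Gaussian noise. It is only meant to make the "updating parameters while injecting noise" picture concrete and does not reproduce the paper's bound or experiments. The step size `eta`, inverse temperature `beta`, batch size, and the toy loss in `mean_sq_grad` are assumptions.

```python
import numpy as np

def sgld(grad_fn, theta0, data, eta=1e-2, beta=1e4,
         batch_size=16, n_steps=2000, seed=0):
    """Run SGLD: theta <- theta - eta * g + sqrt(2 * eta / beta) * N(0, I)."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    n = len(data)
    for _ in range(n_steps):
        idx = rng.choice(n, size=batch_size, replace=False)  # draw a minibatch
        g = grad_fn(theta, data[idx])                        # stochastic gradient
        noise = rng.standard_normal(theta.shape)             # injected Gaussian noise
        theta = theta - eta * g + np.sqrt(2.0 * eta / beta) * noise
    return theta

def mean_sq_grad(theta, batch):
    """Gradient of the toy loss 0.5 * mean_i ||theta - x_i||^2 over the batch."""
    return theta - batch.mean(axis=0)
```

With this toy loss the iterates settle near the sample mean of the data, with fluctuations driven by both the minibatch sampling and the injected noise.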
Abstract:Deep segmentation models that generalize to images with unknown appearance are important for real-world medical image analysis. Retraining models leads to high latency and complex pipelines, which are impractical in clinical settings. The situation becomes more severe for ultrasound image analysis because of its large appearance shifts. In this paper, we propose a novel method for robust segmentation under unknown appearance shifts. Our contribution is three-fold. First, we advance a one-stage plug-and-play solution by embedding hierarchical style transfer units into a segmentation architecture. Our solution can remove appearance shifts and perform segmentation simultaneously. Second, we adopt Dynamic Instance Normalization to conduct precise and dynamic style transfer in a learnable manner, rather than the previously used fixed style normalization. Third, our solution is fast and lightweight for routine clinical adoption. Given a 400×400 image input, our solution only needs an additional 0.2 ms and 1.92M FLOPs to handle appearance shifts compared to the baseline pipeline. Extensive experiments conducted on a large dataset from three vendors demonstrate that our proposed method enhances the robustness of deep segmentation models.
Abstract:In a sequential decision-making problem, off-policy evaluation (OPE) estimates the expected cumulative reward of a target policy using logged transition data generated from a different behavior policy, without executing the target policy. Reinforcement learning in high-stakes environments, such as healthcare and education, is often limited to off-policy settings due to safety or ethical concerns, or the inability to explore. Hence it is imperative to quantify the uncertainty of the off-policy estimate before deploying the target policy. In this paper, we propose a novel framework that provides robust and optimistic cumulative reward estimates with statistical guarantees, and we develop non-asymptotic as well as asymptotic confidence intervals for OPE, leveraging methodologies from distributionally robust optimization. Our theoretical results are also supported by empirical analysis.
Abstract:We develop a projected Wasserstein distance for the two-sample test, a fundamental problem in statistics and machine learning: given two sets of samples, determine whether they are from the same distribution. In particular, we aim to circumvent the curse of dimensionality in Wasserstein distance: when the dimension is high, it has diminishing testing power, which is inherently due to the slow concentration property of Wasserstein metrics in high-dimensional spaces. A key contribution is to couple the test with an optimal projection, that is, to find the low-dimensional linear mapping that maximizes the Wasserstein distance between the projected probability distributions. We characterize the finite-sample convergence rate on integral probability metrics (IPMs) and present practical algorithms for computing this metric. Numerical examples validate our theoretical results.
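To make the idea of maximizing the Wasserstein distance over linear projections concrete, here is a deliberately crude sketch. It restricts the projection to one dimension and replaces the optimization with a random search over unit directions, neither of which is claimed to match the paper's algorithm. The helper names `w1_1d` and `projected_w1` and the number of directions are assumptions, and the two samples are assumed to have equal size.

```python
import numpy as np

def w1_1d(x, y):
    """1-Wasserstein distance between two equal-sized 1-D samples (via sorting)."""
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

def projected_w1(X, Y, n_dirs=500, seed=0):
    """Largest 1-D Wasserstein distance over randomly drawn unit directions.

    X, Y : arrays of shape (n, d) with the same n.
    Returns the best distance found and the corresponding direction.
    """
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((n_dirs, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # normalize to unit length
    best_val, best_dir = -1.0, None
    for d in dirs:
        val = w1_1d(X @ d, Y @ d)    # distance between the projected samples
        if val > best_val:
            best_val, best_dir = val, d
    return best_val, best_dir
```

If the two samples differ mainly along one coordinate, the best direction found tends to align with that coordinate, which is the intuition behind projecting before comparing.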
Abstract:Ultrasound (US) image segmentation has improved significantly in the deep learning era. However, the lack of sharp boundaries in US images remains an inherent challenge for segmentation. Previous methods often resort to global context, multi-scale cues or auxiliary guidance to estimate the boundaries. It is hard for these methods to approach pixel-level learning for fine-grained boundary generation. In this paper, we propose a novel and effective framework to improve boundary estimation in US images. Our work has three highlights. First, we propose to formulate boundary estimation as a rendering task, which can recognize ambiguous points (pixels/voxels) and calibrate the boundary prediction via enriched feature representation learning. Second, we introduce point-wise contrastive learning to enhance the similarity of points from the same class and contrastively decrease the similarity of points from different classes. Boundary ambiguities are therefore further addressed. Third, both rendering and contrastive learning tasks contribute to consistent improvement while reducing network parameters. As a proof of concept, we performed validation experiments on a challenging dataset of 86 ovarian US volumes. Results show that our proposed method outperforms state-of-the-art methods and has the potential to be used in clinical practice.
Abstract:Wasserstein distributionally robust optimization (DRO) aims to find robust and generalizable solutions by hedging against data perturbations in Wasserstein distance. Despite its recent empirical success in operations research and machine learning, existing performance guarantees for generic loss functions are either overly conservative due to the curse of dimensionality, or plausible only in large-sample asymptotics. In this paper, we develop a non-asymptotic framework for analyzing the out-of-sample performance of Wasserstein robust learning and the generalization bound for its related Lipschitz and gradient regularization problems. To the best of our knowledge, this gives the first finite-sample guarantee for generic Wasserstein DRO problems without suffering from the curse of dimensionality. Our results highlight the bias-variation trade-off intrinsic to Wasserstein DRO, which automatically balances between the empirical mean of the loss and the variation of the loss, measured by the Lipschitz norm or the gradient norm of the loss. Our analysis is based on two novel methodological developments which are of independent interest: 1) a new concentration inequality characterizing the decay rate of large deviation probabilities by the variation of the loss and 2) a localized Rademacher complexity theory based on the variation of the loss.
Abstract:Learning a robust classifier from a few samples remains a key challenge in machine learning. A major thrust of research in few-shot classification has been based on metric learning to capture similarities between samples and then perform the $k$-nearest neighbor algorithm. To make such an algorithm more robust, in this paper, we propose a distributionally robust $k$-nearest neighbor algorithm, Dr.k-NN, which assigns minimax optimal weights to training samples when performing classification. We also couple it with neural-network-based feature embedding. We demonstrate the competitive performance of our algorithm compared with the state of the art in the few-shot learning setting with various real-data experiments.
Abstract:Deep Neural Networks (DNNs) suffer from performance degradation when image appearance shift occurs, especially in ultrasound (US) image segmentation. In this paper, we propose a novel and intuitive framework to remove the appearance shift, and hence improve the generalization ability of DNNs. Our work has three highlights. First, we follow the spirit of universal style transfer to remove appearance shifts, which has not been explored before for US images. Without sacrificing image structure details, it enables arbitrary style-content transfer. Second, accelerated with an Adaptive Instance Normalization block, our framework achieves the real-time speed required in clinical US scanning. Third, an efficient and effective style image selection strategy is proposed to ensure that the target-style US image and the testing content US image properly match each other. Experiments on two large US datasets demonstrate that our methods are superior to state-of-the-art methods at making DNNs robust against various appearance shifts.
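The Adaptive Instance Normalization operation mentioned in this abstract is usually defined as re-scaling each channel of the content features so that its mean and standard deviation match those of the style features. The sketch below shows that operation on plain arrays of shape (channels, height, width). It is only the normalization step, not the framework described in the abstract; the function name `adain` and the stabilizing constant `eps` are assumptions.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Align per-channel mean and std of `content` features to those of `style`.

    content, style : arrays of shape (C, H, W); spatial sizes may differ.
    eps            : small constant added to the variance for stability.
    """
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = np.sqrt(content.var(axis=(1, 2), keepdims=True) + eps)
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = np.sqrt(style.var(axis=(1, 2), keepdims=True) + eps)
    # Normalize the content statistics, then apply the style statistics.
    return s_std * (content - c_mean) / c_std + s_mean
```

Because the operation has no learned parameters and only needs per-channel statistics, it is cheap to apply at inference time.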