We propose a new method for learning with multi-field categorical data. Multi-field categorical data are usually collected over many heterogeneous groups. These groups can reflect in the categories under a field. The existing methods try to learn a universal model that fits all data, which is challenging and inevitably results in learning a complex model. In contrast, we propose a field-wise learning method leveraging the natural structure of data to learn simple yet efficient one-to-one field-focused models with appropriate constraints. In doing this, the models can be fitted to each category and thus can better capture the underlying differences in data. We present a model that utilizes linear models with variance and low-rank constraints, to help it generalize better and reduce the number of parameters. The model is also interpretable in a field-wise manner. As the dimensionality of multi-field categorical data can be very high, the models applied to such data are mostly over-parameterized. Our theoretical analysis can potentially explain the effect of over-parametrization on the generalization of our model. It also supports the variance constraints in the learning objective. The experiment results on two large-scale datasets show the superior performance of our model, the trend of the generalization error bound, and the interpretability of learning outcomes. Our code is available at https://github.com/lzb5600/Field-wise-Learning.
Scene parsing from images is a fundamental yet challenging problem in visual content understanding. In this dense prediction task, the parsing model assigns every pixel to a categorical label, which requires the contextual information of adjacent image patches. So the challenge for this learning task is to simultaneously describe the geometric and semantic properties of objects or a scene. In this paper, we explore the effective use of multi-layer feature outputs of the deep parsing networks for spatial-semantic consistency by designing a novel feature aggregation module to generate the appropriate global representation prior, to improve the discriminative power of features. The proposed module can auto-select the intermediate visual features to correlate the spatial and semantic information. At the same time, the multiple skip connections form a strong supervision, making the deep parsing network easy to train. Extensive experiments on four public scene parsing datasets prove that the deep parsing network equipped with the proposed feature aggregation module can achieve very promising results.
Generating natural sentences from images is a fundamental learning task for visual-semantic understanding in multimedia. In this paper, we propose to apply dual attention on pyramid image feature maps to fully explore the visual-semantic correlations and improve the quality of generated sentences. Specifically, with the full consideration of the contextual information provided by the hidden state of the RNN controller, the pyramid attention can better localize the visually indicative and semantically consistent regions in images. On the other hand, the contextual information can help re-calibrate the importance of feature components by learning the channel-wise dependencies, to improve the discriminative power of visual features for better content description. We conducted comprehensive experiments on three well-known datasets: Flickr8K, Flickr30K and MS COCO, which achieved impressive results in generating descriptive and smooth natural sentences from images. Using either convolution visual features or more informative bottom-up attention features, our composite captioning model achieves very promising performance in a single-model mode. The proposed pyramid attention and dual attention methods are highly modular, which can be inserted into various image captioning modules to further improve the performance.
We develop in this paper a framework of empirical gain maximization (EGM) to address the robust regression problem where heavy-tailed noise or outliers may present in the response variable. The idea of EGM is to approximate the density function of the noise distribution instead of approximating the truth function directly as usual. Unlike the classical maximum likelihood estimation that encourages equal importance of all observations and could be problematic in the presence of abnormal observations, EGM schemes can be interpreted from a minimum distance estimation viewpoint and allow the ignorance of those observations. Furthermore, it is shown that several well-known robust nonconvex regression paradigms, such as Tukey regression and truncated least square regression, can be reformulated into this new framework. We then develop a learning theory for EGM, by means of which a unified analysis can be conducted for these well-established but not fully-understood regression approaches. Resulting from the new framework, a novel interpretation of existing bounded nonconvex loss functions can be concluded. Within this new framework, the two seemingly irrelevant terminologies, the well-known Tukey's biweight loss for robust regression and the triweight kernel for nonparametric smoothing, are closely related. More precisely, it is shown that the Tukey's biweight loss can be derived from the triweight kernel. Similarly, other frequently employed bounded nonconvex loss functions in machine learning such as the truncated square loss, the Geman-McClure loss, and the exponential squared loss can also be reformulated from certain smoothing kernels in statistics. In addition, the new framework enables us to devise new bounded nonconvex loss functions for robust learning.
As one of the triumphs and milestones of robust statistics, Huber regression plays an important role in robust inference and estimation. It has also been finding a great variety of applications in machine learning. In a parametric setup, it has been extensively studied. However, in the statistical learning context where a function is typically learned in a nonparametric way, there is still a lack of theoretical understanding of how Huber regression estimators learn the conditional mean function and why it works in the absence of light-tailed noise assumptions. To address these fundamental questions, we conduct an assessment of Huber regression from a statistical learning viewpoint. First, we show that the usual risk consistency property of Huber regression estimators, which is usually pursued in machine learning, cannot guarantee their learnability in mean regression. Second, we argue that Huber regression should be implemented in an adaptive way to perform mean regression, implying that one needs to tune the scale parameter in accordance with the sample size and the moment condition of the noise. Third, with an adaptive choice of the scale parameter, we demonstrate that Huber regression estimators can be asymptotic mean regression calibrated under $(1+\epsilon)$-moment conditions ($\epsilon>0$). Last but not least, under the same moment conditions, we establish almost sure convergence rates for Huber regression estimators. Note that the $(1+\epsilon)$-moment conditions accommodate the special case where the response variable possesses infinite variance and so the established convergence rates justify the robustness feature of Huber regression estimators. In the above senses, the present study provides a systematic statistical learning assessment of Huber regression estimators and justifies their merits in terms of robustness from a theoretical viewpoint.
Distributed machine learning systems have been receiving increasing attentions for their efficiency to process large scale data. Many distributed frameworks have been proposed for different machine learning tasks. In this paper, we study the distributed kernel regression via the divide and conquer approach. This approach has been proved asymptotically minimax optimal if the kernel is perfectly selected so that the true regression function lies in the associated reproducing kernel Hilbert space. However, this is usually, if not always, impractical because kernels that can only be selected via prior knowledge or a tuning process are hardly perfect. Instead it is more common that the kernel is good enough but imperfect in the sense that the true regression can be well approximated by but does not lie exactly in the kernel space. We show distributed kernel regression can still achieves capacity independent optimal rate in this case. To this end, we first establish a general framework that allows to analyze distributed regression with response weighted base algorithms by bounding the error of such algorithms on a single data set, provided that the error bounds has factored the impact of the unexplained variance of the response variable. Then we perform a leave one out analysis of the kernel ridge regression and bias corrected kernel ridge regression, which in combination with the aforementioned framework allows us to derive sharp error bounds and capacity independent optimal rates for the associated distributed kernel regression algorithms. As a byproduct of the thorough analysis, we also prove the kernel ridge regression can achieve rates faster than $N^{-1}$ (where $N$ is the sample size) in the noise free setting which, to our best knowledge, are first observed and novel in regression learning.
The challenges of high intra-class variance yet low inter-class fluctuations in fine-grained visual categorization are more severe with few labeled samples, \textit{i.e.,} Fine-Grained categorization problems under the Few-Shot setting (FGFS). High-order features are usually developed to uncover subtle differences between sub-categories in FGFS, but they are less effective in handling the high intra-class variance. In this paper, we propose a Target-Oriented Alignment Network (TOAN) to investigate the fine-grained relation between the target query image and support classes. The feature of each support image is transformed to match the query ones in the embedding feature space, which reduces the disparity explicitly within each category. Moreover, different from existing FGFS approaches devise the high-order features over the global image with less explicit consideration of discriminative parts, we generate discriminative fine-grained features by integrating compositional concept representations to global second-order pooling. Extensive experiments are conducted on four fine-grained benchmarks to demonstrate the effectiveness of TOAN compared with the state-of-the-art models.
Distributed training is useful to train complicated models to shorten the training time. As each of the workers only sees a small fraction of data, workers need to synchronize on the parameter updates. One of the central questions in distributed training is how to parsimoniously synchronize parameters while preserving model quality. To address this problem, we propose the \textbf{ShadowSync} framework, in which we isolate synchronization from training and run it in the background. In contrast to common strategies including synchronous stochastic gradient descent (SGD), asynchronous SGD, and model averaging on independently trained sub-models, where synchronization happens in the foreground, ShadowSync synchronization is neither part of the backward pass, nor happens every $k$ iterations. Our framework is generic to host various types of synchronization algorithms, and we propose 3 approaches under this theme. The superiority of ShadowSync is confirmed by experiments on training deep neural networks for click-through-rate prediction. Our methods all succeed in making the training throughput linearly scale with the number of trainers. Comparing to their foreground counterparts, our methods exhibit neutral to better model quality and better scalability when we keep the number of parameter servers the same. In our training system which expresses both replication and Hogwild parallelism, ShadowSync also accomplishes the highest example level parallelism number comparing to the prior arts.
Cross-domain person re-identification (re-ID) is challenging due to the bias between training and testing domains. We observe that if backgrounds in the training and testing datasets are very different, it dramatically introduces difficulties to extract robust pedestrian features, and thus compromises the cross-domain person re-ID performance. In this paper, we formulate such problems as a background shift problem. A Suppression of Background Shift Generative Adversarial Network (SBSGAN) is proposed to generate images with suppressed backgrounds. Unlike simply removing backgrounds using binary masks, SBSGAN allows the generator to decide whether pixels should be preserved or suppressed to reduce segmentation errors caused by noisy foreground masks. Additionally, we take ID-related cues, such as vehicles and companions into consideration. With high-quality generated images, a Densely Associated 2-Stream (DA-2S) network is introduced with Inter Stream Densely Connection (ISDC) modules to strengthen the complementarity of the generated data and ID-related cues. The experiments show that the proposed method achieves competitive performance on three re-ID datasets, ie., Market-1501, DukeMTMC-reID, and CUHK03, under the cross-domain person re-ID scenario.
Deep neural networks have demonstrated advanced abilities on various visual classification tasks, which heavily rely on the large-scale training samples with annotated ground-truth. However, it is unrealistic always to require such annotation in real-world applications. Recently, Few-Shot learning (FS), as an attempt to address the shortage of training samples, has made significant progress in generic classification tasks. Nonetheless, it is still challenging for current FS models to distinguish the subtle differences between fine-grained categories given limited training data. To filling the classification gap, in this paper, we address the Few-Shot Fine-Grained (FSFG) classification problem, which focuses on tackling the fine-grained classification under the challenging few-shot learning setting. A novel low-rank pairwise bilinear pooling operation is proposed to capture the nuanced differences between the support and query images for learning an effective distance metric. Moreover, a feature alignment layer is designed to match the support image features with query ones before the comparison. We name the proposed model Low-Rank Pairwise Alignment Bilinear Network (LRPABN), which is trained in an end-to-end fashion. Comprehensive experimental results on four widely used fine-grained classification datasets demonstrate that our LRPABN model achieves the superior performances compared to state-of-the-art methods.