Few-shot learning is challenging because both data and labels are very limited. Recent studies on big transfer (BiT) show that few-shot learning can greatly benefit from pretraining on large-scale labeled datasets in a different domain. This paper asks a more challenging question: "can we use as few labels as possible for few-shot learning in both pretraining (with no labels) and finetuning (with fewer labels)?". Our key insight is that the clustering of target samples in the feature space is all we need for few-shot finetuning. This explains why vanilla unsupervised pretraining (which yields poor clustering) is worse than supervised pretraining. In this paper, we propose transductive unsupervised pretraining, which achieves better clustering by involving target data even though their amount is very limited. The improved clustering result is of great value for identifying the most representative samples ("eigen-samples") for users to label, and in return, continued finetuning with the labeled eigen-samples further improves the clustering. Thus, we propose eigen-finetuning to enable fewer-shot learning by leveraging the co-evolution of clustering and eigen-samples during finetuning. We conduct experiments on 10 different few-shot target datasets, and our average few-shot performance outperforms both vanilla inductive unsupervised transfer and supervised transfer by a large margin. For instance, when each target category has only 10 labeled samples, the mean accuracy gains over these two baselines are 9.2% and 3.42%, respectively.
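To make the eigen-sample idea concrete, here is a minimal sketch in which the sample nearest to each cluster centroid is chosen as the representative one to label; k-means, the feature dimensions, and the helper name select_eigen_samples are illustrative assumptions, not the paper's exact procedure.

```python
# A minimal sketch of eigen-sample selection via feature clustering.
import numpy as np
from sklearn.cluster import KMeans

def select_eigen_samples(features: np.ndarray, n_labels: int) -> np.ndarray:
    """Cluster target features and return the index of the sample
    closest to each centroid, i.e. the most representative ones."""
    km = KMeans(n_clusters=n_labels, n_init=10).fit(features)
    eigen_idx = []
    for c in km.cluster_centers_:
        eigen_idx.append(np.argmin(np.linalg.norm(features - c, axis=1)))
    return np.asarray(eigen_idx)

# Usage: ask a user to label only these representative samples, then
# continue finetuning and re-cluster with the new labels.
feats = np.random.randn(500, 128)           # placeholder target features
to_label = select_eigen_samples(feats, 10)  # 10 samples to annotate
```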
While benefiting people's daily lives in many ways, smartphones and their location-based services generate massive mobile device location data that has great potential to help us understand travel demand patterns and plan transportation for the future. While recent studies have analyzed human travel behavior using such new data sources, limited research has been done on extracting multimodal travel demand patterns from them. This paper presents a data-driven analytical framework to bridge the gap. To successfully detect travel modes from the passively collected location information, we conduct a smartphone-based GPS survey to collect ground truth observations. We then develop a jointly trained single-layer model and deep neural network for travel mode imputation. Being "wide" and "deep" at the same time, this model combines the advantages of both types of models. The framework also incorporates the multimodal transportation network in order to evaluate the closeness of trip routes to nearby rail, metro, highway and bus lines and thereby enhance the imputation accuracy. To showcase the applications of the introduced framework in answering real-world planning needs, a separate mobile device location dataset is processed through trip end identification and attribute generation so that the travel mode imputation can be directly applied. The estimated multimodal travel demand patterns are then validated against typical household travel surveys in the Washington, D.C. and Baltimore metropolitan regions.
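A minimal sketch of the wide-and-deep idea follows: a single-layer (wide) model and a deep network share the input and are trained jointly by summing their logits. Layer sizes, the feature count, and the mode list are placeholders, not the paper's configuration.

```python
# A minimal sketch of a jointly trained "wide and deep" classifier for
# travel mode imputation on tabular trip features.
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    def __init__(self, n_features: int, n_modes: int):
        super().__init__()
        self.wide = nn.Linear(n_features, n_modes)   # single layer: memorization
        self.deep = nn.Sequential(                   # deep net: generalization
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, n_modes),
        )

    def forward(self, x):
        return self.wide(x) + self.deep(x)           # summed logits, joint training

model = WideAndDeep(n_features=20, n_modes=5)  # e.g. drive/rail/metro/bus/walk
logits = model(torch.randn(8, 20))             # a batch of 8 trips
```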
In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks. These two tasks aim at reading and understanding scene text in images for question answering and image caption generation, respectively. In contrast to conventional vision-language pre-training, which fails to capture scene text and its relationship with the visual and text modalities, TAP explicitly incorporates scene text (generated from OCR engines) in pre-training. With three pre-training tasks, including masked language modeling (MLM), image-text (contrastive) matching (ITM), and relative (spatial) position prediction (RPP), TAP effectively helps the model learn a better-aligned representation among the three modalities: text word, visual object, and scene text. Thanks to this aligned representation learning, even when pre-trained on the same downstream task dataset, TAP already boosts the absolute accuracy on the TextVQA dataset by +5.4% compared with a non-TAP baseline. To further improve the performance, we build a large-scale dataset based on the Conceptual Caption dataset, named OCR-CC, which contains 1.4 million scene text-related image-text pairs. Pre-trained on this OCR-CC dataset, our approach outperforms the state of the art by large margins on multiple tasks, i.e., +8.3% accuracy on TextVQA, +8.6% accuracy on ST-VQA, and +10.2 CIDEr score on TextCaps.
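As a rough illustration, the three objectives could be combined as below; the equal weighting, hidden size, vocabulary size, and number of spatial-relation classes are assumptions for the sketch, not TAP's reported configuration.

```python
# A sketch of combining the three TAP-style pre-training objectives
# (MLM, ITM, RPP) with assumed head sizes; all dimensions are illustrative.
import torch
import torch.nn as nn

mlm_head = nn.Linear(768, 30522)   # vocab logits for masked language modeling
itm_head = nn.Linear(768, 2)       # matched / mismatched image-text pair
rpp_head = nn.Linear(768, 12)      # relative spatial relation classes

def tap_loss(seq_feats, cls_feat, mlm_labels, itm_label, rpp_labels):
    ce = nn.CrossEntropyLoss(ignore_index=-100)  # -100 marks unmasked tokens
    l_mlm = ce(mlm_head(seq_feats).flatten(0, 1), mlm_labels.flatten())
    l_itm = ce(itm_head(cls_feat), itm_label)
    l_rpp = ce(rpp_head(seq_feats).flatten(0, 1), rpp_labels.flatten())
    return l_mlm + l_itm + l_rpp                 # equal weighting assumed

B, L = 2, 16                                     # tiny batch for demonstration
loss = tap_loss(torch.randn(B, L, 768), torch.randn(B, 768),
                torch.randint(0, 30522, (B, L)), torch.randint(0, 2, (B,)),
                torch.randint(0, 12, (B, L)))
```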
In this paper, we present a large-scale unlabeled person re-identification (Re-ID) dataset, "LUPerson", and make the first attempt at unsupervised pre-training for improving the generalization ability of learned person Re-ID feature representations. This addresses the problem that all existing person Re-ID datasets are of limited scale due to the costly effort required for data annotation. Previous research tries to leverage models pre-trained on ImageNet to mitigate the shortage of person Re-ID data, but suffers from the large domain gap between ImageNet and person Re-ID data. LUPerson is an unlabeled dataset of 4M images of over 200K identities, 30X larger than the largest existing Re-ID dataset. It also covers a much more diverse range of capturing environments (e.g., camera settings, scenes, etc.). Based on this dataset, we systematically study the key factors for learning Re-ID features from two perspectives: data augmentation and contrastive loss. Unsupervised pre-training on this large-scale dataset effectively yields a generic Re-ID feature that can benefit all existing person Re-ID methods. Using our pre-trained model in some basic frameworks, our methods achieve state-of-the-art results without bells and whistles on four widely used Re-ID datasets: CUHK03, Market1501, DukeMTMC, and MSMT17. Our results also show that the performance improvement is more significant on small-scale target datasets or under the few-shot setting.
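For concreteness, here is a minimal sketch of the InfoNCE-style contrastive loss commonly used in such unsupervised pre-training; the temperature and batch handling are illustrative and may differ from the paper's exact recipe.

```python
# A minimal sketch of an InfoNCE-style contrastive loss over two
# augmented views of the same unlabeled person images.
import torch
import torch.nn.functional as F

def info_nce(q, k, temperature: float = 0.07):
    """q, k: (N, D) embeddings of two augmented views of the same batch.
    Each q[i] should match k[i] and repel all other keys in the batch."""
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / temperature   # (N, N) similarity matrix
    targets = torch.arange(q.size(0))  # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
```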
Despite the great success of deep models on hyperspectral imagery (HSI) super-resolution (SR) for simulated data, most of them perform unsatisfactorily when applied to real data, especially unsupervised HSI SR methods. One of the main reasons is that the predefined degeneration models (e.g., blur in the spatial domain) utilized by most HSI SR methods often deviate greatly from the real ones, which causes these deep models to overfit and ultimately degrades their performance on real data. To mitigate this problem, we explore an unsupervised blind HSI SR method. Specifically, we investigate how to effectively obtain the degeneration models in the spatial and spectral domains, respectively, and make them compatible with the fusion-based SR reconstruction model. To this end, we first propose an alternating optimization based deep framework to estimate the degeneration models and reconstruct the latent image, with which degeneration estimation and HSI reconstruction mutually promote each other. Then, a meta-learning based mechanism is further proposed to pre-train the network, which effectively improves the speed and the generalization ability to adapt to different complex degenerations. Experiments on three benchmark HSI SR datasets demonstrate the clear superiority of the proposed method over other competing methods on the blind HSI fusion problem.
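A minimal, runnable sketch of the alternating idea, restricted to a spatial blur kernel and plain gradient updates: the latent image and the degeneration model are optimized in turn against the observation. The paper uses dedicated sub-networks and both spatial and spectral degenerations; everything below is a simplified assumption.

```python
# Alternating optimization for blind SR, spatial domain only: fix the
# kernel to update the latent image, then fix the image to update the
# kernel. Shapes, losses, and step counts are illustrative.
import torch
import torch.nn.functional as F

lr_hsi = torch.rand(1, 31, 16, 16)                       # observed low-res HSI
latent = torch.rand(1, 31, 64, 64, requires_grad=True)   # latent high-res HSI
kernel = torch.full((1, 1, 9, 9), 1 / 81, requires_grad=True)  # blur kernel

def degrade(x, k):
    k = k.expand(x.size(1), 1, -1, -1)               # share kernel across bands
    x = F.conv2d(x, k, padding=4, groups=x.size(1))  # spatial blur
    return x[..., ::4, ::4]                          # 4x downsampling

opt_img = torch.optim.Adam([latent], lr=1e-2)
opt_ker = torch.optim.Adam([kernel], lr=1e-3)
for _ in range(100):
    opt_img.zero_grad()                              # step 1: fix kernel, fit image
    F.mse_loss(degrade(latent, kernel), lr_hsi).backward()
    opt_img.step()
    opt_ker.zero_grad()                              # step 2: fix image, fit kernel
    F.mse_loss(degrade(latent, kernel), lr_hsi).backward()
    opt_ker.step()
```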
Learning to generate a task-aware base learner has proven a promising direction for dealing with the few-shot learning (FSL) problem. Existing methods mainly focus on generating an embedding model used with a fixed metric (e.g., cosine distance) for nearest-neighbour classification, or on directly generating a linear classifier. However, due to the limited discriminative capacity of such a simple metric or classifier, these methods fail to generalize appropriately to challenging cases. To mitigate this problem, we present a novel deep metric meta-generation method that takes an orthogonal direction, i.e., learning to adaptively generate a specific metric for a new FSL task based on the task description (e.g., a few labelled samples). In this study, we structure the metric as a three-layer deep attentive network that is flexible enough to produce a discriminative metric for each task. Moreover, unlike existing methods that use a uni-modal weight distribution conditioned on labelled samples for network generation, the proposed meta-learner establishes a multi-modal weight distribution conditioned on cross-class sample pairs using a tailored variational autoencoder, which can separately capture the inter-class discrepancy statistics for each class and jointly embed the statistics of all classes into metric generation. In this way, the generated metric can be appropriately adapted to a new FSL task with pleasing generalization performance. To demonstrate this, we test the proposed method on four benchmark FSL datasets and obtain remarkable performance improvements over state-of-the-art competitors, especially in challenging cases, e.g., improving the accuracy from 26.14% to 46.69% in the 20-way 1-shot task on miniImageNet, and from 45.2% to 68.72% in the 5-way 1-shot task on FC100. Code is available at: https://github.com/NWPUZhoufei/DAM.
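To illustrate what "generating a metric" replaces, the sketch below scores query-prototype pairs with a small three-layer network instead of a fixed cosine distance; the attentive design and VAE-based weight generation from the paper are deliberately omitted, so this is only a schematic.

```python
# A learned deep metric as a drop-in for a fixed metric in an FSL episode.
import torch
import torch.nn as nn

class DeepMetric(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(                 # three-layer metric network
            nn.Linear(2 * dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, query, prototypes):
        """query: (D,), prototypes: (C, D) -> one similarity logit per class."""
        pairs = torch.cat([query.expand_as(prototypes), prototypes], dim=1)
        return self.net(pairs).squeeze(1)

metric = DeepMetric()
logits = metric(torch.randn(64), torch.randn(5, 64))  # a 5-way episode
```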
In this paper, we present MicroNet, an efficient convolutional neural network with extremely low computational cost (e.g., 6 MFLOPs on ImageNet classification). Such a low-cost network is highly desirable on edge devices, yet usually suffers from significant performance degradation. We handle the extremely low FLOPs based on two design principles: (a) avoiding the reduction of network width by lowering node connectivity, and (b) compensating for the reduction of network depth by introducing more complex non-linearity per layer. First, we propose Micro-Factorized convolution to factorize both pointwise and depthwise convolutions into low-rank matrices for a good tradeoff between the number of channels and input/output connectivity. Second, we propose a new activation function, named Dynamic Shift-Max, to improve non-linearity by maxing out multiple dynamic fusions between an input feature map and its circular channel shift. The fusions are dynamic in that their parameters are adapted to the input. Building upon Micro-Factorized convolution and Dynamic Shift-Max, a family of MicroNets achieves a significant performance gain over the state of the art in the low-FLOP regime. For instance, MicroNet-M1 achieves 61.1% top-1 accuracy on ImageNet classification with 12 MFLOPs, outperforming MobileNetV3 by 11.3%.
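A simplified, runnable sketch of a Shift-Max style activation follows: the output is the element-wise maximum over J fusions of the input with its circular channel shift, with fusion coefficients predicted from the input (hence "dynamic"). The gating design and group handling follow the spirit of the description, not the paper's exact formulation.

```python
# A simplified Dynamic Shift-Max style activation.
import torch
import torch.nn as nn

class DynamicShiftMax(nn.Module):
    def __init__(self, channels: int, groups: int = 4, J: int = 2):
        super().__init__()
        self.shift = channels // groups              # circular shift distance
        self.J = J
        self.gate = nn.Sequential(                   # predicts 2 weights per fusion
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 2 * J * channels),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        shifted = torch.roll(x, self.shift, dims=1)        # circular channel shift
        a = self.gate(x).view(b, self.J, 2, c, 1, 1)       # dynamic coefficients
        fused = a[:, :, 0] * x.unsqueeze(1) + a[:, :, 1] * shifted.unsqueeze(1)
        return fused.max(dim=1).values                     # max over J fusions

y = DynamicShiftMax(16)(torch.randn(2, 16, 8, 8))
```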
In this paper we propose a 2D deep residual Unet with 104 convolutional layers (DR-Unet104) for lesion segmentation in brain MRIs. We make multiple additions to the Unet architecture, including adding the 'bottleneck' residual block to the Unet encoder and adding dropout after each convolution block stack. We verified the effect of introducing dropout regularisation with a small rate (e.g., 0.2) on the architecture, and found that a dropout of 0.2 improved overall performance compared to no dropout or a dropout of 0.5. We evaluated the proposed architecture as part of the Multimodal Brain Tumor Segmentation (BraTS) 2020 Challenge and compared our method to DeepLabV3+ with a ResNet-V2-152 backbone. On the validation data, DR-Unet104 achieved mean Dice similarity coefficients (DSC) of 0.8862, 0.6756 and 0.6721 for whole tumor, enhancing tumor and tumor core respectively, an overall improvement on the 0.8770, 0.65242 and 0.68134 achieved by DeepLabV3+. Our method produced final mean DSCs of 0.8673, 0.7514 and 0.7983 for whole tumor, enhancing tumor and tumor core on the challenge's testing data. We present this as a state-of-the-art 2D lesion segmentation architecture that can be used on lower-power computers than a 3D architecture. The source code and trained model for this work are openly available at https://github.com/jordan-colman/DR-Unet104.
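As an illustration of the described encoder addition, here is a minimal bottleneck residual block with dropout after the convolution stack; channel widths, pre-activation ordering, and normalization choices follow common practice rather than the released code.

```python
# A minimal 'bottleneck' residual block with dropout after the conv stack.
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    def __init__(self, channels: int, drop: float = 0.2):
        super().__init__()
        mid = channels // 4
        self.stack = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, mid, 1),        # reduce width
            nn.BatchNorm2d(mid), nn.ReLU(),
            nn.Conv2d(mid, mid, 3, padding=1),  # 3x3 processing
            nn.BatchNorm2d(mid), nn.ReLU(),
            nn.Conv2d(mid, channels, 1),        # restore width
            nn.Dropout2d(drop),                 # small rate (0.2) worked best
        )

    def forward(self, x):
        return x + self.stack(x)                # residual connection

y = BottleneckBlock(64)(torch.randn(1, 64, 32, 32))
```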
Human-robot interactions (HRI) can be modeled as dynamic or differential games with incomplete information, where each agent holds private reward parameters. Due to the open challenge of finding perfect Bayesian equilibria of such games, existing studies often consider approximate solutions composed of parameter estimation and motion planning steps, in order to decouple the belief and physical dynamics. In parameter estimation, current approaches often assume that the robot's reward parameters are known to the humans. We argue that by falsely conditioning on this assumption, the robot performs a non-empathetic estimation of the humans' parameters, leading to undesirable values even in the simplest interactions. We test this argument by studying a two-vehicle uncontrolled-intersection case with short reaction times. Results show that when both agents are unknowingly aggressive (or non-aggressive), empathy leads to more effective parameter estimation and higher reward values, suggesting that empathy is necessary when agents' true parameters mismatch their common belief. By fully acknowledging the nature of information asymmetry in HRI, the proposed estimation and planning algorithms are therefore more robust than existing approaches. Lastly, we introduce value approximation techniques for real-time execution of the proposed algorithms.
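The parameter-estimation step can be illustrated as a Bayesian update over a discrete aggressiveness parameter, as sketched below; the two-type grid and the likelihood values are invented for illustration. An empathetic robot would additionally model the human's simultaneous estimation of the robot's parameters, a coupling this sketch omits.

```python
# A minimal Bayesian update over a discrete driver-type parameter.
import numpy as np

def update_belief(belief, observed_action, likelihood):
    """belief: P(theta) over candidate reward parameters.
    likelihood[theta, action]: P(action | theta) from the motion model."""
    posterior = belief * likelihood[:, observed_action]
    return posterior / posterior.sum()

# Two candidate types (non-aggressive, aggressive) and two actions
# (yield=0, go=1); an aggressive driver is assumed more likely to go.
lik = np.array([[0.8, 0.2],
                [0.3, 0.7]])
belief = np.array([0.5, 0.5])          # common prior over the other agent
belief = update_belief(belief, 1, lik) # observed "go" -> belief shifts
print(belief)                          # ~[0.22, 0.78]
```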
Cloud computing is ubiquitous: more and more companies are moving their workloads into the Cloud. However, this rise in popularity challenges Cloud service providers, as they need to monitor the quality of their ever-growing offerings effectively. To address this challenge, we designed and implemented an automated monitoring system for the IBM Cloud Platform. This monitoring system utilizes deep learning neural networks to detect anomalies in near-real-time across multiple Platform components simultaneously. After running the system for a year, we observed that the proposed solution frees the DevOps team's time and human resources from manually monitoring thousands of Cloud components. Moreover, it increases customer satisfaction by reducing the risk of Cloud outages. In this paper, we share our solution's architecture, implementation notes, and best practices that emerged while evolving the monitoring system. They can be leveraged by other researchers and practitioners to build anomaly detectors for complex systems.
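As one common realization of such a detector, the sketch below shows the inference pattern of an autoencoder-based anomaly detector (training omitted): high reconstruction error on a telemetry window flags an anomaly. The architecture, metric count, and threshold are assumptions; the production system is not described at this level of detail.

```python
# A minimal autoencoder-based anomaly detector for component telemetry.
import torch
import torch.nn as nn

class TelemetryAE(nn.Module):
    def __init__(self, n_metrics: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_metrics, 8), nn.ReLU())
        self.decoder = nn.Linear(8, n_metrics)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = TelemetryAE()                        # assume trained on normal telemetry
x = torch.randn(1, 32)                       # one window of component metrics
error = nn.functional.mse_loss(model(x), x)  # reconstruction error
is_anomaly = error.item() > 0.5              # threshold tuned on held-out data
```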