Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Nicu Sebe

Joint Noise-Tolerant Learning and Meta Camera Shift Adaptation for Unsupervised Person Re-Identification

Mar 08, 2021
Fengxiang Yang, Zhun Zhong, Zhiming Luo, Yuanzheng Cai, Yaojin Lin, Shaozi Li, Nicu Sebe

Figure 1 for Joint Noise-Tolerant Learning and Meta Camera Shift Adaptation for Unsupervised Person Re-Identification

Figure 2 for Joint Noise-Tolerant Learning and Meta Camera Shift Adaptation for Unsupervised Person Re-Identification

Figure 3 for Joint Noise-Tolerant Learning and Meta Camera Shift Adaptation for Unsupervised Person Re-Identification

Figure 4 for Joint Noise-Tolerant Learning and Meta Camera Shift Adaptation for Unsupervised Person Re-Identification

This paper considers the problem of unsupervised person re-identification (re-ID), which aims to learn discriminative models with unlabeled data. One popular method is to obtain pseudo-label by clustering and use them to optimize the model. Although this kind of approach has shown promising accuracy, it is hampered by 1) noisy labels produced by clustering and 2) feature variations caused by camera shift. The former will lead to incorrect optimization and thus hinders the model accuracy. The latter will result in assigning the intra-class samples of different cameras to different pseudo-label, making the model sensitive to camera variations. In this paper, we propose a unified framework to solve both problems. Concretely, we propose a Dynamic and Symmetric Cross-Entropy loss (DSCE) to deal with noisy samples and a camera-aware meta-learning algorithm (MetaCam) to adapt camera shift. DSCE can alleviate the negative effects of noisy samples and accommodate the change of clusters after each clustering step. MetaCam simulates cross-camera constraint by splitting the training data into meta-train and meta-test based on camera IDs. With the interacted gradient from meta-train and meta-test, the model is enforced to learn camera-invariant features. Extensive experiments on three re-ID benchmarks show the effectiveness and the complementary of the proposed DSCE and MetaCam. Our method outperforms the state-of-the-art methods on both fully unsupervised re-ID and unsupervised domain adaptive re-ID.

* To appear in CVPR 2021

Via

Access Paper or Ask Questions

Curriculum Learning: A Survey

Jan 25, 2021
Petru Soviany, Radu Tudor Ionescu, Paolo Rota, Nicu Sebe

Figure 1 for Curriculum Learning: A Survey

Figure 2 for Curriculum Learning: A Survey

Figure 3 for Curriculum Learning: A Survey

Training machine learning models in a meaningful order, from the easy samples to the hard ones, using curriculum learning can provide performance improvements over the standard training approach based on random data shuffling, without any additional computational costs. Curriculum learning strategies have been successfully employed in all areas of machine learning, in a wide range of tasks. However, the necessity of finding a way to rank the samples from easy to hard, as well as the right pacing function for introducing more difficult data can limit the usage of the curriculum approaches. In this survey, we show how these limits have been tackled in the literature, and we present different curriculum learning instantiations for various tasks in machine learning. We construct a multi-perspective taxonomy of curriculum learning approaches by hand, considering various classification criteria. We further build a hierarchical tree of curriculum learning methods using an agglomerative clustering algorithm, linking the discovered clusters with our taxonomy. At the end, we provide some interesting directions for future work.

Via

Access Paper or Ask Questions

Probabilistic Graph Attention Network with Conditional Kernels for Pixel-Wise Prediction

Jan 08, 2021
Dan Xu, Xavier Alameda-Pineda, Wanli Ouyang, Elisa Ricci, Xiaogang Wang, Nicu Sebe

Figure 1 for Probabilistic Graph Attention Network with Conditional Kernels for Pixel-Wise Prediction

Figure 2 for Probabilistic Graph Attention Network with Conditional Kernels for Pixel-Wise Prediction

Figure 3 for Probabilistic Graph Attention Network with Conditional Kernels for Pixel-Wise Prediction

Figure 4 for Probabilistic Graph Attention Network with Conditional Kernels for Pixel-Wise Prediction

Multi-scale representations deeply learned via convolutional neural networks have shown tremendous importance for various pixel-level prediction problems. In this paper we present a novel approach that advances the state of the art on pixel-level prediction in a fundamental aspect, i.e. structured multi-scale features learning and fusion. In contrast to previous works directly considering multi-scale feature maps obtained from the inner layers of a primary CNN architecture, and simply fusing the features with weighted averaging or concatenation, we propose a probabilistic graph attention network structure based on a novel Attention-Gated Conditional Random Fields (AG-CRFs) model for learning and fusing multi-scale representations in a principled manner. In order to further improve the learning capacity of the network structure, we propose to exploit feature dependant conditional kernels within the deep probabilistic framework. Extensive experiments are conducted on four publicly available datasets (i.e. BSDS500, NYUD-V2, KITTI, and Pascal-Context) and on three challenging pixel-wise prediction problems involving both discrete and continuous labels (i.e. monocular depth estimation, object contour prediction, and semantic segmentation). Quantitative and qualitative results demonstrate the effectiveness of the proposed latent AG-CRF model and the overall probabilistic graph attention network with feature conditional kernels for structured feature learning and pixel-wise prediction.

* Regular paper accepted at TPAMI 2020. arXiv admin note: text overlap with arXiv:1801.00524

Via

Access Paper or Ask Questions

Learning to Generalize Unseen Domains via Memory-based Multi-Source Meta-Learning for Person Re-Identification

Dec 01, 2020
Yuyang Zhao, Zhun Zhong, Fengxiang Yang, Zhiming Luo, Yaojin Lin, Shaozi Li, Nicu Sebe

Figure 1 for Learning to Generalize Unseen Domains via Memory-based Multi-Source Meta-Learning for Person Re-Identification

Figure 2 for Learning to Generalize Unseen Domains via Memory-based Multi-Source Meta-Learning for Person Re-Identification

Figure 3 for Learning to Generalize Unseen Domains via Memory-based Multi-Source Meta-Learning for Person Re-Identification

Figure 4 for Learning to Generalize Unseen Domains via Memory-based Multi-Source Meta-Learning for Person Re-Identification

Recent advances in person re-identification (ReID) obtain impressive accuracy in the supervised and unsupervised learning settings. However, most of the existing methods need to train a new model for a new domain by accessing data. Due to public privacy, the new domain data are not always accessible, leading to a limited applicability of these methods. In this paper, we study the problem of multi-source domain generalization in ReID, which aims to learn a model that can perform well on unseen domains with only several labeled source domains. To address this problem, we propose the Memory-based Multi-Source Meta-Learning (M$^3$L) framework to train a generalizable model for unseen domains. Specifically, a meta-learning strategy is introduced to simulate the train-test process of domain generalization for learning more generalizable models. To overcome the unstable meta-optimization caused by the parametric classifier, we propose a memory-based identification loss that is non-parametric and harmonizes with meta-learning. We also present a meta batch normalization layer (MetaBN) to diversify meta-test features, further establishing the advantage of meta-learning. Experiments demonstrate that our M$^3$L can effectively enhance the generalization ability of the model for unseen domains and can outperform the state-of-the-art methods on four large-scale ReID datasets.

Via

Access Paper or Ask Questions

Deep traffic light detection by overlaying synthetic context on arbitrary natural images

Nov 10, 2020
Jean Pablo Vieira de Mello, Lucas Tabelini, Rodrigo F. Berriel, Thiago M. Paixão, Alberto F. de Souza, Claudine Badue, Nicu Sebe, Thiago Oliveira-Santos

Figure 1 for Deep traffic light detection by overlaying synthetic context on arbitrary natural images

Figure 2 for Deep traffic light detection by overlaying synthetic context on arbitrary natural images

Figure 3 for Deep traffic light detection by overlaying synthetic context on arbitrary natural images

Figure 4 for Deep traffic light detection by overlaying synthetic context on arbitrary natural images

Deep neural networks come as an effective solution to many problems associated with autonomous driving. By providing real image samples with traffic context to the network, the model learns to detect and classify elements of interest, such as pedestrians, traffic signs, and traffic lights. However, acquiring and annotating real data can be extremely costly in terms of time and effort. In this context, we propose a method to generate artificial traffic-related training data for deep traffic light detectors. This data is generated using basic non-realistic computer graphics to blend fake traffic scenes on top of arbitrary image backgrounds that are not related to the traffic domain. Thus, a large amount of training data can be generated without annotation efforts. Furthermore, it also tackles the intrinsic data imbalance problem in traffic light datasets, caused mainly by the low amount of samples of the yellow state. Experiments show that it is possible to achieve results comparable to those obtained with real training data from the problem domain, yielding an average mAP and an average F1-score which are each nearly 4 p.p. higher than the respective metrics obtained with a real-world reference model.

* Computers & Graphics (2020)

Via

Access Paper or Ask Questions

SF-UDA$^{3D}$: Source-Free Unsupervised Domain Adaptation for LiDAR-Based 3D Object Detection

Oct 19, 2020
Cristiano Saltori, Stéphane Lathuiliére, Nicu Sebe, Elisa Ricci, Fabio Galasso

$Figure 1 for SF-UDA$^{3D}$: Source-Free Unsupervised Domain Adaptation for LiDAR-Based 3D Object Detection$

$Figure 2 for SF-UDA$^{3D}$: Source-Free Unsupervised Domain Adaptation for LiDAR-Based 3D Object Detection$

$Figure 3 for SF-UDA$^{3D}$: Source-Free Unsupervised Domain Adaptation for LiDAR-Based 3D Object Detection$

$Figure 4 for SF-UDA$^{3D}$: Source-Free Unsupervised Domain Adaptation for LiDAR-Based 3D Object Detection$

3D object detectors based only on LiDAR point clouds hold the state-of-the-art on modern street-view benchmarks. However, LiDAR-based detectors poorly generalize across domains due to domain shift. In the case of LiDAR, in fact, domain shift is not only due to changes in the environment and in the object appearances, as for visual data from RGB cameras, but is also related to the geometry of the point clouds (e.g., point density variations). This paper proposes SF-UDA$^{3D}$, the first Source-Free Unsupervised Domain Adaptation (SF-UDA) framework to domain-adapt the state-of-the-art PointRCNN 3D detector to target domains for which we have no annotations (unsupervised), neither we hold images nor annotations of the source domain (source-free). SF-UDA$^{3D}$ is novel on both aspects. Our approach is based on pseudo-annotations, reversible scale-transformations and motion coherency. SF-UDA$^{3D}$ outperforms both previous domain adaptation techniques based on features alignment and state-of-the-art 3D object detection methods which additionally use few-shot target annotations or target annotation statistics. This is demonstrated by extensive experiments on two large-scale datasets, i.e., KITTI and nuScenes.

* Accepted paper at 3DV 2020

Via

Access Paper or Ask Questions

Latent World Models For Intrinsically Motivated Exploration

Oct 05, 2020
Aleksandr Ermolov, Nicu Sebe

Figure 1 for Latent World Models For Intrinsically Motivated Exploration

Figure 2 for Latent World Models For Intrinsically Motivated Exploration

Figure 3 for Latent World Models For Intrinsically Motivated Exploration

Figure 4 for Latent World Models For Intrinsically Motivated Exploration

In this work we consider partially observable environments with sparse rewards. We present a self-supervised representation learning method for image-based observations, which arranges embeddings respecting temporal distance of observations. This representation is empirically robust to stochasticity and suitable for novelty detection from the error of a predictive forward model. We consider episodic and life-long uncertainties to guide the exploration. We propose to estimate the missing information about the environment with the world model, which operates in the learned latent space. As a motivation of the method, we analyse the exploration problem in a tabular Partially Observable Labyrinth. We demonstrate the method on image-based hard exploration environments from the Atari benchmark and report significant improvement with respect to prior work. The source code of the method and all the experiments is available at https://github.com/htdt/lwm.

* NeurIPS 2020 Spotlight; Code is publicly available at https://github.com/htdt/lwm

Via

Access Paper or Ask Questions

Dual Attention GANs for Semantic Image Synthesis

Aug 29, 2020
Hao Tang, Song Bai, Nicu Sebe

Figure 1 for Dual Attention GANs for Semantic Image Synthesis

Figure 2 for Dual Attention GANs for Semantic Image Synthesis

Figure 3 for Dual Attention GANs for Semantic Image Synthesis

Figure 4 for Dual Attention GANs for Semantic Image Synthesis

In this paper, we focus on the semantic image synthesis task that aims at transferring semantic label maps to photo-realistic images. Existing methods lack effective semantic constraints to preserve the semantic information and ignore the structural correlations in both spatial and channel dimensions, leading to unsatisfactory blurry and artifact-prone results. To address these limitations, we propose a novel Dual Attention GAN (DAGAN) to synthesize photo-realistic and semantically-consistent images with fine details from the input layouts without imposing extra training overhead or modifying the network architectures of existing methods. We also propose two novel modules, i.e., position-wise Spatial Attention Module (SAM) and scale-wise Channel Attention Module (CAM), to capture semantic structure attention in spatial and channel dimensions, respectively. Specifically, SAM selectively correlates the pixels at each position by a spatial attention map, leading to pixels with the same semantic label being related to each other regardless of their spatial distances. Meanwhile, CAM selectively emphasizes the scale-wise features at each channel by a channel attention map, which integrates associated features among all channel maps regardless of their scales. We finally sum the outputs of SAM and CAM to further improve feature representation. Extensive experiments on four challenging datasets show that DAGAN achieves remarkably better results than state-of-the-art methods, while using fewer model parameters. The source code and trained models are available at https://github.com/Ha0Tang/DAGAN.

* Accepted to ACM MM 2020, camera ready (9 pages) + supplementary (10 pages)

Via

Access Paper or Ask Questions

Bipartite Graph Reasoning GANs for Person Image Generation

Aug 20, 2020
Hao Tang, Song Bai, Philip H. S. Torr, Nicu Sebe

Figure 1 for Bipartite Graph Reasoning GANs for Person Image Generation

Figure 2 for Bipartite Graph Reasoning GANs for Person Image Generation

Figure 3 for Bipartite Graph Reasoning GANs for Person Image Generation

Figure 4 for Bipartite Graph Reasoning GANs for Person Image Generation

We present a novel Bipartite Graph Reasoning GAN (BiGraphGAN) for the challenging person image generation task. The proposed graph generator mainly consists of two novel blocks that aim to model the pose-to-pose and pose-to-image relations, respectively. Specifically, the proposed Bipartite Graph Reasoning (BGR) block aims to reason the crossing long-range relations between the source pose and the target pose in a bipartite graph, which mitigates some challenges caused by pose deformation. Moreover, we propose a new Interaction-and-Aggregation (IA) block to effectively update and enhance the feature representation capability of both person's shape and appearance in an interactive way. Experiments on two challenging and public datasets, i.e., Market-1501 and DeepFashion, show the effectiveness of the proposed BiGraphGAN in terms of objective quantitative scores and subjective visual realness. The source code and trained models are available at https://github.com/Ha0Tang/BiGraphGAN.

* 13 pages, 6 figures, accepted to BMVC 2020 as an oral paper, fix typos

Via

Access Paper or Ask Questions

DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis

Aug 13, 2020
Ming Tao, Hao Tang, Songsong Wu, Nicu Sebe, Fei Wu, Xiao-Yuan Jing

Figure 1 for DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis

Figure 2 for DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis

Figure 3 for DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis

Figure 4 for DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis

Synthesizing high-resolution realistic images from text descriptions is a challenging task. Almost all existing text-to-image methods employ stacked generative adversarial networks as the backbone, utilize cross-modal attention mechanisms to fuse text and image features, and use extra networks to ensure text-image semantic consistency. The existing text-to-image models have three problems: 1) For the backbone, there are multiple generators and discriminators stacked for generating different scales of images making the training process slow and inefficient. 2) For semantic consistency, the existing models employ extra networks to ensure the semantic consistency increasing the training complexity and bringing an additional computational cost. 3) For the text-image feature fusion method, cross-modal attention is only applied a few times during the generation process due to its computational cost impeding fusing the text and image features deeply. To solve these limitations, we propose 1) a novel simplified text-to-image backbone which is able to synthesize high-quality images directly by one pair of generator and discriminator, 2) a novel regularization method called Matching-Aware zero-centered Gradient Penalty which promotes the generator to synthesize more realistic and text-image semantic consistent images without introducing extra networks, 3) a novel fusion module called Deep Text-Image Fusion Block which can exploit the semantics of text descriptions effectively and fuse text and image features deeply during the generation process. Compared with the previous text-to-image models, our DF-GAN is simpler and more efficient and achieves better performance. Extensive experiments and ablation studies on both Caltech-UCSD Birds 200 and COCO datasets demonstrate the superiority of the proposed model in comparison to state-of-the-art models.

Via

Access Paper or Ask Questions