Nuno Vasconcelos

IMAGINE: Image Synthesis by Image-Guided Model Inversion

Apr 13, 2021

Pei Wang, Yijun Li, Krishna Kumar Singh, Jingwan Lu, Nuno Vasconcelos

Abstract: We introduce an inversion-based method, denoted as IMAge-Guided model INvErsion (IMAGINE), to generate high-quality and diverse images from only a single training sample. We leverage the knowledge of image semantics from a pre-trained classifier to achieve plausible generations by matching multi-level feature representations in the classifier, combined with adversarial training with an external discriminator. IMAGINE enables the synthesis procedure to simultaneously 1) enforce semantic specificity constraints during the synthesis, 2) produce realistic images without generator training, and 3) give users intuitive control over the generation process. With extensive experimental results, we demonstrate qualitatively and quantitatively that IMAGINE performs favorably against state-of-the-art GAN-based and inversion-based methods, across three different image domains (i.e., objects, scenes, and textures).

* Published in CVPR 2021

Rethinking and Improving the Robustness of Image Style Transfer

Apr 08, 2021

Pei Wang, Yijun Li, Nuno Vasconcelos

Abstract: Extensive research in neural style transfer methods has shown that the correlation between features extracted by a pre-trained VGG network has a remarkable ability to capture the visual style of an image. Surprisingly, however, this stylization quality is not robust and often degrades significantly when applied to features from more advanced and lightweight networks, such as those in the ResNet family. By performing extensive experiments with different network architectures, we find that residual connections, which represent the main architectural difference between VGG and ResNet, produce feature maps of small entropy, which are not suitable for style transfer. To improve the robustness of the ResNet architecture, we then propose a simple yet effective solution based on a softmax transformation of the feature activations that enhances their entropy. Experimental results demonstrate that this small magic can greatly improve the quality of stylization results, even for networks with random weights. This suggests that the architecture used for feature extraction is more important than the use of learned weights for the task of style transfer.

* Published in CVPR 2021 (Oral)
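
The abstract above attributes poor stylization to low-entropy feature maps and proposes a softmax transformation to raise their entropy. The toy sketch below only illustrates that effect on a single list of numbers. The activation values are made up, and the way entropy is measured here (normalizing activations into a distribution) is an assumption; the abstract does not say how the paper measures entropy or where the softmax is applied. The effect also depends on the scale of the activations.

```python
import math

def normalize(activations):
    """Scale non-negative activations so they sum to 1."""
    total = sum(activations)
    return [a / total for a in activations]

def softmax(activations):
    """Numerically stable softmax over a list of activations."""
    peak = max(activations)
    exps = [math.exp(a - peak) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Made-up, peaky, non-negative activations (hypothetical values).
activations = [0.0, 0.0, 3.0, 0.1, 0.0, 0.2]

entropy_raw = entropy(normalize(activations))
entropy_softmax = entropy(softmax(activations))
print(entropy_raw, entropy_softmax)
```

For these values the softmax output is flatter than the plainly normalized activations, so its entropy is higher. With much larger peaks the softmax itself becomes peaky, which is why this is a toy illustration rather than a statement about the paper's procedure.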

Robust Audio-Visual Instance Discrimination

Mar 29, 2021

Pedro Morgado, Ishan Misra, Nuno Vasconcelos

Abstract: We present a self-supervised learning method to learn audio and video representations. Prior work uses the natural correspondence between audio and video to define a standard cross-modal instance discrimination task, where a model is trained to match representations from the two modalities. However, the standard approach introduces two sources of training noise. First, audio-visual correspondences often produce faulty positives since the audio and video signals can be uninformative of each other. To limit the detrimental impact of faulty positives, we optimize a weighted contrastive learning loss, which down-weights their contribution to the overall loss. Second, since self-supervised contrastive learning relies on random sampling of negative instances, instances that are semantically similar to the base instance can be used as faulty negatives. To alleviate the impact of faulty negatives, we propose to optimize an instance discrimination loss with a soft target distribution that estimates relationships between instances. We validate our contributions through extensive experiments on action recognition tasks and show that they address the problems of audio-visual instance discrimination and improve transfer learning performance.
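
The weighted contrastive loss described in the abstract above can be sketched as a per-instance InfoNCE-style loss averaged with instance weights. This is a minimal sketch under stated assumptions: the similarity matrix and the weights are hand-set, the temperature is arbitrary, and the abstract does not say how the paper estimates the weights.

```python
import math

def instance_loss(similarities, positive_index, temperature=0.1):
    """Cross-entropy of picking the positive among all candidates."""
    logits = [s / temperature for s in similarities]
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[positive_index]

def weighted_contrastive_loss(similarity_matrix, weights, temperature=0.1):
    """Weighted mean of per-instance losses; row i's positive is column i."""
    losses = [instance_loss(row, i, temperature)
              for i, row in enumerate(similarity_matrix)]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)

# Hypothetical audio-to-video similarities. Row 2 has a weak diagonal entry,
# standing in for a faulty positive (audio uninformative of the video).
similarity_matrix = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.3, 0.4, 0.1],
]

uniform = weighted_contrastive_loss(similarity_matrix, [1.0, 1.0, 1.0])
down_weighted = weighted_contrastive_loss(similarity_matrix, [1.0, 1.0, 0.1])
print(uniform, down_weighted)
```

Giving the suspect pair a small weight shrinks its share of the overall loss, so the model is pushed less strongly to match a pair that may not actually correspond.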

Dynamic Transfer for Multi-Source Domain Adaptation

Mar 19, 2021

Yunsheng Li, Lu Yuan, Yinpeng Chen, Pei Wang, Nuno Vasconcelos

Abstract: Recent work on multi-source domain adaptation focuses on learning a domain-agnostic model whose parameters are static. However, such a static model struggles to handle conflicts across multiple domains and suffers performance degradation in both the source domains and the target domain. In this paper, we present dynamic transfer to address domain conflicts, where the model parameters are adapted to samples. The key insight is that adapting the model across domains is achieved by adapting the model across samples. Thus, it breaks down source domain barriers and turns multi-source domains into a single-source domain. This also simplifies the alignment between source and target domains, as it only requires the target domain to be aligned with any part of the union of source domains. Furthermore, we find that dynamic transfer can be simply modeled by aggregating residual matrices and a static convolution matrix. Experimental results show that, without using domain labels, our dynamic transfer outperforms the state-of-the-art method by more than 3% on the large multi-source domain adaptation dataset DomainNet. Source code is at https://github.com/liyunsheng13/DRT.

* Accepted by CVPR 2021
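
The abstract above says dynamic transfer can be modeled by aggregating residual matrices with a static convolution matrix. The sketch below shows one way such sample-dependent weights could be formed. The softmax gate, the helper names, and all numbers are assumptions for illustration and are not taken from the linked repository.

```python
import math

def softmax(xs):
    """Numerically stable softmax."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def dynamic_matrix(sample, static_w, residuals, gate_w):
    """Static matrix plus a sample-dependent mix of residual matrices."""
    coeffs = softmax(matvec(gate_w, sample))  # hypothetical gating function
    rows, cols = len(static_w), len(static_w[0])
    return [
        [static_w[r][c] + sum(a * res[r][c] for a, res in zip(coeffs, residuals))
         for c in range(cols)]
        for r in range(rows)
    ]

static_w = [[1.0, 0.0], [0.0, 1.0]]
residuals = [
    [[0.5, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, -0.5]],
]
gate_w = [[2.0, 0.0], [0.0, 2.0]]  # one gate row per residual matrix

w_a = dynamic_matrix([1.0, 0.0], static_w, residuals, gate_w)
w_b = dynamic_matrix([0.0, 1.0], static_w, residuals, gate_w)
print(w_a)
print(w_b)
```

Two different samples yield two different effective weight matrices, which is the sense in which the parameters adapt per sample rather than per domain.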

Revisiting Dynamic Convolution via Matrix Decomposition

Mar 15, 2021

Yunsheng Li, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Ye Yu, Lu Yuan, Zicheng Liu, Mei Chen, Nuno Vasconcelos

Abstract: Recent research in dynamic convolution shows a substantial performance boost for efficient CNNs, due to the adaptive aggregation of K static convolution kernels. It has two limitations: (a) it increases the number of convolutional weights K times, and (b) the joint optimization of dynamic attention and static convolution kernels is challenging. In this paper, we revisit it from a new perspective of matrix decomposition and reveal that the key issue is that dynamic convolution applies dynamic attention over channel groups after projecting into a higher dimensional latent space. To address this issue, we propose dynamic channel fusion to replace dynamic attention over channel groups. Dynamic channel fusion not only enables significant dimension reduction of the latent space, but also mitigates the joint optimization difficulty. As a result, our method is easier to train and requires significantly fewer parameters without sacrificing accuracy. Source code is at https://github.com/liyunsheng13/dcd.

* Accepted by ICLR 2021
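
For context, the baseline that the abstract above revisits (adaptive aggregation of K static kernels) can be sketched as a weighted sum of kernels. The kernels and attention values here are hypothetical, and the sketch shows only the baseline aggregation, not the proposed dynamic channel fusion.

```python
def aggregate_kernels(kernels, attention):
    """Weighted sum of K same-shaped kernels (nested lists)."""
    rows, cols = len(kernels[0]), len(kernels[0][0])
    return [
        [sum(a * k[r][c] for a, k in zip(attention, kernels)) for c in range(cols)]
        for r in range(rows)
    ]

def parameter_count(kernels):
    """Total stored weights across all static kernels."""
    return sum(len(row) for kernel in kernels for row in kernel)

# K = 2 hypothetical static kernels, flattened to 2x2 matrices.
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
attention = [0.25, 0.75]  # would be computed from the input in practice

aggregated = aggregate_kernels(kernels, attention)
print(aggregated)
print(parameter_count(kernels))
```

The stored weight count is K times that of a single kernel, which is limitation (a) in the abstract.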

MicroNet: Towards Image Recognition with Extremely Low FLOPs

Nov 24, 2020

Yunsheng Li, Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Lu Yuan, Zicheng Liu, Lei Zhang, Nuno Vasconcelos

Abstract: In this paper, we present MicroNet, which is an efficient convolutional neural network using extremely low computational cost (e.g. 6 MFLOPs on ImageNet classification). Such a low cost network is highly desired on edge devices, yet usually suffers from a significant performance degradation. We handle the extremely low FLOPs based upon two design principles: (a) avoiding the reduction of network width by lowering the node connectivity, and (b) compensating for the reduction of network depth by introducing more complex non-linearity per layer. Firstly, we propose Micro-Factorized convolution to factorize both pointwise and depthwise convolutions into low rank matrices for a good tradeoff between the number of channels and input/output connectivity. Secondly, we propose a new activation function, named Dynamic Shift-Max, to improve the non-linearity via maxing out multiple dynamic fusions between an input feature map and its circular channel shift. The fusions are dynamic as their parameters are adapted to the input. Building upon Micro-Factorized convolution and Dynamic Shift-Max, a family of MicroNets achieves a significant performance gain over the state-of-the-art in the low FLOP regime. For instance, MicroNet-M1 achieves 61.1% top-1 accuracy on ImageNet classification with 12 MFLOPs, outperforming MobileNetV3 by 11.3%.
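
The Dynamic Shift-Max idea in the abstract above (taking a max over several fusions of a feature map with its circular channel shift) can be illustrated with a heavily simplified sketch. It treats each channel as a single number, uses one shift, and takes the fusion coefficients as fixed inputs; in the paper they are adapted to the input, and the exact formulation is not given in the abstract. All names and values below are hypothetical.

```python
def circular_shift(channels, shift):
    """Rotate the channel list so position i reads channel (i + shift) mod C."""
    count = len(channels)
    return [channels[(i + shift) % count] for i in range(count)]

def shift_max(channels, shift, coeff_pairs):
    """Per channel, max over fusions a * x_i + b * x_shifted_i."""
    shifted = circular_shift(channels, shift)
    return [
        max(a * x + b * s for a, b in coeff_pairs)
        for x, s in zip(channels, shifted)
    ]

channels = [1, -2, 3, 0]                 # hypothetical per-channel values
coeff_pairs = [(1, 0), (0.5, 0.5)]       # two fusions; fixed here for illustration

output = shift_max(channels, shift=2, coeff_pairs=coeff_pairs)
print(output)
```

With the pair (1, 0) alone the function is the identity, and with (1, 0) and (0, 0) it reduces to ReLU, so the max over learned fusions can be read as a generalization of familiar activations.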

Learning Representations from Audio-Visual Spatial Alignment

Nov 03, 2020

Pedro Morgado, Yi Li, Nuno Vasconcelos

Abstract: We introduce a novel self-supervised pretext task for learning representations from audio-visual content. Prior work on audio-visual representation learning leverages correspondences at the video level. Approaches based on audio-visual correspondence (AVC) predict whether audio and video clips originate from the same or different video instances. Audio-visual temporal synchronization (AVTS) further discriminates negative pairs originating from the same video instance but at different moments in time. While these approaches learn high-quality representations for downstream tasks such as action recognition, their training objectives disregard spatial cues naturally occurring in audio and visual signals. To learn from these spatial cues, we task a network to perform contrastive audio-visual spatial alignment of 360° video and spatial audio. The ability to perform spatial alignment is enhanced by reasoning over the full spatial content of the 360° video using a transformer architecture to combine representations from multiple viewpoints. The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks, including audio-visual correspondence, spatial alignment, action recognition, and video semantic segmentation.

* To appear at Advances in Neural Information Processing Systems (NeurIPS), 2020

Contrastive Learning with Adversarial Examples

Oct 22, 2020

Chih-Hui Ho, Nuno Vasconcelos

Abstract: Contrastive learning (CL) is a popular technique for self-supervised learning (SSL) of visual representations. It uses pairs of augmentations of unlabeled training examples to define a classification task for pretext learning of a deep embedding. Despite extensive work on augmentation procedures, prior work does not address the selection of challenging negative pairs, as images within a sampled batch are treated independently. This paper addresses the problem by introducing a new family of adversarial examples for contrastive learning and using these examples to define a new adversarial training algorithm for SSL, denoted as CLAE. When compared to standard CL, the use of adversarial examples creates more challenging positive pairs and adversarial training produces harder negative pairs by accounting for all images in a batch during the optimization. CLAE is compatible with many CL methods in the literature. Experiments show that it improves the performance of several existing CL baselines on multiple datasets.

* NeurIPS 2020
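
To make "more challenging positive pairs" concrete, the sketch below perturbs a positive example in the direction that increases a contrastive loss. It is a toy under loud assumptions: it works directly on small vectors with no encoder, uses a sign-of-gradient (FGSM-style) step with a finite-difference gradient, and picks arbitrary numbers. The abstract does not specify how CLAE constructs its adversarial examples.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(anchor, positive, negatives, temperature=0.5):
    """Cross-entropy of picking the positive among positive + negatives."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[0]

def numeric_gradient(fn, point, step=1e-5):
    """Central finite-difference gradient of fn at point."""
    grad = []
    for i in range(len(point)):
        up, down = list(point), list(point)
        up[i] += step
        down[i] -= step
        grad.append((fn(up) - fn(down)) / (2 * step))
    return grad

def sign(value):
    return (value > 0) - (value < 0)

def adversarial_positive(anchor, positive, negatives, epsilon):
    """Move the positive by epsilon per coordinate to raise the loss."""
    grad = numeric_gradient(
        lambda p: contrastive_loss(anchor, p, negatives), positive)
    return [p + epsilon * sign(g) for p, g in zip(positive, grad)]

anchor = [1.0, 0.5]
positive = [0.9, 0.6]
negatives = [[-1.0, 0.2], [0.1, -0.8]]

hard_positive = adversarial_positive(anchor, positive, negatives, epsilon=0.1)
clean_loss = contrastive_loss(anchor, positive, negatives)
hard_loss = contrastive_loss(anchor, hard_positive, negatives)
print(clean_loss, hard_loss)
```

The perturbed positive is less similar to the anchor, so the loss rises and the pair becomes harder to match.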

Deep Hashing with Hash-Consistent Large Margin Proxy Embeddings

Jul 27, 2020

Pedro Morgado, Yunsheng Li, Jose Costa Pereira, Mohammad Saberian, Nuno Vasconcelos

Abstract: Image hash codes are produced by binarizing the embeddings of convolutional neural networks (CNNs) trained for either classification or retrieval. While proxy embeddings achieve good performance on both tasks, they are non-trivial to binarize, due to a rotational ambiguity that encourages non-binary embeddings. The use of a fixed set of proxies (weights of the CNN classification layer) is proposed to eliminate this ambiguity, and a procedure to design proxy sets that are nearly optimal for both classification and hashing is introduced. The resulting hash-consistent large margin (HCLM) proxies are shown to encourage saturation of hashing units, thus guaranteeing a small binarization error, while producing highly discriminative hash codes. A semantic extension (sHCLM), aimed at improving hashing performance in a transfer scenario, is also proposed. Extensive experiments show that sHCLM embeddings achieve significant improvements over state-of-the-art hashing procedures on several small and large datasets, both within and beyond the set of training classes.

* Accepted at International Journal of Computer Vision
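
The link the abstract above draws between saturated hashing units and small binarization error can be shown with a short sketch. It assumes sign binarization to codes in {-1, +1} and a squared-error measure; both are common choices but are assumptions here, and the embedding values are made up.

```python
def binarize(embedding):
    """Sign binarization to a {-1, +1} code (zero maps to +1)."""
    return [1 if value >= 0 else -1 for value in embedding]

def binarization_error(embedding):
    """Squared distance between an embedding and its binary code."""
    return sum((v - b) ** 2 for v, b in zip(embedding, binarize(embedding)))

def hamming_distance(code_a, code_b):
    """Number of positions where two codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))

saturated = [0.98, -0.95, 0.99]     # hypothetical units close to +/-1
unsaturated = [0.10, -0.20, 0.05]   # hypothetical units close to 0

print(binarization_error(saturated), binarization_error(unsaturated))
print(hamming_distance(binarize(saturated), binarize([0.9, 0.8, 0.7])))
```

Both embeddings map to the same code, but the saturated one loses far less information when binarized, which is why encouraging saturation helps hashing.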

Solving Long-tailed Recognition with Deep Realistic Taxonomic Classifier

Jul 20, 2020

Tz-Ying Wu, Pedro Morgado, Pei Wang, Chih-Hui Ho, Nuno Vasconcelos

Abstract: Long-tail recognition tackles the natural non-uniformly distributed data in real-world scenarios. While modern classifiers perform well on populated classes, their performance degrades significantly on tail classes. Humans, however, are less affected by this since, when confronted with uncertain examples, they simply opt to provide coarser predictions. Motivated by this, a deep realistic taxonomic classifier (Deep-RTC) is proposed as a new solution to the long-tail problem, combining realism with hierarchical predictions. The model has the option to reject classifying samples at different levels of the taxonomy, once it cannot guarantee the desired performance. Deep-RTC is implemented with a stochastic tree sampling during training to simulate all possible classification conditions at finer or coarser levels and a rejection mechanism at inference time. Experiments on the long-tailed version of four datasets, CIFAR100, AWA2, ImageNet, and iNaturalist, demonstrate that the proposed approach preserves more information on all classes with different popularity levels. Deep-RTC also outperforms the state-of-the-art methods in long-tailed recognition, hierarchical classification, and learning with rejection literature using the proposed correctly predicted bits (CPB) metric.

* Accepted to ECCV 2020
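
The inference-time rejection described in the abstract above (falling back to a coarser prediction when confidence is too low) can be sketched as a top-down walk over a taxonomy. The taxonomy, the confidence scores, and the single global threshold are hypothetical; the abstract does not describe how Deep-RTC computes node confidences or sets its rejection rule.

```python
def classify(taxonomy, scores, root, threshold):
    """Descend while the best child is confident enough; else stop early."""
    node = root
    while node in taxonomy:
        best = max(taxonomy[node], key=lambda child: scores[child])
        if scores[best] < threshold:
            break  # reject the finer level and keep the coarser label
        node = best
    return node

# Hypothetical two-level taxonomy and per-node confidences.
taxonomy = {
    "animal": ["dog", "cat"],
    "dog": ["husky", "beagle"],
}
scores = {"dog": 0.90, "cat": 0.10, "husky": 0.55, "beagle": 0.45}

print(classify(taxonomy, scores, "animal", threshold=0.7))
print(classify(taxonomy, scores, "animal", threshold=0.5))
```

A stricter threshold yields the coarser label "dog", while a looser one commits to the leaf "husky", mirroring the trade-off between specificity and reliability.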
