Hao Tang

Conditioning and Sampling in Variational Diffusion Models for Speech Super-resolution

Oct 27, 2022

Chin-Yun Yu, Sung-Lin Yeh, György Fazekas, Hao Tang

Abstract: Recently, diffusion models (DMs) have been increasingly used in audio processing tasks, including speech super-resolution (SR), which aims to restore high-frequency content given low-resolution speech utterances. This is commonly achieved by conditioning the noise predictor network on the low-resolution audio. In this paper, we propose a novel sampling algorithm that communicates the information of the low-resolution audio via the reverse sampling process of DMs. The proposed method can be a drop-in replacement for the vanilla sampling process and can significantly improve the performance of existing works. Moreover, by coupling the proposed sampling method with an unconditional DM, i.e., a DM with no auxiliary inputs to its noise predictor, we can generalize it to a wide range of SR setups. We also attain state-of-the-art results on the VCTK Multi-Speaker benchmark with this novel formulation.

* Submitted to ICASSP 2023

On Compressing Sequences for Self-Supervised Speech Models

Oct 14, 2022

Yen Meng, Hsuan-Jui Chen, Jiatong Shi, Shinji Watanabe, Paola Garcia, Hung-yi Lee, Hao Tang

Abstract: Compressing self-supervised models has become increasingly necessary, as self-supervised models become larger. While previous approaches have primarily focused on compressing the model size, shortening sequences is also effective in reducing the computational cost. In this work, we study fixed-length and variable-length subsampling along the time axis in self-supervised learning. We explore how individual downstream tasks are sensitive to input frame rates. Subsampling while training self-supervised models not only improves the overall performance on downstream tasks under certain frame rates, but also brings significant speed-up in inference. Variable-length subsampling performs particularly well under low frame rates. In addition, if we have access to phonetic boundaries, we find no degradation in performance for an average frame rate as low as 10 Hz.

* Accepted to IEEE SLT 2022
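
The abstract above contrasts fixed-length and variable-length subsampling along the time axis. The following is a minimal sketch of the difference, not the authors' implementation: it assumes subsampling is done by mean pooling, and the names `mean_pool`, `fixed_subsample`, and `variable_subsample` are hypothetical. Fixed-length subsampling pools every `factor` frames; variable-length subsampling pools the frames between given segment boundaries (for example, phonetic boundaries).

```python
from typing import List, Sequence

Frame = List[float]


def mean_pool(frames: Sequence[Frame]) -> Frame:
    """Average a non-empty group of equal-length feature frames."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def fixed_subsample(frames: Sequence[Frame], factor: int) -> List[Frame]:
    """Fixed-length subsampling: pool every `factor` consecutive frames.

    Mean pooling is an assumption made for illustration only.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return [mean_pool(frames[i:i + factor]) for i in range(0, len(frames), factor)]


def variable_subsample(frames: Sequence[Frame], boundaries: Sequence[int]) -> List[Frame]:
    """Variable-length subsampling: pool each segment between boundaries.

    `boundaries` holds sorted frame indices where a new segment starts.
    """
    edges = [0, *boundaries, len(frames)]
    return [mean_pool(frames[s:e]) for s, e in zip(edges, edges[1:]) if e > s]


if __name__ == "__main__":
    feats = [[0.0], [2.0], [4.0], [6.0], [8.0]]
    print(fixed_subsample(feats, 2))          # pools frames (0,1), (2,3), (4)
    print(variable_subsample(feats, [1, 4]))  # pools frames (0), (1,2,3), (4)
```

With a fixed factor the output frame rate is the input rate divided by `factor`, whereas with boundaries the average frame rate depends on how many segments the utterance contains.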

SiNeRF: Sinusoidal Neural Radiance Fields for Joint Pose Estimation and Scene Reconstruction

Oct 10, 2022

Yitong Xia, Hao Tang, Radu Timofte, Luc Van Gool

Abstract: NeRFmm is a Neural Radiance Fields (NeRF) method that deals with joint optimization tasks, i.e., reconstructing real-world scenes and registering camera parameters simultaneously. Although NeRFmm produces precise scene synthesis and pose estimations, it still struggles to outperform the fully annotated baseline on challenging scenes. In this work, we identify a systematic sub-optimality in joint optimization and further identify multiple potential sources for it. To diminish the impact of these potential sources, we propose Sinusoidal Neural Radiance Fields (SiNeRF), which leverage sinusoidal activations for radiance mapping and a novel Mixed Region Sampling (MRS) for selecting ray batches efficiently. Quantitative and qualitative results show that, compared to NeRFmm, SiNeRF achieves comprehensive and significant improvements in image synthesis quality and pose estimation accuracy. Code is available at https://github.com/yitongx/sinerf.

* Accepted to BMVC 2022; not yet published
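
The abstract above mentions sinusoidal activations for radiance mapping. As a rough illustration of what a sine-activated fully connected layer computes, here is a minimal sketch in plain Python. It is not taken from the SiNeRF code: the function name `sine_layer` and the frequency scale `omega` are hypothetical, and the actual network layout and initialization are not described in the abstract.

```python
import math
from typing import List, Sequence


def sine_layer(
    x: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
    omega: float = 1.0,
) -> List[float]:
    """Apply y_j = sin(omega * (sum_i W[j][i] * x[i] + b[j])).

    `omega` is an assumed frequency scale, included for illustration only.
    """
    return [
        math.sin(omega * (sum(w * xi for w, xi in zip(row, x)) + b))
        for row, b in zip(weights, bias)
    ]


if __name__ == "__main__":
    # Two input coordinates mapped to three sine-activated features.
    w = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    b = [0.0, 0.0, 0.0]
    print(sine_layer([0.3, -0.2], w, b))
```

Unlike ReLU, the sine activation is smooth and bounded, so every output of the layer lies in [-1, 1].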

Boosting Few-shot Fine-grained Recognition with Background Suppression and Foreground Alignment

Oct 04, 2022

Zican Zha, Hao Tang, Yunlian Sun, Jinhui Tang

Abstract: Few-shot fine-grained recognition (FS-FGR) aims to recognize novel fine-grained categories with the help of limited available samples. Undoubtedly, this task inherits the main challenges from both few-shot learning and fine-grained recognition. First, the lack of labeled samples makes the learned model prone to overfitting. Second, it also suffers from high intra-class variance and low inter-class difference in the datasets. To address this challenging task, we propose a two-stage background suppression and foreground alignment framework, which is composed of a background activation suppression (BAS) module, a foreground object alignment (FOA) module, and a local to local (L2L) similarity metric. Specifically, the BAS is introduced to generate a foreground mask for localization to weaken background disturbance and enhance dominant foreground objects. Moreover, considering the lack of labeled samples, we compute the pairwise similarity of feature maps using both the raw image and the refined image. The FOA then reconstructs the feature map of each support sample according to its correction to the query ones, which addresses the problem of misalignment between support-query image pairs. To enable the proposed method to capture subtle differences in confused samples, we present a novel L2L similarity metric to further measure the local similarity between a pair of aligned spatial features in the embedding space. Extensive experiments conducted on multiple popular fine-grained benchmarks demonstrate that our method outperforms the existing state-of-the-art by a large margin.

* Preprint under review in TCSVT Journal

Physical Adversarial Attack meets Computer Vision: A Decade Survey

Sep 30, 2022

Hui Wei, Hao Tang, Xuemei Jia, Hanxun Yu, Zhubo Li, Zhixiang Wang, Shin'ichi Satoh, Zheng Wang

Abstract: Although Deep Neural Networks (DNNs) have achieved impressive results in computer vision, their exposed vulnerability to adversarial attacks remains a serious concern. A series of works has shown that by adding elaborate perturbations to images, DNNs can suffer catastrophic degradation in performance metrics. This phenomenon exists not only in the digital space but also in the physical space. Therefore, estimating the security of these DNN-based systems is critical for safely deploying them in the real world, especially for security-critical applications, e.g., autonomous cars, video surveillance, and medical diagnosis. In this paper, we focus on physical adversarial attacks and provide a comprehensive survey of over 150 existing papers. We first clarify the concept of the physical adversarial attack and analyze its characteristics. Then, we define the adversarial medium, which is essential for performing attacks in the physical world. Next, we present the physical adversarial attack methods in task order: classification, detection, and re-identification, and introduce their performance in solving the trilemma: effectiveness, stealthiness, and robustness. In the end, we discuss the current challenges and potential future directions.

* 32 pages. arXiv admin note: text overlap with arXiv:2207.04718, arXiv:2011.13375 by other authors

PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation

Sep 16, 2022

Haoyu Ma, Zhe Wang, Yifei Chen, Deying Kong, Liangjian Chen, Xingwei Liu, Xiangyi Yan, Hao Tang, Xiaohui Xie

Abstract: Recently, the vision transformer and its variants have played an increasingly important role in both monocular and multi-view human pose estimation. Considering image patches as tokens, transformers can model the global dependencies within the entire image or across images from other views. However, global attention is computationally expensive. As a consequence, it is difficult to scale up these transformer-based methods to high-resolution features and many views. In this paper, we propose the token-Pruned Pose Transformer (PPT) for 2D human pose estimation, which locates a rough human mask and performs self-attention only within selected tokens. Furthermore, we extend our PPT to multi-view human pose estimation. Built upon PPT, we propose a new cross-view fusion strategy, called human area fusion, which considers all human foreground pixels as corresponding candidates. Experimental results on COCO and MPII demonstrate that our PPT can match the accuracy of previous pose transformer methods while reducing the computation. Moreover, experiments on Human 3.6M and Ski-Pose demonstrate that our Multi-view PPT can efficiently fuse cues from multiple views and achieve new state-of-the-art results.

* ECCV 2022. Code is available at https://github.com/HowieMa/PPT
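
The abstract above describes performing self-attention only within selected tokens. The sketch below illustrates that general idea in plain Python; it is not the PPT implementation. It assumes each token already has a foreground score, keeps the top-scoring tokens, and runs an unparameterised scaled dot-product attention (queries, keys, and values are the tokens themselves). The names `select_tokens`, `self_attention`, and `pruned_attention` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def select_tokens(scores: Sequence[float], keep: int) -> List[int]:
    """Return indices of the `keep` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def self_attention(tokens: Sequence[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with Q = K = V = tokens (no learned weights)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        logits = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in tokens]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) / total for d in range(dim)]
        )
    return outputs


def pruned_attention(
    tokens: Sequence[Vector], scores: Sequence[float], keep: int
) -> Tuple[List[int], List[Vector]]:
    """Attend only among the kept tokens; cost scales with keep**2, not len(tokens)**2."""
    kept = select_tokens(scores, keep)
    return kept, self_attention([tokens[i] for i in kept])


if __name__ == "__main__":
    toks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    kept, attended = pruned_attention(toks, scores=[0.9, 0.1, 0.8, 0.2], keep=2)
    print(kept, attended)
```

In this toy setting, attention over all four tokens needs 16 pairwise scores, while attention over the two kept tokens needs only 4, which is the kind of saving the abstract refers to.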

Facial Expression Translation using Landmark Guided GANs

Sep 05, 2022

Hao Tang, Nicu Sebe

Abstract: We propose a simple yet powerful Landmark guided Generative Adversarial Network (LandmarkGAN) for facial expression-to-expression translation using a single image, which is an important and challenging task in computer vision since expression-to-expression translation is a non-linear and non-aligned problem. Moreover, it requires a high-level semantic understanding between the input and output images since the objects in images can have arbitrary poses, sizes, locations, backgrounds, and self-occlusions. To tackle this problem, we propose utilizing facial landmark information explicitly. Since it is a challenging problem, we split it into two sub-tasks: (i) category-guided landmark generation, and (ii) landmark-guided expression-to-expression translation. The two sub-tasks are trained in an end-to-end fashion so that the generated landmarks and expressions can mutually benefit each other. Compared with current keypoint-guided approaches, the proposed LandmarkGAN only needs a single facial image to generate various expressions. Extensive experimental results on four public datasets demonstrate that the proposed LandmarkGAN achieves better results compared with state-of-the-art approaches only using a single image. The code is available at https://github.com/Ha0Tang/LandmarkGAN.

* Accepted to TAFFC

Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search

Aug 30, 2022

Shuanglin Yan, Hao Tang, Liyan Zhang, Jinhui Tang

Abstract: Text-based person search is a challenging task that aims to search for pedestrian images with the same identity from the image gallery given a query text description. In recent years, text-based person search has made good progress, and state-of-the-art methods achieve superior performance by learning local fine-grained correspondence between images and texts. However, the existing methods explicitly extract image parts and text phrases from images and texts by hand-crafted split or external tools and then conduct complex cross-modal local matching. Moreover, the existing methods seldom consider the problem of information inequality between modalities caused by image-specific information. In this paper, we propose an efficient joint Information and Semantic Alignment Network (ISANet) for text-based person search. Specifically, we first design an image-specific information suppression module, which suppresses image background and environmental factors by relation-guided localization and channel attention filtration, respectively. This design can effectively alleviate the problem of information inequality and realize the information alignment between images and texts. Secondly, we propose an implicit local alignment module to adaptively aggregate image and text features to a set of modality-shared semantic topic centers, and implicitly learn the local fine-grained correspondence between images and texts without additional supervision information and complex cross-modal interactions. Moreover, a global alignment is introduced as a supplement to the local perspective. Extensive experiments on multiple databases demonstrate the effectiveness and superiority of the proposed ISANet.

Training and Tuning Generative Neural Radiance Fields for Attribute-Conditional 3D-Aware Face Generation

Aug 26, 2022

Jichao Zhang, Aliaksandr Siarohin, Yahui Liu, Hao Tang, Nicu Sebe, Wei Wang

Abstract: 3D-aware GANs based on generative neural radiance fields (GNeRF) have achieved impressive high-quality image generation, while preserving strong 3D consistency. The most notable achievements are made in the face generation domain. However, most of these models focus on improving view consistency but neglect a disentanglement aspect, so they cannot provide high-quality semantic/attribute control over generation. To this end, we introduce a conditional GNeRF model that uses specific attribute labels as input in order to improve the controllability and disentangling abilities of 3D-aware generative models. We utilize the pre-trained 3D-aware model as the basis and integrate a dual-branch attribute-editing module (DAEM) that utilizes attribute labels to provide control over generation. Moreover, we propose a TRIOT (TRaining as Init, and Optimizing for Tuning) method to optimize the latent vector to further improve the precision of the attribute editing. Extensive experiments on the widely used FFHQ show that our model yields high-quality editing with better view consistency while preserving the non-target regions. The code is available at https://github.com/zhangqianhui/TT-GNeRF.

* 14 pages

Identity-Sensitive Knowledge Propagation for Cloth-Changing Person Re-identification

Aug 25, 2022

Jianbing Wu, Hong Liu, Wei Shi, Hao Tang, Jingwen Guo

Abstract: Cloth-changing person re-identification (CC-ReID), which aims to match person identities under clothing changes, has emerged as a research topic in recent years. However, typical biometrics-based CC-ReID methods often require cumbersome pose or body part estimators to learn cloth-irrelevant features from human biometric traits, which comes with high computational costs. Besides, the performance is significantly limited due to the resolution degradation of surveillance images. To address the above limitations, we propose an effective Identity-Sensitive Knowledge Propagation framework (DeSKPro) for CC-ReID. Specifically, a Cloth-irrelevant Spatial Attention module is introduced to eliminate the distraction of clothing appearance by acquiring knowledge from the human parsing module. To mitigate the resolution degradation issue and mine identity-sensitive cues from human faces, we propose to restore the missing facial details using prior facial knowledge, which is then propagated to a smaller network. After training, the extra computations for human parsing or face restoration are no longer required. Extensive experiments show that our framework outperforms state-of-the-art methods by a large margin. Our code is available at https://github.com/KimbingNg/DeskPro.

* IEEE International Conference on Image Processing (ICIP) 2022
