Hao Huang

From Global to Local: Multi-scale Out-of-distribution Detection

Aug 20, 2023
Ji Zhang, Lianli Gao, Bingguang Hao, Hao Huang, Jingkuan Song, Hengtao Shen

Out-of-distribution (OOD) detection aims to detect "unknown" data whose labels have not been seen during in-distribution (ID) training. Recent progress in representation learning has given rise to distance-based OOD detection, which recognizes inputs as ID/OOD according to their relative distances to the training data of the ID classes. Previous approaches calculate pairwise distances using only global image representations, which can be sub-optimal: inevitable background clutter and intra-class variation may drive image-level representations from the same ID class far apart in a given representation space. In this work, we overcome this challenge by proposing Multi-scale OOD DEtection (MODE), the first framework to leverage both global visual information and local region details of images to maximally benefit OOD detection. Specifically, we first find that models pretrained with off-the-shelf cross-entropy or contrastive losses fail to capture valuable local representations for MODE, owing to the scale discrepancy between ID training and OOD detection. To mitigate this issue and encourage locally discriminative representations during ID training, we propose Attention-based Local PropAgation (ALPA), a trainable objective that exploits a cross-attention mechanism to align and highlight the local regions of the target objects in pairwise examples. During test-time OOD detection, a Cross-Scale Decision (CSD) function is further devised on the most discriminative multi-scale representations to distinguish ID/OOD data more faithfully. We demonstrate the effectiveness and flexibility of MODE on several benchmarks -- on average, MODE outperforms the previous state-of-the-art by up to 19.24% in FPR and 2.77% in AUROC. Code is available at https://github.com/JimZAI/MODE-OOD.

* 13 pages 
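
For intuition, here is a minimal sketch (our own illustration, not the authors' released code) of a cross-scale decision in the spirit of MODE: a global cosine score is fused with a local region-matching score. All shapes, the prototype construction, and the fusion rule are assumptions.

import torch
import torch.nn.functional as F

def cross_scale_score(global_feat, local_feats, id_global_protos, id_local_protos):
    """global_feat: (D,); local_feats: (R, D) region features of a test image.
    id_global_protos: (C, D) and id_local_protos: (C, S, D) per-class ID prototypes."""
    # Global branch: similarity to the nearest ID class prototype.
    s_global = F.cosine_similarity(global_feat[None, :], id_global_protos, dim=-1).max()
    # Local branch: match each test region to its best prototype region per class,
    # a crude stand-in for ALPA's cross-attention alignment.
    x = F.normalize(local_feats, dim=-1)
    p = F.normalize(id_local_protos, dim=-1)
    sim = torch.einsum("rd,csd->crs", x, p)           # (C, R, S) region similarities
    s_local = sim.max(dim=-1).values.mean(dim=-1).max()
    # Cross-scale decision: keep the more confident of the two scales
    # (the paper's CSD picks the most discriminative representation).
    return torch.maximum(s_global, s_local)           # low score -> likely OOD

Thresholding this score then yields the ID/OOD decision.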

DUAW: Data-free Universal Adversarial Watermark against Stable Diffusion Customization

Aug 19, 2023
Xiaoyu Ye, Hao Huang, Jiaqi An, Yongtao Wang

Stable Diffusion (SD) customization approaches enable users to personalize SD model outputs, greatly enhancing the flexibility and diversity of AI art. However, they also allow individuals to plagiarize specific styles or subjects from copyrighted images, which raises significant concerns about potential copyright infringement. To address this issue, we propose an invisible data-free universal adversarial watermark (DUAW), aiming to protect a myriad of copyrighted images from different customization approaches across various versions of SD models. First, DUAW is designed to disrupt the variational autoencoder during SD customization. Second, DUAW operates in a data-free context: it is trained on synthetic images produced by a Large Language Model (LLM) and a pretrained SD model. This approach circumvents the need to directly handle copyrighted images, thereby preserving their confidentiality. Once crafted, DUAW can be imperceptibly integrated into copyrighted images at scale, serving as a protective measure by inducing significant distortions in the images generated by customized SD models. Experimental results demonstrate that DUAW can effectively distort the outputs of fine-tuned SD models, rendering them discernible to both human observers and a simple classifier.

* 12 pages, 11 figures 
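
As a rough illustration of the data-free crafting loop described above (a sketch under assumptions: `vae` is any callable that reconstructs an image batch, and the budget and step sizes are placeholders, not the paper's settings):

import torch
import torch.nn.functional as F

def craft_duaw(vae, synthetic_loader, eps=8/255, alpha=1/255, epochs=5, device="cuda"):
    """Optimize one universal perturbation that maximizes the VAE's
    reconstruction error on LLM/SD-generated synthetic images."""
    delta = torch.zeros(1, 3, 512, 512, device=device, requires_grad=True)
    for _ in range(epochs):
        for x in synthetic_loader:                  # no copyrighted data needed
            x = x.to(device)
            recon = vae((x + delta).clamp(0, 1))    # VAE reconstruction pass
            loss = F.mse_loss(recon, x)             # distortion to be maximized
            loss.backward()
            with torch.no_grad():
                delta += alpha * delta.grad.sign()  # gradient-ascent (PGD-style) step
                delta.clamp_(-eps, eps)             # keep the watermark invisible
                delta.grad.zero_()
    return delta.detach()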

SE-Bridge: Speech Enhancement with Consistent Brownian Bridge

May 23, 2023
Zhibin Qiu, Mengfan Fu, Fuchun Sun, Gulila Altenbek, Hao Huang

We propose SE-Bridge, a novel method for speech enhancement (SE). Following the recent application of diffusion models to speech enhancement, SE can be achieved by solving a stochastic differential equation (SDE). Each SDE corresponds to a probability flow ordinary differential equation (PF-ODE), and the trajectory of the PF-ODE solution consists of the speech states at different moments. Our approach is based on a consistency model that ensures any speech states on the same PF-ODE trajectory correspond to the same initial state. By integrating the Brownian bridge process, the model is able to generate high-intelligibility speech samples without adversarial training. This is the first attempt to apply consistency models to the SE task, achieving state-of-the-art results on several metrics while requiring 15x less sampling time than the diffusion-based baseline. Our experiments on multiple datasets demonstrate the effectiveness of SE-Bridge for SE. Furthermore, extensive experiments on downstream tasks, including Automatic Speech Recognition (ASR) and Speaker Verification (SV), show that SE-Bridge can effectively support multiple downstream tasks.
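
To make the two ingredients above concrete, here is a schematic sketch (our own, with assumed parameterizations rather than the paper's exact formulation) of a Brownian-bridge state between clean speech x0 and noisy speech y, plus a consistency loss tying two points of a trajectory to the same prediction:

import math
import torch

def bridge_state(x0, y, t, T=1.0):
    """Brownian bridge from x0 (at t=0) to y (at t=T): the mean interpolates
    linearly and the variance t*(T-t)/T vanishes at both endpoints."""
    mean = (1 - t / T) * x0 + (t / T) * y
    std = math.sqrt(t * (T - t) / T)
    return mean + std * torch.randn_like(x0)

def consistency_loss(f, x0, y, t, s):
    """f(state, t) should map any state on the trajectory back to x0, so the
    predictions at times t and s must agree (the earlier one is the target).
    For brevity the two states are sampled independently here; the actual
    method keeps both points on the same PF-ODE trajectory."""
    xt, xs = bridge_state(x0, y, t), bridge_state(x0, y, s)
    return ((f(xt, t) - f(xs, s).detach()) ** 2).mean()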

Fashion Image Retrieval with Multi-Granular Alignment

Feb 27, 2023
Jinkuan Zhu, Hao Huang, Qiao Deng, Xiyao Li

The fashion image retrieval task aims to search a gallery for clothing items relevant to a query image. Previous recipes focus on designing different distance-based loss functions, pulling relevant pairs close and pushing irrelevant images apart. However, these methods ignore fine-grained features (e.g., neckband, cuff) of clothing images. In this paper, we propose a novel fashion image retrieval method that leverages both global and fine-grained features, dubbed Multi-Granular Alignment (MGA). Specifically, we design a Fine-Granular Aggregator (FGA) to capture and aggregate detailed patterns. We then propose Attention-based Token Alignment (ATA) to align image features at the multi-granular level in a coarse-to-fine manner. To prove the effectiveness of our proposed method, we conduct experiments on two sub-tasks (In-Shop and Consumer2Shop) of the public fashion dataset DeepFashion. The experimental results show that MGA outperforms the state-of-the-art methods by 1.8% and 0.6% on the R@1 metric in the two sub-tasks, respectively.
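
A toy sketch of the multi-granular matching idea (the fusion weight, shapes, and pooling are our assumptions, not MGA's exact design):

import torch
import torch.nn.functional as F

def mga_scores(q_global, q_tokens, g_global, g_tokens, w_fine=0.5):
    """q_global: (D,), q_tokens: (T, D) for the query;
    g_global: (N, D), g_tokens: (N, S, D) for a gallery of N items."""
    # Coarse level: plain cosine similarity of global features.
    coarse = F.normalize(g_global, dim=-1) @ F.normalize(q_global, dim=-1)    # (N,)
    # Fine level: align each query token with its best-matching gallery token,
    # a simple stand-in for attention-based token alignment.
    q = F.normalize(q_tokens, dim=-1)
    g = F.normalize(g_tokens, dim=-1)
    fine = torch.einsum("td,nsd->nts", q, g).max(dim=-1).values.mean(dim=-1)  # (N,)
    return coarse + w_fine * fine             # higher score = more relevant

# ranking = mga_scores(...).argsort(descending=True)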

T-SEA: Transfer-based Self-Ensemble Attack on Object Detection

Nov 16, 2022
Hao Huang, Ziyan Chen, Huanran Chen, Yongtao Wang, Kevin Zhang

Compared to query-based black-box attacks, transfer-based black-box attacks do not require any information about the attacked models, which ensures their secrecy. However, most existing transfer-based approaches rely on ensembling multiple models to boost attack transferability, which is time- and resource-intensive, not to mention the difficulty of obtaining diverse models for the same task. To address this limitation, in this work we focus on single-model transfer-based black-box attacks on object detection, utilizing only one model to achieve a highly transferable adversarial attack on multiple black-box detectors. Specifically, we first examine the patch optimization process of an existing method and propose an enhanced attack framework by slightly adjusting its training strategies. Then, we draw an analogy between patch optimization and regular model optimization, and propose a series of self-ensemble approaches on the input data, the attacked model, and the adversarial patch to efficiently use the limited information and prevent the patch from overfitting. The experimental results show that the proposed framework can be combined with multiple classical base attack methods (e.g., PGD and MIM) to greatly improve the black-box transferability of the well-optimized patch on multiple mainstream detectors, while also boosting white-box performance. Our code is available at https://github.com/VDIGPKU/T-SEA.

* 10 pages, 5 figures 
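
The self-ensemble idea can be sketched as follows (a hedged illustration, not the released T-SEA code: `apply_patch`, `det_loss`, and `augment` are assumed callables, and random dropout stands in for the paper's patch cutout):

import torch

def attack_step(patch, images, apply_patch, det_loss, augment,
                momentum, alpha=2/255, mu=1.0, k=4):
    """One MIM-style update of the adversarial patch, averaging gradients
    over k random views of the data and the patch (the self-ensemble)."""
    grad = torch.zeros_like(patch)
    for _ in range(k):
        p = patch.clone().requires_grad_(True)
        x = augment(images)                              # input-level ensemble
        mask = (torch.rand_like(p) > 0.1).float()        # patch-level ensemble
        loss = det_loss(apply_patch(x, p * mask))        # objective to minimize
        grad += torch.autograd.grad(loss, p)[0] / k
    momentum = mu * momentum + grad / grad.abs().mean().clamp_min(1e-12)
    patch = (patch - alpha * momentum.sign()).clamp(0, 1)
    return patch, momentum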

Speech-text based multi-modal training with bidirectional attention for improved speech recognition

Nov 01, 2022
Yuhang Yang, Haihua Xu, Hao Huang, Eng Siong Chng, Sheng Li

To let a state-of-the-art end-to-end ASR model benefit from data efficiency, as well as from much more unpaired text data via multi-modal training, one needs to address two problems: 1) the synchronicity of feature sampling rates between speech and language (i.e., text data); 2) the homogeneity of the representations learned by the two encoders. In this paper, we propose a novel bidirectional attention mechanism (BiAM) to jointly learn the ASR encoder (bottom layers) and a text encoder with a multi-modal learning method. BiAM facilitates feature sampling-rate exchange, so that the quality of features transformed from one modality can be measured in the other modality's space, under diversified objective functions. As a result, the speech representations are enriched with more linguistic information, while the representations generated by the text encoder become more similar to the corresponding speech ones; the shared ASR model is therefore more amenable to pretraining on unpaired text data. To validate the efficacy of the proposed method, we perform two categories of experiments, with and without extra unpaired text data. Experimental results on the Librispeech corpus show that the method achieves up to 6.15% word error rate reduction (WERR) with paired data alone, and 9.23% WERR when additional unpaired text data is employed.

* 5 pages, 3 figures, 3 tables 
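
A schematic sketch of the bidirectional attention exchange (dimensions and the single shared score matrix are simplifying assumptions):

import torch

def bidirectional_attention(speech, text):
    """speech: (B, S, D), text: (B, L, D); projection layers omitted.
    Each modality is re-expressed on the other's time axis, so sequences
    with different 'sampling rates' become directly comparable."""
    d = speech.size(-1)
    scores = speech @ text.transpose(1, 2) / d ** 0.5                  # (B, S, L)
    text_on_speech_axis = torch.softmax(scores, dim=-1) @ text         # (B, S, D)
    speech_on_text_axis = torch.softmax(scores.transpose(1, 2), dim=-1) @ speech  # (B, L, D)
    return text_on_speech_axis, speech_on_text_axis

# Training can then compare speech with text_on_speech_axis (length S) and
# text with speech_on_text_axis (length L), e.g. with MSE or contrastive losses.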

SRTNet: Time Domain Speech Enhancement Via Stochastic Refinement

Oct 30, 2022
Zhibin Qiu, Mengfan Fu, Yinfeng Yu, LiLi Yin, Fuchun Sun, Hao Huang

The diffusion model, a new generative model that is very popular in image generation and audio synthesis, is rarely used in speech enhancement. In this paper, we use the diffusion model as a module for stochastic refinement. We propose SRTNet, a novel method for speech enhancement via stochastic refinement entirely in the time domain. Specifically, we design a joint network consisting of a deterministic module and a stochastic module, which together form the ``enhance-and-refine'' paradigm. We theoretically demonstrate the feasibility of our method and experimentally show that it achieves faster training, faster sampling, and higher quality. Our code and enhanced samples are available at https://github.com/zhibinQiu/SRTNet.git.
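
Conceptually, the enhance-and-refine pipeline looks as follows (module internals, the refiner signature, and the number of refinement steps are placeholders, not SRTNet's actual architecture):

import torch
import torch.nn as nn

class EnhanceAndRefine(nn.Module):
    """Deterministic first-stage estimate + diffusion-style stochastic
    refinement of the residual, all in the time domain."""
    def __init__(self, enhancer: nn.Module, refiner: nn.Module, steps: int = 10):
        super().__init__()
        self.enhancer, self.refiner, self.steps = enhancer, refiner, steps

    def forward(self, noisy):                       # noisy: (B, 1, T) waveform
        coarse = self.enhancer(noisy)               # deterministic module
        x = torch.randn_like(coarse)                # stochastic module refines the residual
        for t in reversed(range(self.steps)):       # schematic reverse-diffusion loop
            step = torch.full((noisy.size(0),), t, device=noisy.device)
            x = self.refiner(x, coarse, step)
        return coarse + x                           # refined clean-speech estimate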

A Policy-based Approach to the SpecAugment Method for Low Resource E2E ASR

Oct 16, 2022
Rui Li, Guodong Ma, Dexin Zhao, Ranran Zeng, Xiaoyu Li, Hao Huang

SpecAugment is a very effective data augmentation method for both HMM-based and E2E-based automatic speech recognition (ASR) systems, and it also works in low-resource scenarios. However, SpecAugment masks the spectrogram along the time or frequency dimension under a fixed augmentation policy, which may bring relatively little data diversity to low-resource ASR. In this paper, we propose a policy-based SpecAugment (Policy-SpecAugment) method to alleviate this problem. The idea is to replace the fixed scheme with an augmentation-selection policy and an augmentation-parameter-changing policy. These policies are learned from the validation loss and applied to the corresponding augmentation operations, encouraging the model to see more of the diverse data it needs. In experiments, we evaluate the effectiveness of our approach in a low-resource scenario, the 100-hour LibriSpeech task. The results and analysis show that the above issue is clearly alleviated by our proposal. In addition, the experimental results show that, compared with the state-of-the-art SpecAugment, the proposed Policy-SpecAugment achieves a relative WER reduction of more than 10% on the test/dev-clean sets, more than 5% on the test/dev-other sets, and an absolute WER reduction of more than 1% on all test sets.

* Accepted to APSIPA ASC 2022 
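
A toy sketch of the policy idea (the operation set, softmax weighting, and update rule are illustrative assumptions, not the paper's exact formulation):

import numpy as np

OPS = ["time_mask", "freq_mask", "time_warp"]          # candidate augmentations

def sample_op(weights):
    """Pick an augmentation according to softmax-normalized policy weights."""
    probs = np.exp(weights) / np.exp(weights).sum()
    return np.random.choice(len(OPS), p=probs)

def update_policy(weights, op_idx, val_loss_before, val_loss_after, lr=0.1):
    """Reward operations whose application reduced the validation loss."""
    weights[op_idx] += lr * (val_loss_before - val_loss_after)
    return weights

weights = np.zeros(len(OPS))
# Training loop: op = sample_op(weights); apply OPS[op] to the batch; then
# periodically call update_policy with validation losses measured before/after.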