Pan Zhou

Huazhong University of Science and Technology

Towards Adversarial Patch Analysis and Certified Defense against Crowd Counting

Apr 22, 2021
Qiming Wu, Zhikang Zou, Pan Zhou, Xiaoqing Ye, Binghui Wang, Ang Li

Crowd counting has drawn much attention due to its importance in safety-critical surveillance systems. In particular, deep neural network (DNN) methods have significantly reduced estimation errors for crowd counting tasks. Recent studies have demonstrated that DNNs are vulnerable to adversarial attacks, i.e., normal images with human-imperceptible perturbations can mislead DNNs into making false predictions. In this work, we propose a robust attack strategy called Adversarial Patch Attack with Momentum (APAM) to systematically evaluate the robustness of crowd counting models, where the attacker's goal is to create an adversarial perturbation that severely degrades their performance, potentially leading to public safety accidents (e.g., stampedes). Specifically, the proposed attack leverages the extreme-density background information of input images to generate robust adversarial patches via a series of transformations (e.g., interpolation and rotation). We observe that by perturbing less than 6% of image pixels, our attacks severely degrade the performance of crowd counting systems, both digitally and physically. To enhance the adversarial robustness of crowd counting models, we propose the first regression model-based Randomized Ablation (RA), which is more effective than Adversarial Training (ADT): the Mean Absolute Error of RA is 5 lower than ADT on clean samples and 30 lower than ADT on adversarial examples. Extensive experiments on five crowd counting models demonstrate the effectiveness and generality of the proposed method. Code is available at https://github.com/harrywuhust2022/Adv-Crowd-analysis.
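
The abstract names a momentum-based patch attack but does not state its update rule. As a rough illustration only, the sketch below shows a generic momentum sign-gradient step of the kind often used in iterative attacks. The names `momentum_patch_step` and `toy_count_gradient`, and the values of `mu` and `alpha`, are assumptions made for this example and are not taken from the paper or its repository.

```python
def momentum_patch_step(patch, grad, velocity, mu=0.9, alpha=0.05):
    """One momentum sign-gradient ascent step on a flat list of patch pixels.

    patch, grad and velocity are equal-length lists of floats; pixel values
    are kept inside [0, 1].
    """
    # Normalise the gradient by its L1 norm before accumulating momentum.
    l1 = sum(abs(g) for g in grad) or 1.0
    new_velocity = [mu * v + g / l1 for v, g in zip(velocity, grad)]
    new_patch = []
    for p, v in zip(patch, new_velocity):
        step = alpha if v > 0 else -alpha if v < 0 else 0.0
        new_patch.append(min(1.0, max(0.0, p + step)))
    return new_patch, new_velocity


def toy_count_gradient(patch):
    # Hypothetical stand-in for the gradient of a counting loss with respect
    # to each patch pixel; a real attack would backpropagate through a model.
    return [1.0 for _ in patch]


patch = [0.5] * 4
velocity = [0.0] * 4
for _ in range(20):
    patch, velocity = momentum_patch_step(patch, toy_count_gradient(patch), velocity)
```

With this toy gradient every pixel is pushed upward until it saturates at the clipping bound.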

WNARS: WFST based Non-autoregressive Streaming End-to-End Speech Recognition

Apr 21, 2021
Zhichao Wang, Wenwen Yang, Pan Zhou, Wei Chen

Recently, attention-based encoder-decoder (AED) end-to-end (E2E) models have drawn more and more attention in the field of automatic speech recognition (ASR). AED models, however, still have drawbacks when deployed in commercial applications. Autoregressive beam search decoding makes them inefficient for high-concurrency applications. It is also inconvenient to integrate external word-level language models. Most importantly, AED models are difficult to use for streaming recognition due to the global attention mechanism. In this paper, we propose a novel framework, namely WNARS, using hybrid CTC-attention AED models and weighted finite-state transducers (WFST) to solve these problems together. We switch from autoregressive beam search to CTC branch decoding, which performs first-pass decoding with WFST in a chunk-wise streaming way. The decoder branch then performs second-pass rescoring on the generated hypotheses non-autoregressively. On the AISHELL-1 task, our WNARS achieves a character error rate of 5.22% with 640ms latency, which, to the best of our knowledge, is the state-of-the-art performance for online ASR. Further experiments on our 10,000-hour Mandarin task show the proposed method achieves more than 20% improvements with 50% latency compared to a strong TDNN-BLSTM lattice-free MMI baseline.
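
The two-pass idea described above (a CTC first pass followed by rescoring of the resulting hypotheses) can be sketched in a few lines. This is a simplified illustration under stated assumptions: the WFST search is omitted, the first pass is reduced to the standard CTC collapse rule, and the `rescore` helper with its interpolation `weight` is hypothetical rather than the paper's scoring formula.

```python
def ctc_collapse(frame_labels, blank=0):
    """Standard CTC collapse: merge repeated labels, then drop blanks."""
    tokens = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            tokens.append(label)
        prev = label
    return tokens


def rescore(hypotheses, decoder_score, weight=0.5):
    """Pick the best hypothesis after mixing first- and second-pass scores.

    hypotheses is a list of (tokens, first_pass_score) pairs, and
    decoder_score is any callable mapping tokens to a score. Each hypothesis
    is scored independently, so this step can run in parallel.
    """
    return max(
        hypotheses,
        key=lambda h: (1 - weight) * h[1] + weight * decoder_score(h[0]),
    )


collapsed = ctc_collapse([0, 1, 1, 0, 1, 2, 2, 0])

# Toy second-pass scorer that happens to prefer the second hypothesis.
toy_decoder = lambda tokens: -0.5 if tokens == [1, 3] else -2.0
best = rescore([([1, 2], -1.0), ([1, 3], -1.2)], toy_decoder)
```

Here the hypothesis with the lower first-pass score wins after rescoring, because the toy decoder score outweighs the first-pass gap.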

Darts-Conformer: Towards Efficient Gradient-Based Neural Architecture Search For End-to-End ASR

Apr 07, 2021
Xian Shi, Pan Zhou, Wei Chen, Lei Xie

Neural architecture search (NAS) has been successfully applied to tasks like image classification and language modeling for finding efficient high-performance network architectures. In the ASR field, especially end-to-end ASR, the related research is still in its infancy. In this work, we focus on applying NAS to the most popular manually designed model, Conformer, and then propose an efficient ASR model searching method that benefits from the natural advantage of differentiable architecture search (Darts) in reducing computational overheads. We fuse the Darts mutator and Conformer blocks to form a complete search space, within which a modified architecture called the Darts-Conformer cell is found automatically. The entire searching process on the AISHELL-1 dataset costs only 0.7 GPU days. Replacing the Conformer encoder by stacking the searched cell, we get an end-to-end ASR model (named Darts-Conformer) that outperforms the Conformer baseline by 4.7% on the open-source AISHELL-1 dataset. Besides, we verify the transferability of the architecture searched on a small dataset to a larger 2k-hour dataset. To the best of our knowledge, this is the first successful attempt to apply gradient-based architecture search in the attention-based encoder-decoder ASR model.

* Submitted to Interspeech 2021
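
Differentiable architecture search, as referenced in the abstract above, relaxes the choice among candidate operations into a softmax-weighted sum so that the architecture weights can be trained by gradient descent. The sketch below shows that mixed operation on scalars. The candidate operations and the names `mixed_op` and `derive_op` are illustrative assumptions and do not reflect the Darts-Conformer search space.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x, ops, alphas):
    """Continuous relaxation: softmax(alphas)-weighted sum of all candidate ops."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))


def derive_op(names, alphas):
    """After search, keep the single candidate with the largest weight."""
    return names[max(range(len(alphas)), key=alphas.__getitem__)]


# Hypothetical scalar candidates standing in for real network layers.
names = ["identity", "double", "zero"]
ops = [lambda x: x, lambda x: 2.0 * x, lambda x: 0.0]

uniform_output = mixed_op(3.0, ops, [0.0, 0.0, 0.0])
chosen = derive_op(names, [0.1, 1.5, -0.3])
```

With equal architecture weights the output is the plain average of the candidate outputs; once one weight dominates, the cell reduces to that single operation.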

Context-aware Biaffine Localizing Network for Temporal Sentence Grounding

Mar 22, 2021
Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou, Yu Cheng, Wei Wei, Zichuan Xu, Yulai Xie

This paper addresses the problem of temporal sentence grounding (TSG), which aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. Previous works either compare pre-defined candidate segments with the query and select the best one by ranking, or directly regress the boundary timestamps of the target segment. In this paper, we propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine mechanism. In particular, we present a Context-aware Biaffine Localizing Network (CBLN) which incorporates both local and global contexts into features of each start/end position for biaffine-based localization. The local contexts from the adjacent frames help distinguish visually similar appearances, and the global contexts from the entire video contribute to reasoning about temporal relations. Besides, we also develop a multi-modal self-attention module to provide fine-grained query-guided video representation for this biaffine strategy. Extensive experiments show that our CBLN significantly outperforms state-of-the-art methods on three public datasets (ActivityNet Captions, TACoS, and Charades-STA), demonstrating the effectiveness of the proposed localization framework.

* Accepted by CVPR 2021
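
Scoring all start/end pairs at once, as the abstract above describes, amounts to filling an n-by-n table with a bilinear form between start features and end features. The following is a minimal sketch assuming a purely bilinear score plus a scalar bias; the feature values, the matrix `U`, and the function names are made up for illustration and are not CBLN's actual parameterisation.

```python
def biaffine_scores(starts, ends, U, bias=0.0):
    """Score every (start i, end j) pair as starts[i]^T U ends[j] + bias.

    Pairs with j < i are not valid segments and keep a score of -inf.
    """
    n = len(starts)
    scores = [[float("-inf")] * n for _ in range(n)]
    for i in range(n):
        # Project the start feature once: row vector starts[i]^T U.
        proj = [
            sum(starts[i][a] * U[a][b] for a in range(len(U)))
            for b in range(len(U[0]))
        ]
        for j in range(i, n):
            scores[i][j] = sum(p * e for p, e in zip(proj, ends[j])) + bias
    return scores


def best_segment(scores):
    """Return the (start, end) index pair with the highest score."""
    n = len(scores)
    return max(
        ((i, j) for i in range(n) for j in range(i, n)),
        key=lambda ij: scores[ij[0]][ij[1]],
    )


# Toy per-position features for a three-frame "video".
starts = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
ends = [[0.0, 0.0], [0.0, 0.5], [2.0, 0.0]]
U = [[1.0, 0.0], [0.0, 1.0]]

scores = biaffine_scores(starts, ends, U)
segment = best_segment(scores)
```

Unlike ranking pre-defined candidates, every valid pair receives a score from the same table, and the predicted segment is simply its argmax.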

DPlis: Boosting Utility of Differentially Private Deep Learning via Randomized Smoothing

Mar 02, 2021
Wenxiao Wang, Tianhao Wang, Lun Wang, Nanqing Luo, Pan Zhou, Dawn Song, Ruoxi Jia

Deep learning techniques have achieved remarkable performance in wide-ranging tasks. However, when trained on privacy-sensitive datasets, the model parameters may expose private information in training data. Prior attempts at differentially private training, although offering rigorous privacy guarantees, lead to much lower model performance than non-private training. Besides, different runs of the same training algorithm produce models with large performance variance. To address these issues, we propose DPlis (Differentially Private Learning wIth Smoothing). The core idea of DPlis is to construct a smooth loss function that favors noise-resilient models lying in large flat regions of the loss landscape. We provide theoretical justification for the utility improvements of DPlis. Extensive experiments also demonstrate that DPlis can effectively boost model quality and training stability under a given privacy budget.

* 20 pages, 7 figures
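
One common way to build a smoothed loss of the kind the abstract above describes is to average the original loss over random perturbations of the weights. The sketch below estimates such an average by Monte Carlo sampling. Gaussian noise, the value of `sigma`, the sample count, and the function names are assumptions for this example; the differentially private training procedure itself is not shown.

```python
import random


def smoothed_loss(loss_fn, weights, sigma=0.5, num_samples=20000, seed=0):
    """Monte Carlo estimate of E[loss_fn(weights + z)] with z ~ N(0, sigma^2 I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        noisy = [w + rng.gauss(0.0, sigma) for w in weights]
        total += loss_fn(noisy)
    return total / num_samples


def quadratic(weights):
    # Toy loss; for this function the smoothed value is known in closed form:
    # sum(w^2) + len(weights) * sigma^2.
    return sum(w * w for w in weights)


estimate = smoothed_loss(quadratic, [1.0])
```

For the toy quadratic the estimate should land near 1 + 0.25 = 1.25. Sharp minima are penalised more than flat ones under this averaging, which is the intuition behind favouring flat regions.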

Adversarial Meta Sampling for Multilingual Low-Resource Speech Recognition

Dec 23, 2020
Yubei Xiao, Ke Gong, Pan Zhou, Guolin Zheng, Xiaodan Liang, Liang Lin

Low-resource automatic speech recognition (ASR) is challenging, as the low-resource target language data is insufficient to train an ASR model well. To solve this issue, meta-learning formulates ASR for each source language into many small ASR tasks and meta-learns a model initialization on all tasks from different source languages to enable fast adaptation on unseen target languages. However, for different source languages, the quantity and difficulty vary greatly because of their different data scales and diverse phonological systems, which leads to task-quantity and task-difficulty imbalance issues and thus a failure of multilingual meta-learning ASR (MML-ASR). In this work, we solve this problem by developing a novel adversarial meta sampling (AMS) approach to improve MML-ASR. When sampling tasks in MML-ASR, AMS adaptively determines the task sampling probability for each source language. Specifically, for each source language, a large query loss means that its tasks have not been sampled well enough, in terms of quantity and difficulty, to train the ASR model, and thus should be sampled more frequently for extra learning. Inspired by this fact, we feed the historical task query losses of all source language domains into a network to learn a task sampling policy for adversarially increasing the current query loss of MML-ASR. The learnt task sampling policy can thus master the learning situation of each language and predict a good task sampling probability for each language for more effective learning. Finally, experimental results on two multilingual datasets show significant performance improvement when applying our AMS to MML-ASR, and also demonstrate the applicability of AMS to other low-resource speech tasks and transfer learning ASR approaches. Our code is available at: https://github.com/iamxiaoyubei/AMS.

* Accepted by AAAI 2021
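
The intuition in the abstract above, that languages with a larger query loss should be sampled more often, can be illustrated without the learned policy network. The sketch below deliberately swaps the paper's policy network for a fixed softmax over a moving average of query losses, so it is a simplified stand-in rather than AMS itself; the language names, `decay`, and `temperature` are hypothetical.

```python
import math


def update_loss_history(history, language, query_loss, decay=0.9):
    """Track an exponential moving average of each language's query loss."""
    previous = history.get(language, query_loss)
    history[language] = decay * previous + (1 - decay) * query_loss
    return history


def sampling_probabilities(history, temperature=1.0):
    """Softmax over tracked losses: higher loss gives higher sampling probability."""
    languages = sorted(history)
    peak = max(history[lang] for lang in languages)
    exps = [math.exp((history[lang] - peak) / temperature) for lang in languages]
    total = sum(exps)
    return {lang: e / total for lang, e in zip(languages, exps)}


history = {}
for lang, loss in [("lang_a", 2.0), ("lang_b", 0.5), ("lang_c", 1.0)]:
    update_loss_history(history, lang, loss)

probs = sampling_probabilities(history)
```

A learned policy can react to the whole loss history rather than a single average, which is what the fixed rule here cannot do.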

Graph-Evolving Meta-Learning for Low-Resource Medical Dialogue Generation

Dec 22, 2020
Shuai Lin, Pan Zhou, Xiaodan Liang, Jianheng Tang, Ruihui Zhao, Ziliang Chen, Liang Lin

Human doctors with well-structured medical knowledge can diagnose a disease merely via a few conversations with patients about symptoms. In contrast, existing knowledge-grounded dialogue systems often require a large number of dialogue instances to learn, as they fail to capture the correlations between different diseases and neglect the diagnostic experience shared among them. To address this issue, we propose a more natural and practical paradigm, i.e., low-resource medical dialogue generation, which can transfer the diagnostic experience from source diseases to target ones with a handful of data for adaptation. It capitalizes on a commonsense knowledge graph to characterize the prior disease-symptom relations. Besides, we develop a Graph-Evolving Meta-Learning (GEML) framework that learns to evolve the commonsense graph for reasoning about disease-symptom correlations in a new disease, which effectively alleviates the need for a large number of dialogues. More importantly, by dynamically evolving disease-symptom graphs, GEML also addresses the real-world challenge that the disease-symptom correlations of each disease may vary or evolve as more diagnostic cases arrive. Extensive experimental results on the CMDD dataset and our newly-collected Chunyu dataset testify to the superiority of our approach over state-of-the-art approaches. Besides, our GEML can generate an enriched dialogue-sensitive knowledge graph in an online manner, which could benefit other tasks grounded on knowledge graphs.

* Accepted by AAAI 2021

Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation

Dec 10, 2020
Daizong Liu, Shuangjie Xu, Xiao-Yang Liu, Zichuan Xu, Wei Wei, Pan Zhou

This paper addresses the task of segmenting class-agnostic objects in the semi-supervised setting. Although previous detection-based methods achieve relatively good performance, these approaches extract the best proposal by a greedy strategy, which may lose the local patch details outside the chosen candidate. In this paper, we propose a novel spatiotemporal graph neural network (STG-Net) to reconstruct more accurate masks for video object segmentation, which captures the local contexts by utilizing all proposals. In the spatial graph, we treat object proposals of a frame as nodes and represent their correlations with an edge weight strategy for mask context aggregation. To capture temporal information from previous frames, we use a memory network to refine the mask of the current frame by retrieving historic masks in a temporal graph. The joint use of both local patch details and temporal relationships allows us to better address challenges such as object occlusion and missing objects. Without online learning and fine-tuning, our STG-Net achieves state-of-the-art performance on four large benchmarks (DAVIS, YouTube-VOS, SegTrack-v2, and YouTube-Objects), demonstrating the effectiveness of the proposed approach.

* Accepted by AAAI 2021
