Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Boqing Gong

Defend Deep Neural Networks Against Adversarial Examples via Fixed andDynamic Quantized Activation Functions

Jul 18, 2018
Adnan Siraj Rakin, Jinfeng Yi, Boqing Gong, Deliang Fan

Figure 1 for Defend Deep Neural Networks Against Adversarial Examples via Fixed andDynamic Quantized Activation Functions

Figure 2 for Defend Deep Neural Networks Against Adversarial Examples via Fixed andDynamic Quantized Activation Functions

Figure 3 for Defend Deep Neural Networks Against Adversarial Examples via Fixed andDynamic Quantized Activation Functions

Figure 4 for Defend Deep Neural Networks Against Adversarial Examples via Fixed andDynamic Quantized Activation Functions

Recent studies have shown that deep neural networks (DNNs) are vulnerable to adversarial attacks. To this end, many defense approaches that attempt to improve the robustness of DNNs have been proposed. In a separate and yet related area, recent works have explored to quantize neural network weights and activation functions into low bit-width to compress model size and reduce computational complexity. In this work,we find that these two different tracks, namely the pursuit of network compactness and robustness, can bemerged into one and give rise to networks of both advantages. To the best of our knowledge, this is the first work that uses quantization of activation functions to defend against adversarial examples. We also propose to train robust neural networks by using adaptive quantization techniques for the activation functions. Our proposed Dynamic Quantized Activation (DQA) is verified through a wide range of experiments with the MNIST and CIFAR-10 datasets under different white-box attack methods, including FGSM, PGD, andC&W attacks. Furthermore, Zeroth Order Optimization and substitute model based black-box attacks are also considered in this work. The experimental results clearly show that the robustness of DNNs could be greatly improved using the proposed DQA.

Via

Access Paper or Ask Questions

End-to-End Learning of Motion Representation for Video Understanding

Apr 02, 2018
Lijie Fan, Wenbing Huang, Chuang Gan, Stefano Ermon, Boqing Gong, Junzhou Huang

Figure 1 for End-to-End Learning of Motion Representation for Video Understanding

Figure 2 for End-to-End Learning of Motion Representation for Video Understanding

Figure 3 for End-to-End Learning of Motion Representation for Video Understanding

Figure 4 for End-to-End Learning of Motion Representation for Video Understanding

Despite the recent success of end-to-end learned representations, hand-crafted optical flow features are still widely used in video analysis tasks. To fill this gap, we propose TVNet, a novel end-to-end trainable neural network, to learn optical-flow-like features from data. TVNet subsumes a specific optical flow solver, the TV-L1 method, and is initialized by unfolding its optimization iterations as neural layers. TVNet can therefore be used directly without any extra learning. Moreover, it can be naturally concatenated with other task-specific networks to formulate an end-to-end architecture, thus making our method more efficient than current multi-stage approaches by avoiding the need to pre-compute and store features on disk. Finally, the parameters of the TVNet can be further fine-tuned by end-to-end training. This enables TVNet to learn richer and task-specific patterns beyond exact optical flow. Extensive experiments on two action recognition benchmarks verify the effectiveness of the proposed approach. Our TVNet achieves better accuracies than all compared methods, while being competitive with the fastest counterpart in terms of features extraction time.

* CVPR 2018 spotlight. The first two authors contributed equally to this paper

Via

Access Paper or Ask Questions

A Semi-Supervised Two-Stage Approach to Learning from Noisy Labels

Mar 21, 2018
Yifan Ding, Liqiang Wang, Deliang Fan, Boqing Gong

Figure 1 for A Semi-Supervised Two-Stage Approach to Learning from Noisy Labels

Figure 2 for A Semi-Supervised Two-Stage Approach to Learning from Noisy Labels

Figure 3 for A Semi-Supervised Two-Stage Approach to Learning from Noisy Labels

Figure 4 for A Semi-Supervised Two-Stage Approach to Learning from Noisy Labels

The recent success of deep neural networks is powered in part by large-scale well-labeled training data. However, it is a daunting task to laboriously annotate an ImageNet-like dateset. On the contrary, it is fairly convenient, fast, and cheap to collect training images from the Web along with their noisy labels. This signifies the need of alternative approaches to training deep neural networks using such noisy labels. Existing methods tackling this problem either try to identify and correct the wrong labels or reweigh the data terms in the loss function according to the inferred noisy rates. Both strategies inevitably incur errors for some of the data points. In this paper, we contend that it is actually better to ignore the labels of some of the data points than to keep them if the labels are incorrect, especially when the noisy rate is high. After all, the wrong labels could mislead a neural network to a bad local optimum. We suggest a two-stage framework for the learning from noisy labels. In the first stage, we identify a small portion of images from the noisy training set of which the labels are correct with a high probability. The noisy labels of the other images are ignored. In the second stage, we train a deep neural network in a semi-supervised manner. This framework effectively takes advantage of the whole training set and yet only a portion of its labels that are most likely correct. Experiments on three datasets verify the effectiveness of our approach especially when the noisy rate is high.

* IEEE Winter Conf. on Applications of Computer Vision 2018

Via

Access Paper or Ask Questions

End-to-End Video Captioning with Multitask Reinforcement Learning

Mar 21, 2018
Lijun Li, Boqing Gong

Figure 1 for End-to-End Video Captioning with Multitask Reinforcement Learning

Although end-to-end (E2E) learning has led to promising performance on a variety of tasks, it is often impeded by hardware constraints (e.g., GPU memories) and is prone to overfitting. When it comes to video captioning, one of the most challenging benchmark tasks in computer vision and machine learning, those limitations of E2E learning are especially amplified by the fact that both the input videos and output captions are lengthy sequences. Indeed, state-of-the-art methods of video captioning process video frames by convolutional neural networks and generate captions by unrolling recurrent neural networks. If we connect them in an E2E manner, the resulting model is both memory-consuming and data-hungry, making it extremely hard to train. In this paper, we propose a multitask reinforcement learning approach to training an E2E video captioning model. The main idea is to mine and construct as many effective tasks (e.g., attributes, rewards, and the captions) as possible from the human captioned videos such that they can jointly regulate the search space of the E2E neural network, from which an E2E video captioning model can be found and generalized to the testing phase. To the best of our knowledge, this is the first video captioning model that is trained end-to-end from the raw video input to the caption output. Experimental results show that such a model outperforms existing ones to a large margin on two benchmark video captioning datasets.

Via

Access Paper or Ask Questions

Improving the Improved Training of Wasserstein GANs: A Consistency Term and Its Dual Effect

Mar 05, 2018
Xiang Wei, Boqing Gong, Zixia Liu, Wei Lu, Liqiang Wang

Figure 1 for Improving the Improved Training of Wasserstein GANs: A Consistency Term and Its Dual Effect

Figure 2 for Improving the Improved Training of Wasserstein GANs: A Consistency Term and Its Dual Effect

Figure 3 for Improving the Improved Training of Wasserstein GANs: A Consistency Term and Its Dual Effect

Figure 4 for Improving the Improved Training of Wasserstein GANs: A Consistency Term and Its Dual Effect

Despite being impactful on a variety of problems and applications, the generative adversarial nets (GANs) are remarkably difficult to train. This issue is formally analyzed by \cite{arjovsky2017towards}, who also propose an alternative direction to avoid the caveats in the minmax two-player training of GANs. The corresponding algorithm, called Wasserstein GAN (WGAN), hinges on the 1-Lipschitz continuity of the discriminator. In this paper, we propose a novel approach to enforcing the Lipschitz continuity in the training procedure of WGANs. Our approach seamlessly connects WGAN with one of the recent semi-supervised learning methods. As a result, it gives rise to not only better photo-realistic samples than the previous methods but also state-of-the-art semi-supervised learning results. In particular, our approach gives rise to the inception score of more than 5.0 with only 1,000 CIFAR-10 images and is the first that exceeds the accuracy of 90% on the CIFAR-10 dataset using only 4,000 labeled images, to the best of our knowledge.

* Accepted as a conference paper in International Conference on Learning Representation(ICLR). Xiang Wei and Boqing Gong contributed equally in this work

Via

Access Paper or Ask Questions

Blind Pre-Processing: A Robust Defense Method Against Adversarial Examples

Feb 07, 2018
Adnan Siraj Rakin, Zhezhi He, Boqing Gong, Deliang Fan

Figure 1 for Blind Pre-Processing: A Robust Defense Method Against Adversarial Examples

Figure 2 for Blind Pre-Processing: A Robust Defense Method Against Adversarial Examples

Figure 3 for Blind Pre-Processing: A Robust Defense Method Against Adversarial Examples

Figure 4 for Blind Pre-Processing: A Robust Defense Method Against Adversarial Examples

Deep learning algorithms and networks are vulnerable to perturbed inputs which is known as the adversarial attack. Many defense methodologies have been investigated to defend against such adversarial attack. In this work, we propose a novel methodology to defend the existing powerful attack model. We for the first time introduce a new attacking scheme for the attacker and set a practical constraint for white box attack. Under this proposed attacking scheme, we present the best defense ever reported against some of the recent strong attacks. It consists of a set of nonlinear function to process the input data which will make it more robust over the adversarial attack. However, we make this processing layer completely hidden from the attacker. Blind pre-processing improves the white box attack accuracy of MNIST from 94.3\% to 98.7\%. Even with increasing defense when others defenses completely fail, blind pre-processing remains one of the strongest ever reported. Another strength of our defense is that it eliminates the need for adversarial training as it can significantly increase the MNIST accuracy without adversarial training as well. Additionally, blind pre-processing can also increase the inference accuracy in the face of a powerful attack on CIFAR-10 and SVHN data set as well without much sacrificing clean data accuracy.

Via

Access Paper or Ask Questions

Infinite-Label Learning with Semantic Output Codes

Oct 21, 2017
Yang Zhang, Rupam Acharyya, Ji Liu, Boqing Gong

Figure 1 for Infinite-Label Learning with Semantic Output Codes

Figure 2 for Infinite-Label Learning with Semantic Output Codes

Figure 3 for Infinite-Label Learning with Semantic Output Codes

We develop a new statistical machine learning paradigm, named infinite-label learning, to annotate a data point with more than one relevant labels from a candidate set, which pools both the finite labels observed at training and a potentially infinite number of previously unseen labels. The infinite-label learning fundamentally expands the scope of conventional multi-label learning, and better models the practical requirements in various real-world applications, such as image tagging, ads-query association, and article categorization. However, how can we learn a labeling function that is capable of assigning to a data point the labels omitted from the training set? To answer the question, we seek some clues from the recent work on zero-shot learning, where the key is to represent a class/label by a vector of semantic codes, as opposed to treating them as atomic labels. We validate the infinite-label learning by a PAC bound in theory and some empirical studies on both synthetic and real data.

Via

Access Paper or Ask Questions

VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation

Aug 15, 2017
Chuang Gan, Yandong Li, Haoxiang Li, Chen Sun, Boqing Gong

Figure 1 for VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation

Figure 2 for VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation

Figure 3 for VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation

Figure 4 for VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation

Rich and dense human labeled datasets are among the main enabling factors for the recent advance on vision-language understanding. Many seemingly distant annotations (e.g., semantic segmentation and visual question answering (VQA)) are inherently connected in that they reveal different levels and perspectives of human understandings about the same visual scenes --- and even the same set of images (e.g., of COCO). The popularity of COCO correlates those annotations and tasks. Explicitly linking them up may significantly benefit both individual tasks and the unified vision and language modeling. We present the preliminary work of linking the instance segmentations provided by COCO to the questions and answers (QAs) in the VQA dataset, and name the collected links visual questions and segmentation answers (VQS). They transfer human supervision between the previously separate tasks, offer more effective leverage to existing problems, and also open the door for new research problems and models. We study two applications of the VQS data in this paper: supervised attention for VQA and a novel question-focused semantic segmentation task. For the former, we obtain state-of-the-art results on the VQA real multiple-choice task by simply augmenting the multilayer perceptrons with some attention features that are learned using the segmentation-QA links as explicit supervision. To put the latter in perspective, we study two plausible methods and compare them to an oracle method assuming that the instance segmentations are given at the test stage.

* To appear on ICCV 2017

Via

Access Paper or Ask Questions

Query-Focused Video Summarization: Dataset, Evaluation, and A Memory Network Based Approach

Jul 16, 2017
Aidean Sharghi, Jacob S. Laurel, Boqing Gong

Figure 1 for Query-Focused Video Summarization: Dataset, Evaluation, and A Memory Network Based Approach

Figure 2 for Query-Focused Video Summarization: Dataset, Evaluation, and A Memory Network Based Approach

Figure 3 for Query-Focused Video Summarization: Dataset, Evaluation, and A Memory Network Based Approach

Figure 4 for Query-Focused Video Summarization: Dataset, Evaluation, and A Memory Network Based Approach

Recent years have witnessed a resurgence of interest in video summarization. However, one of the main obstacles to the research on video summarization is the user subjectivity - users have various preferences over the summaries. The subjectiveness causes at least two problems. First, no single video summarizer fits all users unless it interacts with and adapts to the individual users. Second, it is very challenging to evaluate the performance of a video summarizer. To tackle the first problem, we explore the recently proposed query-focused video summarization which introduces user preferences in the form of text queries about the video into the summarization process. We propose a memory network parameterized sequential determinantal point process in order to attend the user query onto different video frames and shots. To address the second challenge, we contend that a good evaluation metric for video summarization should focus on the semantic information that humans can perceive rather than the visual features or temporal overlaps. To this end, we collect dense per-video-shot concept annotations, compile a new dataset, and suggest an efficient evaluation method defined upon the concept annotations. We conduct extensive experiments contrasting our video summarizer to existing ones and present detailed analyses about the dataset and the new evaluation method.

Via

Access Paper or Ask Questions

Improving Facial Attribute Prediction using Semantic Segmentation

Apr 27, 2017
Mahdi M. Kalayeh, Boqing Gong, Mubarak Shah

Figure 1 for Improving Facial Attribute Prediction using Semantic Segmentation

Figure 2 for Improving Facial Attribute Prediction using Semantic Segmentation

Figure 3 for Improving Facial Attribute Prediction using Semantic Segmentation

Figure 4 for Improving Facial Attribute Prediction using Semantic Segmentation

Attributes are semantically meaningful characteristics whose applicability widely crosses category boundaries. They are particularly important in describing and recognizing concepts where no explicit training example is given, \textit{e.g., zero-shot learning}. Additionally, since attributes are human describable, they can be used for efficient human-computer interaction. In this paper, we propose to employ semantic segmentation to improve facial attribute prediction. The core idea lies in the fact that many facial attributes describe local properties. In other words, the probability of an attribute to appear in a face image is far from being uniform in the spatial domain. We build our facial attribute prediction model jointly with a deep semantic segmentation network. This harnesses the localization cues learned by the semantic segmentation to guide the attention of the attribute prediction to the regions where different attributes naturally show up. As a result of this approach, in addition to recognition, we are able to localize the attributes, despite merely having access to image level labels (weak supervision) during training. We evaluate our proposed method on CelebA and LFWA datasets and achieve superior results to the prior arts. Furthermore, we show that in the reverse problem, semantic face parsing improves when facial attributes are available. That reaffirms the need to jointly model these two interconnected tasks.

Via

Access Paper or Ask Questions