Yi Zhu

Generating Semantically Valid Adversarial Questions for TableQA

May 26, 2020
Yi Zhu, Menglin Xia, Yiwei Zhou

Adversarial attacks on question answering systems over tabular data (TableQA) can help evaluate to what extent they can understand natural language questions and reason with tables. However, generating natural language adversarial questions is difficult, because even a single character swap could lead to a huge semantic difference in human perception. In this paper, we propose SAGE (Semantically valid Adversarial GEnerator), a Wasserstein sequence-to-sequence model for white-box attacks on TableQA. To preserve the meaning of the original questions, we apply minimum risk training with SIMILE and entity delexicalization. We use Gumbel-Softmax to incorporate the adversarial loss for end-to-end training. Our experiments show that SAGE outperforms existing local attack models on semantic validity and fluency while achieving a good attack success rate. Finally, we demonstrate that adversarial training with SAGE-augmented data can improve the performance and robustness of TableQA systems.
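The abstract mentions Gumbel-Softmax as the way a loss is passed through a discrete token choice. As a rough illustration only (not the authors' implementation), the relaxation can be sketched in plain Python. The function name `gumbel_softmax`, the temperature `tau`, and the example logits are assumptions made for this sketch.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed (soft) one-hot sample over the given logits.

    Hypothetical helper for illustration; `tau` is the temperature.
    Lower `tau` pushes the output closer to a one-hot vector.
    """
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        # Gumbel(0, 1) noise: g = -log(-log(u)), with u ~ Uniform(0, 1).
        g = -math.log(-math.log(u))
        noisy.append((logit + g) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Example: a soft sample over a toy vocabulary of three tokens.
sample = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5, rng=random.Random(0))
```

Because the output is a smooth function of the logits, a framework with automatic differentiation can propagate gradients through it, which is what makes end-to-end training with a downstream loss possible.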

Revealing hidden dynamics from time-series data by ODENet

May 11, 2020
Pipi Hu, Wuyue Yang, Yi Zhu, Liu Hong

Understanding the hidden physical concepts behind observed data is the most basic but challenging problem in many fields. In this study, we propose a new type of interpretable neural network called the ordinary differential equation network (ODENet) to reveal the hidden dynamics buried in massive time-series data. Specifically, we construct explicit models, expressed as ordinary differential equations (ODEs), to describe the observed data without any prior knowledge. In contrast to previous neural networks, which are black boxes for users, the ODENet in this work imitates the difference scheme for ODEs, with each step computed by an ODE solver, and is thus completely understandable. Backpropagation algorithms are used to update the coefficients of a group of orthogonal basis functions, which specify the concrete form of the ODEs, under the guidance of a loss function with a sparsity requirement. From the classical Lotka-Volterra equations to the chaotic Lorenz equations, the ODENet demonstrates its remarkable capability to deal with time-series data. Finally, we apply the ODENet to real actin aggregation data observed by experimentalists, where it also shows impressive performance.
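The idea of a difference scheme whose right-hand side is a sparse combination of basis functions can be sketched as follows. This is a simplified illustration under stated assumptions, not the ODENet itself: it uses a forward Euler step, a scalar state, plain monomials rather than an orthogonal basis, and an L1 penalty as the sparsity term. All names (`basis`, `rhs`, `rollout`, `loss`, `lam`) are hypothetical.

```python
def basis(x):
    """Hypothetical library of candidate terms: 1, x, x^2."""
    return [1.0, x, x * x]


def rhs(x, coeffs):
    """Right-hand side dx/dt as a linear combination of basis terms."""
    return sum(c * b for c, b in zip(coeffs, basis(x)))


def rollout(x0, coeffs, dt, steps):
    """Integrate with forward Euler: x_{k+1} = x_k + dt * f(x_k)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + dt * rhs(xs[-1], coeffs))
    return xs


def loss(coeffs, observed, dt, lam=1e-3):
    """Mean squared error against the data plus an L1 sparsity penalty."""
    pred = rollout(observed[0], coeffs, dt, len(observed) - 1)
    mse = sum((p - o) ** 2 for p, o in zip(pred, observed)) / len(observed)
    return mse + lam * sum(abs(c) for c in coeffs)


# Example: data generated by dx/dt = -x is reproduced by coeffs [0, -1, 0].
data = rollout(1.0, [0.0, -1.0, 0.0], dt=0.1, steps=2)
```

In a trainable version, the coefficients would be the parameters updated by backpropagation, and the penalty would encourage most of them to vanish so that the recovered equation stays readable.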

Improving Semantic Segmentation via Self-Training

May 06, 2020
Yi Zhu, Zhongyue Zhang, Chongruo Wu, Zhi Zhang, Tong He, Hang Zhang, R. Manmatha, Mu Li, Alexander Smola

Deep learning usually achieves the best results with complete supervision. In the case of semantic segmentation, this means that large amounts of pixelwise annotations are required to learn accurate models. In this paper, we show that we can obtain state-of-the-art results using a semi-supervised approach, specifically a self-training paradigm. We first train a teacher model on labeled data, and then generate pseudo labels on a large set of unlabeled data. Our robust training framework can digest human-annotated and pseudo labels jointly and achieve top performance on the Cityscapes, CamVid and KITTI datasets while requiring significantly less supervision. We also demonstrate the effectiveness of self-training on a challenging cross-domain generalization task, outperforming the conventional fine-tuning method by a large margin. Lastly, to alleviate the computational burden caused by the large amount of pseudo labels, we propose a fast training schedule that accelerates the training of segmentation models by up to 2x without performance degradation.
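The teacher-then-pseudo-label step can be sketched in a few lines. This is a minimal illustration, not the paper's framework: the `teacher` callable, the flattened per-pixel probability lists, and the optional confidence cutoff `min_conf` with an `ignore_index` of 255 are all assumptions introduced here.

```python
def pseudo_label(pixel_probs, min_conf=0.0, ignore_index=255):
    """Turn per-pixel class probabilities into hard labels.

    Pixels whose top probability is below `min_conf` are marked with
    `ignore_index` (an assumption of this sketch, not of the paper).
    """
    labels = []
    for probs in pixel_probs:
        best = max(range(len(probs)), key=probs.__getitem__)
        labels.append(best if probs[best] >= min_conf else ignore_index)
    return labels


def build_training_set(labeled, unlabeled_images, teacher, min_conf=0.0):
    """Combine human-annotated pairs with teacher-generated pseudo labels."""
    pseudo = [
        (image, pseudo_label(teacher(image), min_conf))
        for image in unlabeled_images
    ]
    return labeled + pseudo


# Example with a stand-in teacher that returns fixed probabilities
# for a two-pixel, two-class "image".
def toy_teacher(image):
    return [[0.1, 0.9], [0.6, 0.4]]


train_set = build_training_set([("img_a", [0, 1])], ["img_b"], toy_teacher)
```

A student model would then be trained on `train_set`, treating both kinds of label uniformly except where pixels are ignored.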

ResNeSt: Split-Attention Networks

Apr 19, 2020
Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, Alexander Smola

While image classification models have recently continued to advance, most downstream applications such as object detection and semantic segmentation still employ ResNet variants as the backbone network due to their simple and modular structure. We present a simple and modular Split-Attention block that enables attention across feature-map groups. By stacking these Split-Attention blocks ResNet-style, we obtain a new ResNet variant which we call ResNeSt. Our network preserves the overall ResNet structure, so it can be used in downstream tasks straightforwardly without introducing additional computational costs. ResNeSt models outperform other networks with similar model complexities. For example, ResNeSt-50 achieves 81.13% top-1 accuracy on ImageNet using a single crop size of 224x224, outperforming the previous best ResNet variant by more than 1% accuracy. This improvement also helps downstream tasks including object detection, instance segmentation and semantic segmentation. For example, by simply replacing the ResNet-50 backbone with ResNeSt-50, we improve the mAP of Faster-RCNN on MS-COCO from 39.3% to 42.3% and the mIoU for DeeplabV3 on ADE20K from 42.1% to 45.1%.

Motion-Excited Sampler: Video Adversarial Attack with Sparked Prior

Mar 17, 2020
Hu Zhang, Linchao Zhu, Yi Zhu, Yi Yang

Deep neural networks are known to be susceptible to adversarial noise, that is, tiny and imperceptible perturbations. Most previous work on adversarial attacks focuses on image models, while the vulnerability of video models is less explored. In this paper, we aim to attack video models by utilizing the intrinsic movement patterns and regional relative motion among video frames. We propose an effective motion-excited sampler to obtain a motion-aware noise prior, which we term the sparked prior. Our sparked prior underlines frame correlations and utilizes video dynamics via relative motion. By using the sparked prior in gradient estimation, we can successfully attack a variety of video classification models with fewer queries. Extensive experimental results on four benchmark datasets validate the efficacy of our proposed method.

Vision-Dialog Navigation by Exploring Cross-modal Memory

Mar 15, 2020
Yi Zhu, Fengda Zhu, Zhaohuan Zhan, Bingqian Lin, Jianbin Jiao, Xiaojun Chang, Xiaodan Liang

Vision-dialog navigation, posed as a new holy-grail task in the vision-language field, aims at learning an agent that can continually converse in natural language to ask for help and navigate according to human responses. Besides the common challenges faced in vision-language navigation, vision-dialog navigation also requires handling the language intentions of a series of questions about the temporal context from the dialogue history, and co-reasoning over both dialogs and visual scenes. In this paper, we propose the Cross-modal Memory Network (CMN) for remembering and understanding the rich information relevant to historical navigation actions. Our CMN consists of two memory modules, the language memory module (L-mem) and the visual memory module (V-mem). Specifically, L-mem learns latent relationships between the current language interaction and the dialog history by employing a multi-head attention mechanism. V-mem learns to associate the current visual views with the cross-modal memory of previous navigation actions. The cross-modal memory is generated via vision-to-language attention and language-to-vision attention. Benefiting from the collaborative learning of the L-mem and the V-mem, our CMN is able to explore the memory of historical navigation decisions that is relevant to the current step. Experiments on the CVDN dataset show that our CMN outperforms the previous state-of-the-art model by a significant margin in both seen and unseen environments.

* CVPR2020

Generalizing Deep Models for Overhead Image Segmentation Through Getis-Ord Gi* Pooling

Dec 23, 2019
Xueqing Deng, Yi Zhu, Yuxin Tian, Shawn Newsam

That most deep learning models are purely data driven is both a strength and a weakness. Given sufficient training data, the optimal model for a particular problem can be learned. However, sufficient data is usually not available, so the model is instead either learned from scratch on a limited amount of training data or pre-trained on a different problem and then fine-tuned. Both of these situations are potentially suboptimal and limit the generalizability of the model. Motivated by this, we investigate methods to inform or guide deep learning models for geospatial image analysis to increase their performance when a limited amount of training data is available or when they are applied to scenarios other than those on which they were trained. In particular, we exploit the fact that there are certain fundamental rules as to how things are distributed on the surface of the Earth and that these rules do not vary substantially between locations. Based on this, we develop a novel feature pooling method for convolutional neural networks using Getis-Ord Gi* analysis from geostatistics. Experimental results show that our proposed pooling function has significantly better generalization performance than a standard data-driven approach when applied to overhead image segmentation.
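For readers unfamiliar with the statistic named above, the standard Getis-Ord Gi* z-score can be computed as in the sketch below. It covers only the statistic, on a 1D signal with binary weights inside a window of hypothetical `radius`; how the paper turns it into a pooling layer is not shown here, and the function name `getis_ord_gi_star` is an assumption.

```python
import math


def getis_ord_gi_star(values, radius=1):
    """Gi* z-score per position of a 1D signal, using binary weights.

    Weight w_ij is 1 when |i - j| <= radius (including j == i), else 0.
    Assumes the values are not all identical (otherwise S is zero).
    """
    n = len(values)
    mean = sum(values) / n
    # Global standard deviation S = sqrt(sum(x^2)/n - mean^2).
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for i in range(n):
        neighbours = range(max(0, i - radius), min(n, i + radius + 1))
        w_sum = float(len(neighbours))  # sum of w_ij
        w_sq_sum = w_sum                # sum of w_ij^2 (weights are 0 or 1)
        wx_sum = sum(values[j] for j in neighbours)
        numerator = wx_sum - mean * w_sum
        denominator = s * math.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores


# Example: a block of high values in the middle of a flat signal.
scores = getis_ord_gi_star([0, 0, 0, 10, 10, 10, 0, 0, 0], radius=1)
```

Positive scores mark local clusters of high values (hot spots) and negative scores mark clusters of low values, relative to the global mean.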

Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks

Nov 28, 2019
Fengda Zhu, Yi Zhu, Xiaojun Chang, Xiaodan Liang

Vision-Language Navigation (VLN) is a task where agents learn to navigate following natural language instructions. The key to this task is to perceive both the visual scene and natural language sequentially. Conventional approaches exploit the vision and language features in cross-modal grounding. However, the VLN task remains challenging, since previous works have neglected the rich semantic information contained in the environment (such as implicit navigation graphs or sub-trajectory semantics). In this paper, we introduce Auxiliary Reasoning Navigation (AuxRN), a framework with four self-supervised auxiliary reasoning tasks to take advantage of the additional training signals derived from the semantic information. The auxiliary tasks have four reasoning objectives: explaining the previous actions, estimating the navigation progress, predicting the next orientation, and evaluating the trajectory consistency. As a result, these additional training signals help the agent to acquire knowledge of semantic representations in order to reason about its activity and build a thorough perception of the environment. Our experiments indicate that auxiliary reasoning tasks improve both the performance of the main task and the model generalizability by a large margin. Empirically, we demonstrate that an agent trained with self-supervised auxiliary reasoning tasks substantially outperforms the previous state-of-the-art method, being the best existing approach on the standard benchmark.
