Heng Tao Shen

Relation Regularized Scene Graph Generation

Feb 22, 2022

Yuyu Guo, Lianli Gao, Jingkuan Song, Peng Wang, Nicu Sebe, Heng Tao Shen, Xuelong Li

Abstract: Scene graph generation (SGG) builds on detected objects to predict pairwise visual relations between them, yielding an abstract description of the image content. Existing works have revealed that if the links between objects are given as prior knowledge, the performance of SGG is significantly improved. Inspired by this observation, in this article, we propose a relation regularized network (R2-Net), which can predict whether there is a relationship between two objects and use this relation to refine object features and improve SGG. Specifically, we first construct an affinity matrix among detected objects to represent the probability of a relationship between two objects. Graph convolution networks (GCNs) over this relation affinity matrix are then used as object encoders, producing relation-regularized representations of objects. With these relation-regularized features, our R2-Net can effectively refine object labels and generate scene graphs. Extensive experiments are conducted on the Visual Genome dataset for three SGG tasks (i.e., predicate classification, scene graph classification, and scene graph detection), demonstrating the effectiveness of our proposed method. Ablation studies also verify the key roles of our proposed components in performance improvement.
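
The encoder described in this abstract, a GCN applied over a relation affinity matrix, can be illustrated with a minimal pure-Python sketch. The abstract does not give the layer equations, so the self-loops, the row normalization, the ReLU, and the names `matmul` and `gcn_layer` are assumptions made for illustration, not R2-Net's actual implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(affinity, features, weight):
    """One graph-convolution step over a relation affinity matrix.

    affinity[i][j]: assumed probability that objects i and j are related.
    features:       one feature row per detected object.
    weight:         projection matrix (learned in practice, fixed here).
    """
    n = len(affinity)
    # Assumption: add self-loops so each object keeps its own features.
    a_hat = [[affinity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # Assumption: row-normalize so each object averages over its neighbors.
    a_norm = []
    for row in a_hat:
        total = sum(row)
        a_norm.append([v / total for v in row])
    aggregated = matmul(a_norm, features)   # mix features of related objects
    projected = matmul(aggregated, weight)  # linear projection
    return [[max(0.0, v) for v in row] for row in projected]  # ReLU


# Hypothetical example: objects 0 and 1 are likely related, object 2 is isolated.
affinity = [[0.0, 0.9, 0.0],
            [0.9, 0.0, 0.0],
            [0.0, 0.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
refined = gcn_layer(affinity, features, identity)
```

In this toy input, objects 0 and 1 exchange feature mass because their affinity is high, while the isolated object 2 keeps its own features unchanged.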

From General to Specific: Informative Scene Graph Generation via Balance Adjustment

Aug 30, 2021

Yuyu Guo, Lianli Gao, Xuanhan Wang, Yuxuan Hu, Xing Xu, Xu Lu, Heng Tao Shen, Jingkuan Song

Abstract: The scene graph generation (SGG) task aims to detect visual relationship triplets, i.e., subject, predicate, object, in an image, providing a structural vision layout for scene understanding. However, current models are stuck on common predicates, e.g., "on" and "at", rather than informative ones, e.g., "standing on" and "looking at", resulting in a loss of precise information and lower overall performance. If a model only uses "stone on road" rather than "blocking" to describe an image, it is easy to misunderstand the scene. We argue that this phenomenon is caused by two key imbalances between informative predicates and common ones, i.e., semantic space level imbalance and training sample level imbalance. To tackle this problem, we propose BA-SGG, a simple yet effective SGG framework based on balance adjustment rather than the conventional distribution fitting. It integrates two components, Semantic Adjustment (SA) and Balanced Predicate Learning (BPL), which adjust these two imbalances, respectively. Benefiting from its model-agnostic process, our method is easily applied to state-of-the-art SGG models and significantly improves SGG performance. Our method achieves 14.3%, 8.0%, and 6.1% higher Mean Recall (mR) than the Transformer model on the three scene graph generation sub-tasks on Visual Genome, respectively. Codes are publicly available.
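
Why Mean Recall is the metric reported here can be seen in a small sketch: plain recall rewards a model that always answers with the common predicate, while mean recall averages recall per predicate class and exposes the failure on informative ones. The function name and the toy labels are hypothetical, and the one-prediction-per-pair setup is a simplification of the top-K triplet evaluation used in SGG benchmarks.

```python
from collections import defaultdict


def recall_and_mean_recall(gold, predicted):
    """Return (overall recall, mean per-predicate recall).

    Simplification: one predicted predicate per object pair, whereas SGG
    benchmarks score the top-K ranked triplets per image.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for g, p in zip(gold, predicted):
        totals[g] += 1
        hits[g] += int(g == p)
    recall = sum(hits.values()) / len(gold)
    mean_recall = sum(hits[c] / totals[c] for c in totals) / len(totals)
    return recall, mean_recall


# Toy data: a model that always predicts the common predicate "on".
gold = ["on"] * 8 + ["standing on"] * 2
predicted = ["on"] * 10
recall, mean_recall = recall_and_mean_recall(gold, predicted)
```

On this toy data the always-"on" model scores 0.8 recall but only 0.5 mean recall, because it never recovers "standing on".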

Adversarial Energy Disaggregation for Non-intrusive Load Monitoring

Aug 02, 2021

Zhekai Du, Jingjing Li, Lei Zhu, Ke Lu, Heng Tao Shen

Abstract: Energy disaggregation, also known as non-intrusive load monitoring (NILM), addresses the problem of separating whole-home electricity usage into appliance-specific individual consumptions, which is a typical application of data analysis. NILM aims to help households understand how energy is used and consequently tell them how to manage it effectively, thus enabling energy efficiency, which is considered one of the twin pillars of sustainable energy policy (i.e., energy efficiency and renewable energy). Although NILM is unidentifiable, it is widely believed that the NILM problem can be addressed by data science. Most of the existing approaches address the energy disaggregation problem with conventional techniques such as sparse coding, non-negative matrix factorization, and hidden Markov models. Recent advances reveal that deep neural networks (DNNs) can achieve favorable performance for NILM since DNNs can inherently learn the discriminative signatures of the different appliances. In this paper, we propose a novel method named adversarial energy disaggregation (AED) based on DNNs. We introduce the idea of adversarial learning into NILM, which is new for the energy disaggregation task. Our method trains a generator and multiple discriminators in an adversarial fashion. The proposed method not only learns shared representations for different appliances, but also captures the specific multimode structures of each appliance. Extensive experiments on real-world datasets verify that our method can achieve new state-of-the-art performance.

* Accepted to ACM/IMS Trans. on Data Science; code is available at https://github.com/lijin118/AED

Staircase Sign Method for Boosting Adversarial Attacks

Apr 20, 2021

Lianli Gao, Qilong Zhang, Xiaosu Zhu, Jingkuan Song, Heng Tao Shen

Abstract: Crafting adversarial examples for the transfer-based attack is challenging and remains a research hot spot. Currently, such attack methods are based on the hypothesis that the substitute model and the victim's model learn similar decision boundaries, and they conventionally apply the Sign Method (SM) to manipulate the gradient as the resultant perturbation. Although SM is efficient, it only extracts the sign of gradient units but ignores their value difference, which inevitably leads to a serious deviation. Therefore, we propose a novel Staircase Sign Method (S$^2$M) to alleviate this issue, thus boosting transfer-based attacks. Technically, our method heuristically divides the gradient sign into several segments according to the values of the gradient units, and then assigns each segment a staircase weight for better crafting of the adversarial perturbation. As a result, our adversarial examples perform better in both white-box and black-box manner without being more visible. Since S$^2$M just manipulates the resultant gradient, our method can be generally integrated into any transfer-based attack, and the computational overhead is negligible. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of our proposed methods, which significantly improve the transferability (i.e., on average, 5.1% for normally trained models and 11.2% for adversarially trained defenses). Our code is available at: https://github.com/qilong-zhang/Staircase-sign-method.

* 13 pages, 6 figures
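
The idea in the abstract above, splitting gradient units into segments by magnitude and giving each segment a staircase weight, can be sketched as follows. The abstract does not state how segments are formed or which weights are used, so the equal-size rank segments, the weights `(2k + 1) / K` (chosen here so they average to 1), and the names `sign` and `staircase_sign` are assumptions for illustration; the authors' repository is the reference for the actual method.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def staircase_sign(grad, num_stairs=4):
    """Replace each gradient unit by its sign scaled by a staircase weight.

    Units are ranked by absolute value and split into `num_stairs`
    equal-size segments; larger magnitudes land on higher stairs.
    """
    n = len(grad)
    order = sorted(range(n), key=lambda i: abs(grad[i]))
    out = [0.0] * n
    for rank, idx in enumerate(order):
        stair = rank * num_stairs // n             # segment index 0..K-1
        weight = (2 * stair + 1) / num_stairs      # assumed staircase weight
        out[idx] = sign(grad[idx]) * weight
    return out


# The plain sign method would map this gradient to [1, -1, 1, -1].
perturbation = staircase_sign([0.1, -0.4, 0.2, -0.3], num_stairs=2)
```

With two stairs, the two smallest-magnitude units get weight 0.5 and the two largest get 1.5, so the perturbation keeps some of the value difference that the plain sign discards.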

Patch-wise++ Perturbation for Adversarial Targeted Attacks

Jan 07, 2021

Lianli Gao, Qilong Zhang, Jingkuan Song, Heng Tao Shen

Abstract: Although great progress has been made on adversarial attacks for deep neural networks (DNNs), their transferability is still unsatisfactory, especially for targeted attacks. Two underlying problems have long been overlooked: 1) the conventional setting of $T$ iterations with a step size of $\epsilon/T$ to comply with the $\epsilon$-constraint, under which most pixels are allowed only very small noise, much less than $\epsilon$; and 2) the usual practice of manipulating pixel-wise noise. However, features of a pixel extracted by DNNs are influenced by its surrounding regions, and different DNNs generally focus on different discriminative regions in recognition. To tackle these issues, we propose a patch-wise iterative method (PIM) aimed at crafting adversarial examples with high transferability. Specifically, we introduce an amplification factor to the step size in each iteration, and one pixel's overall gradient overflowing the $\epsilon$-constraint is properly assigned to its surrounding regions by a project kernel. But targeted attacks aim to push the adversarial examples into the territory of a specific class, and the amplification factor may lead to underfitting. Thus, we introduce the temperature and propose a patch-wise++ iterative method (PIM++) to further improve transferability without significantly sacrificing the performance of the white-box attack. Our method can be generally integrated into any gradient-based attack method. Compared with the current state-of-the-art attack methods, we significantly improve the success rate by 35.9% for defense models and 32.7% for normally trained models on average.

* 12 pages, 9 figures. arXiv admin note: text overlap with arXiv:2007.06765
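
The first problem named in this abstract is simple arithmetic: with $T$ iterations of step $\epsilon/T$, a pixel reaches the $\epsilon$ bound only if its gradient sign never flips. The sketch below tracks one pixel's accumulated noise; the sign sequence, the amplification value, and the function name are made up for illustration, and plain clipping stands in for the project kernel that the paper uses to handle overflow.

```python
def accumulated_noise(signs, eps, amplification=1.0):
    """Accumulate one pixel's noise over len(signs) iterations.

    Each iteration adds amplification * eps / T in the direction of the
    gradient sign, then clips to [-eps, eps] (a simplification).
    """
    step = amplification * eps / len(signs)
    total = 0.0
    for s in signs:
        total = max(-eps, min(eps, total + step * s))
    return total


# Hypothetical pixel whose gradient sign flips in 3 of 10 iterations.
signs = [1, 1, -1, 1, -1, 1, 1, -1, 1, 1]
plain = accumulated_noise(signs, eps=10.0)                         # step eps/T
amplified = accumulated_noise(signs, eps=10.0, amplification=2.5)  # larger step
```

With the plain step the pixel ends at 4.0, well short of the budget of 10.0, whereas the amplified step lets it reach the bound.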

Dual ResGCN for Balanced Scene Graph Generation

Nov 09, 2020

Jingyi Zhang, Yong Zhang, Baoyuan Wu, Yanbo Fan, Fumin Shen, Heng Tao Shen

Abstract: Visual scene graph generation is a challenging task. Previous works have achieved great progress, but most of them do not explicitly consider the class imbalance issue in scene graph generation. Models learned without considering the class imbalance tend to predict the majority classes, which leads to good performance on trivial frequent predicates, but poor performance on informative infrequent predicates. However, predicates of minority classes often carry more semantic and precise information (e.g., 'on' vs. 'parked on'). To alleviate the influence of the class imbalance, we propose a novel model, dubbed dual ResGCN, which consists of an object residual graph convolutional network and a relation residual graph convolutional network. The two networks are complementary to each other. The former captures object-level context information, i.e., the connections among objects. We propose a novel ResGCN that enhances object features in a cross attention manner. Besides, we stack multiple contextual coefficients to alleviate the imbalance issue and enrich the prediction diversity. The latter is carefully designed to explicitly capture relation-level context information, i.e., the connections among relations. We propose to incorporate the prior about the co-occurrence of relation pairs into the graph to further help alleviate the class imbalance issue. Extensive evaluations of three tasks are performed on the large-scale database VG to demonstrate the superiority of the proposed method.

Universal Weighting Metric Learning for Cross-Modal Matching

Oct 07, 2020

Jiwei Wei, Xing Xu, Yang Yang, Yanli Ji, Zheng Wang, Heng Tao Shen

Abstract: Cross-modal matching has been a prominent research topic in both the vision and language areas. Learning an appropriate mining strategy to sample and weight informative pairs is crucial for cross-modal matching performance. However, most existing metric learning methods are developed for unimodal matching and are unsuitable for cross-modal matching on multimodal data with heterogeneous features. To address this problem, we propose a simple and interpretable universal weighting framework for cross-modal matching, which provides a tool to analyze the interpretability of various loss functions. Furthermore, we introduce a new polynomial loss under the universal weighting framework, which defines a weight function for the positive and negative informative pairs respectively. Experimental results on two image-text matching benchmarks and two video-text matching benchmarks validate the efficacy of the proposed method.

Patch-wise Attack for Fooling Deep Neural Network

Jul 16, 2020

Lianli Gao, Qilong Zhang, Jingkuan Song, Xianglong Liu, Heng Tao Shen

Abstract: By adding human-imperceptible noise to clean images, the resultant adversarial examples can fool other unknown models. Features of a pixel extracted by deep neural networks (DNNs) are influenced by its surrounding regions, and different DNNs generally focus on different discriminative regions in recognition. Motivated by this, we propose a patch-wise iterative algorithm -- a black-box attack towards mainstream normally trained and defense models, which differs from the existing attack methods manipulating pixel-wise noise. In this way, without sacrificing the performance of white-box attack, our adversarial examples can have strong transferability. Specifically, we introduce an amplification factor to the step size in each iteration, and one pixel's overall gradient overflowing the $\epsilon$-constraint is properly assigned to its surrounding regions by a project kernel. Our method can be generally integrated into any gradient-based attack method. Compared with the current state-of-the-art attacks, we significantly improve the success rate by 9.2% for defense models and 3.7% for normally trained models on average. Our code is available at https://github.com/qilong-zhang/Patch-wise-iterative-attack

* Accepted by ECCV 2020
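
The overflow handling described in this abstract, where noise exceeding the $\epsilon$-constraint is reassigned to surrounding regions by a project kernel, can be pictured with a one-dimensional toy. Everything concrete here is an assumption for illustration: the 1D layout, the uniform kernel with a zero center, the use of the sign of the spread signal, and the names `sign`, `redistribute_overflow`, `gamma`, and `radius`. The authors' repository holds the actual 2D implementation.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def redistribute_overflow(noise, eps, gamma, radius=1):
    """Push noise that exceeds eps onto neighboring positions (1D toy).

    overflow[i] is the part of noise[i] beyond the eps bound. A uniform
    kernel with zero center spreads it to neighbors, each neighbor moves
    by gamma in the direction of the spread, and the result is clipped.
    """
    n = len(noise)
    overflow = [sign(a) * max(abs(a) - eps, 0.0) for a in noise]
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        spread = sum(overflow[j] for j in range(lo, hi) if j != i) / (2 * radius)
        moved = noise[i] + gamma * sign(spread)
        out.append(max(-eps, min(eps, moved)))  # enforce the eps-constraint
    return out


# The middle position overflows eps = 2.0 by 1.0; its neighbors absorb a step.
adjusted = redistribute_overflow([0.0, 3.0, 0.0], eps=2.0, gamma=0.5)
```

In this toy, the overflowing position is clipped back to the bound while both neighbors gain noise in the same direction, which is the patch-wise effect the abstract describes.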

Improving Target-driven Visual Navigation with Attention on 3D Spatial Relationships

Apr 29, 2020

Yunlian Lv, Ning Xie, Yimin Shi, Zijiao Wang, Heng Tao Shen

Abstract: Embodied artificial intelligence (AI) shifts the focus from tasks on internet images to active settings in which embodied agents perceive and act within 3D environments. In this paper, we investigate target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes, where the goal is to train an agent that can intelligently make a series of decisions to arrive at a pre-specified target location from any possible starting position based only on egocentric views. However, most navigation methods currently struggle with several challenging problems, such as data efficiency, automatic obstacle avoidance, and generalization. The generalization problem means that the agent cannot transfer navigation skills learned from previous experience to unseen targets and scenes. To address these issues, we incorporate two designs into the classic DRL framework: attention on a 3D knowledge graph (KG) and a target skill extension (TSE) module. On the one hand, our proposed method combines visual features and 3D spatial representations to learn the navigation policy. On the other hand, the TSE module is used to generate sub-targets that allow the agent to learn from failures. Specifically, our 3D spatial relationships are encoded through the recently popular graph convolutional network (GCN). To reflect real-world settings, our work also considers open actions and adds actionable targets to conventional navigation situations. These more difficult settings are used to test whether the DRL agent really understands its task and its navigation environment and can carry out reasoning. Our experiments, performed in AI2-THOR, show that our model outperforms the baselines in both SR and SPL metrics and improves generalization ability across targets and scenes.

* 12 pages, 9 figures
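
The abstract reports SR and SPL without expanding them. Assuming they carry their usual meaning in embodied navigation, success rate and success weighted by path length, a minimal sketch of both is given below. The episode tuples and the function name are hypothetical, and the numbers are not taken from the paper.

```python
def success_rate_and_spl(episodes):
    """Compute (SR, SPL) from (success, shortest_len, taken_len) tuples.

    SR  = fraction of successful episodes.
    SPL = mean over episodes of success * shortest / max(taken, shortest),
          so a successful but wasteful path earns partial credit.
    Path lengths are assumed to be positive.
    """
    n = len(episodes)
    sr = sum(1 for success, _, _ in episodes if success) / n
    spl = sum(
        shortest / max(taken, shortest)
        for success, shortest, taken in episodes
        if success
    ) / n
    return sr, spl


# Hypothetical episodes: one detour, one optimal path, one failure.
episodes = [(True, 5.0, 10.0), (True, 4.0, 4.0), (False, 3.0, 6.0)]
sr, spl = success_rate_and_spl(episodes)
```

Here two of three episodes succeed, but the detour in the first one halves its contribution, so SPL comes out lower than SR.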

Cooperative Cross-Stream Network for Discriminative Action Representation

Aug 27, 2019

Jingran Zhang, Fumin Shen, Xing Xu, Heng Tao Shen

Abstract: Spatial and temporal stream models have achieved great success in video action recognition. Most existing works focus on designing effective feature fusion methods and train the two streams separately. However, this makes it hard to ensure discriminability and to exploit complementary information between the streams. In this work, we propose a novel cooperative cross-stream network that investigates the conjoint information in multiple different modalities. Feature extraction for the spatial and temporal stream networks is performed jointly in an end-to-end learning manner. The network extracts the complementary information of the different modalities through a connection block, which aims at exploring correlations between the features of the different streams. Furthermore, unlike the conventional ConvNet that learns deep separable features with only a cross-entropy loss, our proposed model enhances the discriminative power of the deeply learned features and reduces the undesired modality discrepancy by jointly optimizing a modality ranking constraint and a cross-entropy loss for both homogeneous and heterogeneous modalities. The modality ranking constraint consists of an intra-modality discriminative embedding and an inter-modality triplet constraint, and it reduces both the intra-modality and cross-modality feature variations. Experiments on three benchmark datasets demonstrate that by making appearance and motion feature extraction cooperate, our method achieves state-of-the-art or competitive performance compared with existing results.

* 10 pages, 6 figures
