Jingkuan Song

Practical No-box Adversarial Attacks with Training-free Hybrid Image Transformation

Mar 09, 2022

Qilong Zhang, Chaoning Zhang, Chaoqun Li, Jingkuan Song, Lianli Gao, Heng Tao Shen

Abstract: In recent years, the adversarial vulnerability of deep neural networks (DNNs) has attracted increasing attention. Among all the threat models, no-box attacks are the most practical but extremely challenging, since they neither rely on any knowledge of the target model or a similar substitute model, nor access the dataset for training a new substitute model. Although a recent method has attempted such an attack in a loose sense, its performance is not good enough and the computational overhead of training is expensive. In this paper, we move a step forward and show the existence of a training-free adversarial perturbation under the no-box threat model, which can be successfully used to attack different DNNs in real time. Motivated by our observation that the high-frequency component (HFC) dominates low-level features and plays a crucial role in classification, we attack an image mainly by manipulating its frequency components. Specifically, the perturbation is crafted by suppressing the original HFC and adding noisy HFC. We empirically and experimentally analyze the requirements of effective noisy HFC and show that it should be regionally homogeneous, repeating and dense. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of our proposed no-box method. It attacks ten well-known models with a success rate of 98.13% on average, which outperforms state-of-the-art no-box attacks by 29.39%. Furthermore, our method is even competitive with mainstream transfer-based black-box attacks.

* This is the revision (the previous version was rated 8, 8, 5, 4 at ICLR 2022, where 8 denotes "accept, good paper"), which has been further polished and extended with many new experiments
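The abstract describes the perturbation as suppressing the original HFC and adding noisy HFC that is regionally homogeneous, repeating and dense. A minimal sketch of that idea follows. It is not the authors' implementation: the box-blur low-pass filter, the checkerboard pattern, the budget `eps = 16`, and the function names `box_blur`, `tiled_noise` and `hybrid_transform` are all assumptions chosen for illustration, and the input is a random stand-in for a grayscale image.

```python
import numpy as np

def box_blur(img: np.ndarray, k: int = 5) -> np.ndarray:
    """Low-pass filter (assumed choice): mean over a k x k window, edge-padded."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def tiled_noise(shape, tile: int = 8, eps: float = 16.0) -> np.ndarray:
    """Dense, repeating pattern: constant +eps / -eps blocks of size tile x tile."""
    ys, xs = np.indices(shape)
    checker = ((ys // tile + xs // tile) % 2) * 2 - 1
    return eps * checker.astype(np.float64)

def hybrid_transform(img, eps: float = 16.0, k: int = 5, tile: int = 8):
    low = box_blur(img, k)                                # suppress the original HFC
    candidate = low + tiled_noise(img.shape, tile, eps)   # add noisy HFC
    delta = np.clip(candidate - img, -eps, eps)           # stay inside the L_inf budget
    return np.clip(img + delta, 0.0, 255.0)               # keep valid pixel values

rng = np.random.default_rng(0)
image = rng.uniform(0, 255, size=(32, 32))  # hypothetical grayscale image
adv = hybrid_transform(image)
```

No model, gradient or training data appears anywhere in the sketch, which is the sense in which such a perturbation is training-free.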

One-shot Scene Graph Generation

Feb 26, 2022

Yuyu Guo, Jingkuan Song, Lianli Gao, Heng Tao Shen

Abstract: As a structured representation of the image content, the visual scene graph (visual relationship) acts as a bridge between computer vision and natural language processing. Existing models for the scene graph generation task notoriously require tens or hundreds of labeled samples. By contrast, human beings can learn visual relationships from a few or even one example. Inspired by this, we design a task named One-Shot Scene Graph Generation, where each relationship triplet (e.g., "dog-has-head") comes from only one labeled example. The key insight is that rather than learning from scratch, one can utilize rich prior knowledge. In this paper, we propose Multiple Structured Knowledge (Relational Knowledge and Commonsense Knowledge) for the one-shot scene graph generation task. Specifically, the Relational Knowledge represents the prior knowledge of relationships between entities extracted from the visual content, e.g., the visual relationships "standing in", "sitting in", and "lying in" may exist between "dog" and "yard", while the Commonsense Knowledge encodes "sense-making" knowledge like "dog can guard yard". By organizing these two kinds of knowledge in a graph structure, Graph Convolution Networks (GCNs) are used to extract knowledge-embedded semantic features of the entities. Besides, instead of extracting isolated visual features from each entity generated by Faster R-CNN, we utilize an Instance Relation Transformer encoder to fully explore their context information. Based on a constructed one-shot dataset, the experimental results show that our method outperforms existing state-of-the-art methods by a large margin. Ablation studies also verify the effectiveness of the Instance Relation Transformer encoder and the Multiple Structured Knowledge.

Relation Regularized Scene Graph Generation

Feb 22, 2022

Yuyu Guo, Lianli Gao, Jingkuan Song, Peng Wang, Nicu Sebe, Heng Tao Shen, Xuelong Li

Abstract: Scene graph generation (SGG) is built on top of detected objects to predict pairwise visual relations between objects, describing an abstraction of the image content. Existing works have revealed that if the links between objects are given as prior knowledge, the performance of SGG is significantly improved. Inspired by this observation, in this article, we propose a relation regularized network (R2-Net), which can predict whether there is a relationship between two objects and use this relation to refine object features and improve SGG. Specifically, we first construct an affinity matrix among detected objects to represent the probability of a relationship between two objects. Graph convolution networks (GCNs) over this relation affinity matrix are then used as object encoders, producing relation-regularized representations of objects. With these relation-regularized features, our R2-Net can effectively refine object labels and generate scene graphs. Extensive experiments are conducted on the Visual Genome dataset for three SGG tasks (i.e., predicate classification, scene graph classification, and scene graph detection), demonstrating the effectiveness of our proposed method. Ablation studies also verify the key roles of our proposed components in performance improvement.
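The step of running a GCN over a relation affinity matrix can be sketched as a single graph-convolution layer. This is an illustrative sketch only, not the R2-Net architecture: the row normalisation with self-loops, the ReLU, the random features, and the name `gcn_layer` are assumptions, and the affinity values are random placeholders for the predicted relationship probabilities.

```python
import numpy as np

def gcn_layer(features, affinity, weight):
    """One graph-convolution step: ReLU(D^-1 (A + I) X W)."""
    a_hat = affinity + np.eye(affinity.shape[0])        # add self-loops
    a_norm = a_hat / a_hat.sum(axis=1, keepdims=True)   # row-normalise (assumed scheme)
    return np.maximum(a_norm @ features @ weight, 0.0)  # aggregate, project, ReLU

rng = np.random.default_rng(0)
num_objects, dim_in, dim_out = 4, 8, 6
object_feats = rng.normal(size=(num_objects, dim_in))   # stand-in detector features
# affinity[i, j]: placeholder probability that objects i and j are related
affinity = rng.uniform(size=(num_objects, num_objects))
weight = rng.normal(size=(dim_in, dim_out))

refined = gcn_layer(object_feats, affinity, weight)     # relation-regularized features
```

Each output row mixes an object's own features with those of the objects it is likely related to, weighted by affinity, which is how the relation estimate regularizes the object representation.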

Beyond ImageNet Attack: Towards Crafting Adversarial Examples for Black-box Domains

Feb 10, 2022

Qilong Zhang, Xiaodan Li, Yuefeng Chen, Jingkuan Song, Lianli Gao, Yuan He, Hui Xue

Abstract: Adversarial examples have posed a severe threat to deep neural networks due to their transferable nature. Currently, various works have made great efforts to enhance cross-model transferability, most of which assume the substitute model is trained in the same domain as the target model. However, in reality, the relevant information of the deployed model is unlikely to leak. Hence, it is vital to build a more practical black-box threat model to overcome this limitation and evaluate the vulnerability of deployed models. In this paper, with only the knowledge of the ImageNet domain, we propose a Beyond ImageNet Attack (BIA) to investigate the transferability towards black-box domains (unknown classification tasks). Specifically, we leverage a generative model to learn the adversarial function for disrupting low-level features of input images. Based on this framework, we further propose two variants to narrow the gap between the source and target domains from the data and model perspectives, respectively. Extensive experiments on coarse-grained and fine-grained domains demonstrate the effectiveness of our proposed methods. Notably, our methods outperform state-of-the-art approaches by up to 7.71% (towards coarse-grained domains) and 25.91% (towards fine-grained domains) on average. Our code is available at https://github.com/qilong-zhang/Beyond-ImageNet-Attack.

* Accepted by ICLR 2022

Technical Report: Disentangled Action Parsing Networks for Accurate Part-level Action Parsing

Nov 05, 2021

Xuanhan Wang, Xiaojia Chen, Lianli Gao, Lechao Chen, Jingkuan Song

Abstract: Part-level Action Parsing aims at part state parsing for boosting action recognition in videos. Despite dramatic progress in video classification research, a severe problem faced by the community is that the detailed understanding of human actions is ignored. Our motivation is that parsing human actions requires models that focus on this specific problem. We present a simple yet effective approach, named disentangled action parsing (DAP). Specifically, we divide part-level action parsing into three stages: 1) Person detection, where a person detector is adopted to detect all persons in videos and to perform instance-level action recognition; 2) Part parsing, where a part-parsing model is proposed to recognize human parts from detected person images; and 3) Action parsing, where a multi-modal action parsing network is used to parse the action category conditioned on all detection results obtained from the previous stages. With these three major models applied, our DAP approach records a global mean score of 0.605 in the 2021 Kinetics-TPS Challenge.

Fast Gradient Non-sign Methods

Oct 25, 2021

Yaya Cheng, Xiaosu Zhu, Qilong Zhang, Lianli Gao, Jingkuan Song

Abstract: Adversarial attacks have succeeded in "fooling" DNNs, and among them, gradient-based algorithms have become one of the mainstream approaches. Based on the linearity hypothesis behind FGSM, under the $\ell_\infty$ constraint, applying the sign operation to the gradients is a good choice for generating perturbations. However, this operation has a side effect, since it introduces a directional bias between the real gradients and the perturbations. In other words, current methods contain a gap between real gradients and actual noises, which leads to biased and inefficient attacks. Therefore, in this paper, based on the Taylor expansion, the bias is analyzed theoretically and a correction of sign, i.e., the Fast Gradient Non-sign Method (FGNM), is further proposed. Notably, FGNM is a general routine, which can seamlessly replace the conventional sign operation in gradient-based attacks with negligible extra computational cost. Extensive experiments demonstrate the effectiveness of our methods. Specifically, our methods outperform the compared methods by up to 27.5% and by 9.5% on average. Our anonymous code is publicly available: https://git.io/mm-fgnm.
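The directional bias of the sign operation can be seen numerically by comparing the cosine similarity between a gradient and the perturbation built from it. The sketch below is illustrative: the abstract does not state the FGNM formula, so the non-sign step here (rescaling the raw gradient to the L2 norm of its sign vector, then clipping to the budget) is an assumption, as are the names `sign_step`, `non_sign_step` and `cosine`, the random gradient, and `eps = 8/255`.

```python
import numpy as np

def sign_step(grad, eps):
    """Conventional FGSM-style step: every coordinate moves by +/- eps."""
    return eps * np.sign(grad)

def non_sign_step(grad, eps):
    """Assumed non-sign step: keep the gradient direction, match the L2 norm
    of sign(grad), then clip so the L_inf budget still holds."""
    scale = np.linalg.norm(np.sign(grad)) / (np.linalg.norm(grad) + 1e-12)
    return np.clip(eps * scale * grad, -eps, eps)

def cosine(a, b):
    """Cosine similarity between two arrays, flattened."""
    return float(a.ravel() @ b.ravel() / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
grad = rng.normal(size=(3, 8, 8))  # stand-in for a loss gradient w.r.t. an image
eps = 8 / 255

cos_sign = cosine(grad, sign_step(grad, eps))
cos_non_sign = cosine(grad, non_sign_step(grad, eps))
```

For this random gradient, `cos_non_sign` is closer to 1 than `cos_sign`, meaning the non-sign perturbation points more nearly along the true gradient while both steps respect the same $\ell_\infty$ bound.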

From General to Specific: Informative Scene Graph Generation via Balance Adjustment

Aug 30, 2021

Yuyu Guo, Lianli Gao, Xuanhan Wang, Yuxuan Hu, Xing Xu, Xu Lu, Heng Tao Shen, Jingkuan Song

Abstract: The scene graph generation (SGG) task aims to detect visual relationship triplets, i.e., subject, predicate, object, in an image, providing a structural vision layout for scene understanding. However, current models are stuck in common predicates, e.g., "on" and "at", rather than informative ones, e.g., "standing on" and "looking at", resulting in the loss of precise information and overall performance. If a model only uses "stone on road" rather than "blocking" to describe an image, it is easy to misunderstand the scene. We argue that this phenomenon is caused by two key imbalances between informative predicates and common ones, i.e., semantic space level imbalance and training sample level imbalance. To tackle this problem, we propose BA-SGG, a simple yet effective SGG framework based on balance adjustment rather than conventional distribution fitting. It integrates two components, Semantic Adjustment (SA) and Balanced Predicate Learning (BPL), which adjust these two imbalances respectively. Benefiting from the model-agnostic process, our method is easily applied to state-of-the-art SGG models and significantly improves SGG performance. Our method achieves 14.3%, 8.0%, and 6.1% higher Mean Recall (mR) than the Transformer model on three scene graph generation sub-tasks on Visual Genome, respectively. Codes are publicly available.
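Why Mean Recall (mR) is the metric reported here can be illustrated with a small worked example: plain recall is dominated by frequent predicates, while mR averages the recall of each predicate class. The sketch below is a simplification (benchmark mR@K is computed over top-K predictions per image), and the toy labels and the function name `recall_and_mean_recall` are hypothetical.

```python
import numpy as np

def recall_and_mean_recall(gt_predicates, hit):
    """Overall recall vs. mean of per-predicate recalls."""
    gt = np.asarray(gt_predicates)
    hit = np.asarray(hit, dtype=bool)
    overall = hit.mean()
    per_class = [hit[gt == p].mean() for p in np.unique(gt)]
    return float(overall), float(np.mean(per_class))

# Toy data: a model that gets every common "on" right
# and misses every informative "standing on".
gt = ["on"] * 8 + ["standing on"] * 2
hit = [True] * 8 + [False] * 2

recall, mean_recall = recall_and_mean_recall(gt, hit)
# recall is 0.8, mean_recall is 0.5
```

The gap between 0.8 and 0.5 is the kind of imbalance the abstract refers to: a model stuck in common predicates can look strong on recall while performing poorly on informative ones.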

Unsupervised Domain-adaptive Hash for Networks

Aug 20, 2021

Tao He, Lianli Gao, Jingkuan Song, Yuan-Fang Li

Abstract: Abundant real-world data can be naturally represented by large-scale networks, which demands efficient and effective learning algorithms. At the same time, labels may only be available for some networks, which requires these algorithms to adapt to unlabeled networks. Domain-adaptive hash learning has enjoyed considerable success in the computer vision community in many practical tasks due to its lower cost in both retrieval time and storage footprint. However, it has not been applied to multiple-domain networks. In this work, we bridge this gap by developing an unsupervised domain-adaptive hash learning method for networks, dubbed UDAH. Specifically, we develop four task-specific yet correlated components: (1) network structure preservation via a hard groupwise contrastive loss, (2) relaxation-free supervised hashing, (3) cross-domain intersected discriminators, and (4) semantic center alignment. We conduct a wide range of experiments to evaluate the effectiveness and efficiency of our method on a range of tasks including link prediction, node classification, and neighbor recommendation. Our evaluation results demonstrate that our model achieves better performance than the state-of-the-art conventional discrete embedding methods over all the tasks.
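The retrieval-time and storage advantage of hash codes mentioned in the abstract can be illustrated with a generic binary-code example. This is not UDAH itself: thresholding at zero stands in for a learned hash function, the embeddings are random, and the names `to_hash_codes` and `hamming_rank` are hypothetical.

```python
import numpy as np

def to_hash_codes(embeddings):
    """Binarise real-valued embeddings (placeholder for a learned hash function)."""
    return (embeddings > 0).astype(np.uint8)

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query code."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists, kind="stable"), dists

rng = np.random.default_rng(0)
node_embeddings = rng.normal(size=(100, 64))   # stand-in node embeddings
codes = to_hash_codes(node_embeddings)

order, dists = hamming_rank(codes[0], codes)   # nearest neighbours of node 0
packed = np.packbits(codes, axis=1)            # 64 bits -> 8 bytes per node
```

Each node needs 8 bytes in packed form instead of 512 bytes as 64 float64 values, and ranking only requires counting differing bits.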

Semi-supervised Network Embedding with Differentiable Deep Quantisation

Aug 20, 2021

Tao He, Lianli Gao, Jingkuan Song, Yuan-Fang Li

Abstract: Learning accurate low-dimensional embeddings for a network is a crucial task as it facilitates many downstream network analytics tasks. For large networks, the trained embeddings often require a significant amount of space to store, making storage and processing a challenge. Building on our previous work on semi-supervised network embedding, we develop d-SNEQ, a differentiable DNN-based quantisation method for network embedding. d-SNEQ incorporates a rank loss to equip the learned quantisation codes with rich high-order information and is able to substantially compress the size of trained embeddings, thus reducing storage footprint and accelerating retrieval speed. We also propose a new evaluation metric, path prediction, to fairly and more directly evaluate model performance on the preservation of high-order information. Our evaluation on four real-world networks of diverse characteristics shows that d-SNEQ outperforms a number of state-of-the-art embedding methods in link prediction, path prediction, node classification, and node recommendation while being far more space- and time-efficient.
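Quantisation compresses embeddings by storing a codeword index instead of the full vector, and it can be made differentiable by relaxing the hard nearest-codeword choice. The sketch below shows one common relaxation, a softmax over negative squared distances. It is an assumption for illustration and not the d-SNEQ design: the random codebook, the temperature, and the name `quantise` are all hypothetical.

```python
import numpy as np

def quantise(embeddings, codebook, temperature=0.1):
    """Return hard codeword indices and a soft (differentiable) assignment."""
    # Squared Euclidean distance from every embedding to every codeword: (n, k)
    d2 = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    hard_idx = d2.argmin(axis=1)                     # what gets stored
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    soft = np.exp(logits)
    soft /= soft.sum(axis=1, keepdims=True)          # softmax over codewords
    return hard_idx, soft

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 16))   # stand-in node embeddings
codebook = rng.normal(size=(8, 16))      # 8 codewords -> 3 bits per node

hard_idx, soft = quantise(embeddings, codebook)
reconstruction = soft @ codebook         # soft reconstruction usable in training
```

At storage time only `hard_idx` and the codebook are kept, while the soft assignment is the kind of quantity a gradient-based method could optimise.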

Semantic Compositional Learning for Low-shot Scene Graph Generation

Aug 19, 2021

Tao He, Lianli Gao, Jingkuan Song, Jianfei Cai, Yuan-Fang Li

Abstract: Scene graphs provide valuable information to many downstream tasks. Many scene graph generation (SGG) models solely use the limited annotated relation triples for training, leading to their underperformance in low-shot (few- and zero-shot) scenarios, especially on rare predicates. To address this problem, we propose a novel semantic compositional learning strategy that makes it possible to construct additional, realistic relation triples with objects from different images. Specifically, our strategy decomposes a relation triple by identifying and removing the unessential component and composes a new relation triple by fusing with a semantically or visually similar object from a visual components dictionary, whilst ensuring the realism of the newly composed triple. Notably, our strategy is generic and can be combined with existing SGG models to significantly improve their performance. We performed a comprehensive evaluation on the benchmark dataset Visual Genome. For three recent SGG models, adding our strategy improves their performance by close to 50%, and all of them substantially exceed the current state-of-the-art.
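The compose step, replacing one component of a triple with a similar object from a dictionary, can be sketched as a nearest-neighbour lookup. This is a toy illustration, not the paper's procedure: the three-dimensional vectors are hand-made placeholders, choosing the subject as the replaced component is arbitrary, the realism check is omitted, and `most_similar` is a hypothetical helper.

```python
import numpy as np

def most_similar(name, dictionary):
    """Return the dictionary entry (other than `name`) with highest cosine similarity."""
    query = dictionary[name]
    best, best_sim = None, -np.inf
    for other, vec in dictionary.items():
        if other == name:
            continue
        sim = query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = other, sim
    return best

# Hypothetical component dictionary with hand-made feature vectors
components = {
    "dog": np.array([1.0, 0.9, 0.0]),
    "cat": np.array([0.9, 1.0, 0.1]),
    "car": np.array([0.0, 0.1, 1.0]),
}

triple = ("dog", "sitting on", "sofa")
new_triple = (most_similar(triple[0], components),) + triple[1:]
# new_triple is ("cat", "sitting on", "sofa")
```

A composed triple like this adds a training example for the predicate without requiring a new annotation.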
