Abstract:Adversarial attacks make their success in \enquote{fooling} DNNs and among them, gradient-based algorithms become one of the mainstreams. Based on the linearity hypothesis~\cite{fgsm}, under $\ell_\infty$ constraint, $sign$ operation applied to the gradients is a good choice for generating perturbations. However, the side-effect from such operation exists since it leads to the bias of direction between the real gradients and the perturbations. In other words, current methods contain a gap between real gradients and actual noises, which leads to biased and inefficient attacks. Therefore in this paper, based on the Taylor expansion, the bias is analyzed theoretically and the correction of $\sign$, \ie, Fast Gradient Non-sign Method (FGNM), is further proposed. Notably, FGNM is a general routine, which can seamlessly replace the conventional $sign$ operation in gradient-based attacks with negligible extra computational cost. Extensive experiments demonstrate the effectiveness of our methods. Specifically, ours outperform them by \textbf{27.5\%} at most and \textbf{9.5\%} on average. Our anonymous code is publicly available: \url{https://git.io/mm-fgnm}.
Abstract:The scene graph generation (SGG) task aims to detect visual relationship triplets, i.e., subject, predicate, object, in an image, providing a structural vision layout for scene understanding. However, current models are stuck in common predicates, e.g., "on" and "at", rather than informative ones, e.g., "standing on" and "looking at", resulting in the loss of precise information and overall performance. If a model only uses "stone on road" rather than "blocking" to describe an image, it is easy to misunderstand the scene. We argue that this phenomenon is caused by two key imbalances between informative predicates and common ones, i.e., semantic space level imbalance and training sample level imbalance. To tackle this problem, we propose BA-SGG, a simple yet effective SGG framework based on balance adjustment but not the conventional distribution fitting. It integrates two components: Semantic Adjustment (SA) and Balanced Predicate Learning (BPL), respectively for adjusting these imbalances. Benefited from the model-agnostic process, our method is easily applied to the state-of-the-art SGG models and significantly improves the SGG performance. Our method achieves 14.3%, 8.0%, and 6.1% higher Mean Recall (mR) than that of the Transformer model at three scene graph generation sub-tasks on Visual Genome, respectively. Codes are publicly available.
Abstract:Abundant real-world data can be naturally represented by large-scale networks, which demands efficient and effective learning algorithms. At the same time, labels may only be available for some networks, which demands these algorithms to be able to adapt to unlabeled networks. Domain-adaptive hash learning has enjoyed considerable success in the computer vision community in many practical tasks due to its lower cost in both retrieval time and storage footprint. However, it has not been applied to multiple-domain networks. In this work, we bridge this gap by developing an unsupervised domain-adaptive hash learning method for networks, dubbed UDAH. Specifically, we develop four {task-specific yet correlated} components: (1) network structure preservation via a hard groupwise contrastive loss, (2) relaxation-free supervised hashing, (3) cross-domain intersected discriminators, and (4) semantic center alignment. We conduct a wide range of experiments to evaluate the effectiveness and efficiency of our method on a range of tasks including link prediction, node classification, and neighbor recommendation. Our evaluation results demonstrate that our model achieves better performance than the state-of-the-art conventional discrete embedding methods over all the tasks.
Abstract:Learning accurate low-dimensional embeddings for a network is a crucial task as it facilitates many downstream network analytics tasks. For large networks, the trained embeddings often require a significant amount of space to store, making storage and processing a challenge. Building on our previous work on semi-supervised network embedding, we develop d-SNEQ, a differentiable DNN-based quantisation method for network embedding. d-SNEQ incorporates a rank loss to equip the learned quantisation codes with rich high-order information and is able to substantially compress the size of trained embeddings, thus reducing storage footprint and accelerating retrieval speed. We also propose a new evaluation metric, path prediction, to fairly and more directly evaluate model performance on the preservation of high-order information. Our evaluation on four real-world networks of diverse characteristics shows that d-SNEQ outperforms a number of state-of-the-art embedding methods in link prediction, path prediction, node classification, and node recommendation while being far more space- and time-efficient.
Abstract:Scene graphs provide valuable information to many downstream tasks. Many scene graph generation (SGG) models solely use the limited annotated relation triples for training, leading to their underperformance on low-shot (few and zero) scenarios, especially on the rare predicates. To address this problem, we propose a novel semantic compositional learning strategy that makes it possible to construct additional, realistic relation triples with objects from different images. Specifically, our strategy decomposes a relation triple by identifying and removing the unessential component and composes a new relation triple by fusing with a semantically or visually similar object from a visual components dictionary, whilst ensuring the realisticity of the newly composed triple. Notably, our strategy is generic and can be combined with existing SGG models to significantly improve their performance. We performed a comprehensive evaluation on the benchmark dataset Visual Genome. For three recent SGG models, adding our strategy improves their performance by close to 50\%, and all of them substantially exceed the current state-of-the-art.
Abstract:Human-Object Interaction (HOI) detection is a fundamental visual task aiming at localizing and recognizing interactions between humans and objects. Existing works focus on the visual and linguistic features of humans and objects. However, they do not capitalise on the high-level and semantic relationships present in the image, which provides crucial contextual and detailed relational knowledge for HOI inference. We propose a novel method to exploit this information, through the scene graph, for the Human-Object Interaction (SG2HOI) detection task. Our method, SG2HOI, incorporates the SG information in two ways: (1) we embed a scene graph into a global context clue, serving as the scene-specific environmental context; and (2) we build a relation-aware message-passing module to gather relationships from objects' neighborhood and transfer them into interactions. Empirical evaluation shows that our SG2HOI method outperforms the state-of-the-art methods on two benchmark HOI datasets: V-COCO and HICO-DET. Code will be available at https://github.com/ht014/SG2HOI.
Abstract:By adding human-imperceptible perturbations to images, DNNs can be easily fooled. As one of the mainstream methods, feature space targeted attacks perturb images by modulating their intermediate feature maps, for the discrepancy between the intermediate source and target features is minimized. However, the current choice of pixel-wise Euclidean Distance to measure the discrepancy is questionable because it unreasonably imposes a spatial-consistency constraint on the source and target features. Intuitively, an image can be categorized as "cat" no matter the cat is on the left or right of the image. To address this issue, we propose to measure this discrepancy using statistic alignment. Specifically, we design two novel approaches called Pair-wise Alignment Attack and Global-wise Alignment Attack, which attempt to measure similarities between feature maps by high-order statistics with translation invariance. Furthermore, we systematically analyze the layer-wise transferability with varied difficulties to obtain highly reliable attacks. Extensive experiments verify the effectiveness of our proposed method, and it outperforms the state-of-the-art algorithms by a large margin. Our code is publicly available at https://github.com/yaya-cheng/PAA-GAA.
Abstract:Crafting adversarial examples for the transfer-based attack is challenging and remains a research hot spot. Currently, such attack methods are based on the hypothesis that the substitute model and the victim's model learn similar decision boundaries, and they conventionally apply Sign Method (SM) to manipulate the gradient as the resultant perturbation. Although SM is efficient, it only extracts the sign of gradient units but ignores their value difference, which inevitably leads to a serious deviation. Therefore, we propose a novel Staircase Sign Method (S$^2$M) to alleviate this issue, thus boosting transfer-based attacks. Technically, our method heuristically divides the gradient sign into several segments according to the values of the gradient units, and then assigns each segment with a staircase weight for better crafting adversarial perturbation. As a result, our adversarial examples perform better in both white-box and black-box manner without being more visible. Since S$^2$M just manipulates the resultant gradient, our method can be generally integrated into any transfer-based attacks, and the computational overhead is negligible. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of our proposed methods, which significantly improve the transferability (i.e., on average, \textbf{5.1\%} for normally trained models and \textbf{11.2\%} for adversarially trained defenses). Our code is available at: \url{https://github.com/qilong-zhang/Staircase-sign-method}.
Abstract:Although great progress has been made on adversarial attacks for deep neural networks (DNNs), their transferability is still unsatisfactory, especially for targeted attacks. There are two problems behind that have been long overlooked: 1) the conventional setting of $T$ iterations with the step size of $\epsilon/T$ to comply with the $\epsilon$-constraint. In this case, most of the pixels are allowed to add very small noise, much less than $\epsilon$; and 2) usually manipulating pixel-wise noise. However, features of a pixel extracted by DNNs are influenced by its surrounding regions, and different DNNs generally focus on different discriminative regions in recognition. To tackle these issues, we propose a patch-wise iterative method (PIM) aimed at crafting adversarial examples with high transferability. Specifically, we introduce an amplification factor to the step size in each iteration, and one pixel's overall gradient overflowing the $\epsilon$-constraint is properly assigned to its surrounding regions by a project kernel. But targeted attacks aim to push the adversarial examples into the territory of a specific class, and the amplification factor may lead to underfitting. Thus, we introduce the temperature and propose a patch-wise++ iterative method (PIM++) to further improve transferability without significantly sacrificing the performance of the white-box attack. Our method can be generally integrated to any gradient-based attack method. Compared with the current state-of-the-art attack methods, we significantly improve the success rate by 35.9\% for defense models and 32.7\% for normally trained models on average.
Abstract:By adding human-imperceptible noise to clean images, the resultant adversarial examples can fool other unknown models. Features of a pixel extracted by deep neural networks (DNNs) are influenced by its surrounding regions, and different DNNs generally focus on different discriminative regions in recognition. Motivated by this, we propose a patch-wise iterative algorithm -- a black-box attack towards mainstream normally trained and defense models, which differs from the existing attack methods manipulating pixel-wise noise. In this way, without sacrificing the performance of white-box attack, our adversarial examples can have strong transferability. Specifically, we introduce an amplification factor to the step size in each iteration, and one pixel's overall gradient overflowing the $\epsilon$-constraint is properly assigned to its surrounding regions by a project kernel. Our method can be generally integrated to any gradient-based attack methods. Compared with the current state-of-the-art attacks, we significantly improve the success rate by 9.2\% for defense models and 3.7\% for normally trained models on average. Our code is available at \url{https://github.com/qilong-zhang/Patch-wise-iterative-attack}