Shaojie Wang

A bioinspired three-stage model for camouflaged object detection

May 22, 2023
Tianyou Chen, Jin Xiao, Xiaoguang Hu, Guofeng Zhang, Shaojie Wang

Camouflaged objects are typically assimilated into their backgrounds and exhibit fuzzy boundaries. The complex environmental conditions and the high intrinsic similarity between camouflaged targets and their surroundings pose significant challenges in accurately locating and segmenting these objects in their entirety. While existing methods have demonstrated remarkable performance in various real-world scenarios, they still face limitations in difficult cases such as small targets, thin structures, and indistinct boundaries. Drawing inspiration from human visual perception when observing images containing camouflaged objects, we propose a three-stage model that enables coarse-to-fine segmentation in a single iteration. Specifically, our model employs three decoders to sequentially process subsampled features, cropped features, and high-resolution original features. The proposed approach not only reduces computational overhead but also mitigates interference caused by background noise. Furthermore, considering the significance of multi-scale information, we design a multi-scale feature enhancement module that enlarges the receptive field while preserving detailed structural cues. Additionally, a boundary enhancement module is developed to improve performance by leveraging boundary information. Finally, a mask-guided fusion module is proposed to generate fine-grained results by integrating coarse prediction maps with high-resolution feature maps. Our network surpasses state-of-the-art CNN-based counterparts without unnecessary complexity. Upon acceptance of the paper, the source code will be made publicly available at https://github.com/clelouch/BTSNet.
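
To make the three-stage pipeline concrete, below is a minimal PyTorch-style sketch of one coarse-to-fine forward pass. The decoder modules (dec_coarse, dec_refine, dec_fuse), the box-cropping rule, and the single-feature-map interface are hypothetical stand-ins for illustration only; they are not the released BTSNet code.

    import torch
    import torch.nn.functional as F

    def three_stage_forward(backbone, dec_coarse, dec_refine, dec_fuse, image):
        """Hypothetical coarse-to-fine pass over one feature map (batch size 1 assumed)."""
        feat = backbone(image)                                # high-resolution feature map

        # Stage 1: locate the object on subsampled features (cheap, coarse mask).
        coarse = torch.sigmoid(dec_coarse(F.avg_pool2d(feat, 2)))
        coarse = F.interpolate(coarse, size=feat.shape[-2:], mode='bilinear',
                               align_corners=False)

        # Stage 2: refine only inside a box around the coarse prediction, which
        # suppresses background noise and limits computation.
        ys, xs = torch.nonzero(coarse[0, 0] > 0.5, as_tuple=True)
        refined = coarse.clone()
        if ys.numel() > 0:
            y0, y1 = ys.min().item(), ys.max().item() + 1
            x0, x1 = xs.min().item(), xs.max().item() + 1
            crop = torch.sigmoid(dec_refine(feat[..., y0:y1, x0:x1]))
            refined[..., y0:y1, x0:x1] = crop

        # Stage 3: mask-guided fusion of the coarse prediction with full-resolution features.
        fine = torch.sigmoid(dec_fuse(feat, refined))
        return coarse, refined, fine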

PROVES: Establishing Image Provenance using Semantic Signatures

Oct 21, 2021
Mingyang Xie, Manav Kulshrestha, Shaojie Wang, Jinghan Yang, Ayan Chakrabarti, Ning Zhang, Yevgeniy Vorobeychik

Modern AI tools, such as generative adversarial networks, have transformed our ability to create and modify visual data with photorealistic results. However, one deleterious side effect of these advances is the emergence of nefarious uses that manipulate the information in visual data, such as deep fakes. We propose a novel architecture for preserving the provenance of semantic information in images to make them less susceptible to deep fake attacks. Our architecture includes semantic signing and verification steps. We apply this architecture to verifying two types of semantic information: individual identities (faces) and whether the photo was taken indoors or outdoors. Verification accounts for a collection of common image transformations, such as translation, scaling, cropping, and small rotations, and rejects adversarial transformations, such as adversarially perturbed or, in the case of face verification, swapped faces. Experiments demonstrate that for the provenance of faces in an image, our approach is robust to black-box adversarial transformations (which are rejected) as well as benign transformations (which are accepted), with few false negatives and false positives. Background verification, on the other hand, is susceptible to black-box adversarial examples, but becomes significantly more robust after adversarial training.
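
The sign-then-verify idea can be sketched in a few lines of Python. In this toy version an HMAC stands in for whatever signature scheme the paper actually uses, and the face-embedding tolerance and payload format are assumptions; the point is only that the signature binds the semantic content, while verification tolerates benign transforms and rejects semantic changes.

    import hmac, hashlib, json
    import numpy as np

    def sign_semantics(face_embeddings, scene_label, key: bytes):
        """Hypothetical signing step: bind a signature to the image's semantic content."""
        payload = {
            "faces": [np.round(e, 3).tolist() for e in face_embeddings],  # identity descriptors
            "scene": scene_label,                                          # "indoor" / "outdoor"
        }
        blob = json.dumps(payload, sort_keys=True).encode()
        return payload, hmac.new(key, blob, hashlib.sha256).hexdigest()

    def verify_semantics(payload, signature, face_embeddings_now, scene_label_now,
                         key: bytes, tol: float = 0.4):
        """Verification: the signature must check out, and the semantics re-extracted from
        the (possibly transformed) image must still match the signed ones."""
        blob = json.dumps(payload, sort_keys=True).encode()
        if not hmac.compare_digest(signature, hmac.new(key, blob, hashlib.sha256).hexdigest()):
            return False                          # payload was tampered with
        if payload["scene"] != scene_label_now:
            return False                          # background semantics changed
        signed = [np.asarray(e) for e in payload["faces"]]
        if len(signed) != len(face_embeddings_now):
            return False                          # faces added or removed
        # Benign transforms (crop, scale, small rotation) perturb embeddings only slightly;
        # adversarial edits such as face swaps push them past the tolerance.
        return all(np.linalg.norm(s - np.asarray(e)) < tol
                   for s, e in zip(signed, face_embeddings_now))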

Towards Robust Sensor Fusion in Visual Perception

Jun 23, 2020
Shaojie Wang, Tong Wu, Yevgeniy Vorobeychik

We study the problem of robust sensor fusion in visual perception, especially in autonomous driving settings. We evaluate the robustness of RGB camera and LiDAR sensor fusion for binary classification and object detection. In this work, we are interested in the behavior of different fusion methods under adversarial attacks on different sensors. We first train both classification and detection models with early fusion and late fusion, then apply different combinations of adversarial attacks on both sensor inputs for evaluation. We also study the effectiveness of adversarial attacks with varying budgets. Experimental results show that while sensor fusion models are generally vulnerable to adversarial attacks, late fusion is more robust than early fusion. The results also provide insights into building more robust sensor fusion models.
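
To illustrate the distinction being compared, here is a toy PyTorch sketch of the two fusion schemes for the binary-classification case. The layer sizes and the choice of averaging logits are assumptions; these are not the models trained in the paper.

    import torch
    import torch.nn as nn

    class EarlyFusion(nn.Module):
        """Fuse at the input level: concatenate RGB and a LiDAR depth projection, then classify."""
        def __init__(self, num_classes=2):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3 + 1, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes))

        def forward(self, rgb, lidar_depth):
            return self.net(torch.cat([rgb, lidar_depth], dim=1))

    class LateFusion(nn.Module):
        """Fuse at the decision level: per-sensor branches, then average the logits.
        An attack on one sensor corrupts only one branch, which is one intuition for
        why late fusion degrades more gracefully under single-sensor attacks."""
        def __init__(self, num_classes=2):
            super().__init__()
            def branch(in_ch):
                return nn.Sequential(
                    nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, num_classes))
            self.rgb_branch, self.lidar_branch = branch(3), branch(1)

        def forward(self, rgb, lidar_depth):
            return 0.5 * (self.rgb_branch(rgb) + self.lidar_branch(lidar_depth))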

Weakly Supervised Object Localization with Inter-Intra Regulated CAMs

Nov 19, 2019
Guofeng Cui, Ziyi Kou, Shaojie Wang, Wentian Zhao, Chenliang Xu

Weakly supervised object localization (WSOL) aims to locate objects in images by learning only from image-level labels. Current methods obtain localization results by relying on Class Activation Maps (CAMs). They typically introduce additional CAMs or feature maps generated from internal layers of deep networks and encourage the different CAMs to be either adversarial or cooperative with each other. In this work, instead of following either of these two approaches, we analyze their internal relationship and propose a novel intra-sample strategy that regulates two CAMs of the same sample, generated from different classifiers, so that each pixel dynamically adapts whether it participates in the adversarial or the cooperative process based on its own values. We mathematically demonstrate that our approach is a more general version of the current state-of-the-art method with fewer hyper-parameters. In addition, we develop an inter-sample criterion module for our WSOL task, originally proposed for co-segmentation problems, to refine the generated CAMs of each sample. The module considers a subgroup of samples under the same category and regulates their object regions. In experiments on two widely used datasets, we show that our proposed method significantly outperforms the existing state of the art, setting a new record for weakly supervised object localization.

* Cui and Kou are co-first authors of this paper
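
A toy sketch of the pixel-adaptive intra-sample idea is given below: where the two CAMs already agree confidently, they are pulled together (cooperative); elsewhere, trivial overlap is discouraged (adversarial). The thresholding rule and loss weights are illustrative assumptions, not the paper's formulation.

    import torch

    def intra_sample_regulation(cam_a, cam_b, thresh=0.5):
        """Toy pixel-adaptive regulation between two CAMs of the same sample."""
        cam_a, cam_b = cam_a.sigmoid(), cam_b.sigmoid()
        agree = (cam_a > thresh) & (cam_b > thresh)           # confident in both maps
        coop_loss = ((cam_a - cam_b).abs() * agree).mean()    # pull agreeing pixels together
        advers_loss = (cam_a * cam_b * (~agree)).mean()       # discourage trivial overlap elsewhere
        return coop_loss + advers_loss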

Weakly Supervised Localization Using Background Images

Sep 11, 2019
Ziyi Kou, Wentian Zhao, Guofeng Cui, Shaojie Wang

Weakly Supervised Object Localization (WSOL) methods usually rely on fully convolutional networks to obtain class activation maps (CAMs) for targeted labels. However, these networks tend to highlight only the most discriminative parts, so the located areas are much smaller than the entire targeted objects. In this work, we propose a novel end-to-end model that enlarges the CAMs generated by classification models and thus localizes targeted objects more precisely. In detail, we add an additional module to traditional classification networks that extracts foreground object proposals from images without classifying them into specific categories. We then use these normalized regions as unrestricted pixel-level mask supervision for the subsequent classification task. We collect a set of images, defined as the Background Image Set, from the Internet. Although it is much smaller than the targeted dataset, it surprisingly supports the method well in extracting foreground regions from different pictures. The extracted region is independent of the classification task and covers almost the entire object rather than just its most discriminative part, so these regions can serve as masks that supervise the response maps of classification models to become larger and more precise. The method achieves state-of-the-art results on CUB-200-2011 in terms of Top-1 and Top-5 localization error and a competitive result on ILSVRC2016 compared with other approaches.

* Course project of CSC577, University of Rochester 
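
The supervision step can be pictured with a short sketch: a class-agnostic foreground mask (which, per the abstract, would come from a proposal module trained with the Background Image Set) is used as pixel-level supervision for the CAM alongside the usual image-level loss. The particular loss form below is an assumption for illustration.

    import torch
    import torch.nn.functional as F

    def mask_supervised_cam_loss(cam, foreground_mask, logits, labels):
        """cam: (B,1,H,W) raw CAM; foreground_mask: (B,1,H,W) float mask in [0,1];
        logits/labels: standard image-level classification supervision."""
        cls_loss = F.cross_entropy(logits, labels)            # usual image-level loss
        cam = torch.sigmoid(cam)
        # Encourage the CAM to cover the whole extracted region, not just the most
        # discriminative part, and to stay quiet outside it.
        cover_loss = F.binary_cross_entropy(cam, foreground_mask)
        return cls_loss + cover_loss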

How to Make a BLT Sandwich? Learning to Reason towards Understanding Web Instructional Videos

Dec 06, 2018
Shaojie Wang, Wentian Zhao, Ziyi Kou, Chenliang Xu

Understanding web instructional videos is an essential branch of video understanding in two respects. First, most existing video methods focus on short-term actions in clips only a few seconds long; these methods are not directly applicable to long videos. Second, unlike unconstrained long videos such as movies, instructional videos are more structured in that they follow a step-by-step procedure that constrains the understanding task. In this paper, we study reasoning on instructional videos via question-answering (QA). Surprisingly, this has not been an emphasis in the video community despite its rich applications. We therefore introduce YouQuek, an annotated QA dataset for instructional videos based on the recent YouCook2. The questions in YouQuek are not limited to cues from a single frame but involve logical reasoning along the temporal dimension. Observing the lack of effective representations for modeling long videos, we propose a set of carefully designed models, including a novel Recurrent Graph Convolutional Network (RGCN) that captures both temporal order and relational information. Furthermore, we study multiple modalities, including descriptions and transcripts, to boost video understanding. Extensive experiments on YouQuek suggest that RGCN performs best in terms of QA accuracy, and further gains are obtained by introducing human-annotated descriptions.
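
A minimal sketch of the recurrent-plus-graph idea behind the RGCN is shown below: one graph-convolution step mixes relational information across frame nodes, and a GRU cell carries temporal order between steps. Dimensions, the adjacency definition, and the cell interface are assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    class RecurrentGraphConvCell(nn.Module):
        """Toy recurrent graph-convolution cell over frame nodes."""
        def __init__(self, dim):
            super().__init__()
            self.gc = nn.Linear(dim, dim)       # shared weights for the graph convolution
            self.gru = nn.GRUCell(dim, dim)     # recurrent update preserving temporal order

        def forward(self, node_feats, adj, hidden):
            # node_feats: (N, dim) frame features, adj: (N, N) normalized adjacency,
            # hidden: (N, dim) recurrent state per node.
            mixed = torch.relu(self.gc(adj @ node_feats))   # relation-aware message passing
            return self.gru(mixed, hidden)                   # temporal update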

GAN-EM: GAN based EM learning framework

Dec 02, 2018
Wentian Zhao, Shaojie Wang, Zhihuai Xie, Jing Shi, Chenliang Xu

The expectation-maximization (EM) algorithm finds maximum-likelihood solutions for models with latent variables. A typical example is the Gaussian Mixture Model (GMM), which requires a Gaussian assumption; however, natural images are highly non-Gaussian, so GMM cannot be applied to cluster them in pixel space. To overcome this limitation, we propose a GAN-based EM learning framework that can maximize the likelihood of images and estimate the latent variables with only an L-Lipschitz continuity constraint. We call this model GAN-EM; it is a framework for image clustering, semi-supervised classification, and dimensionality reduction. In the M-step, we design a novel loss function for the GAN discriminator to perform maximum likelihood estimation (MLE) on data with soft class label assignments. Specifically, a conditional generator captures the data distribution for $K$ classes, and a discriminator tells whether a sample is real or fake for each class. Since our model is unsupervised, the class label of real data is regarded as a latent variable, which is estimated by an additional network (E-net) in the E-step. The proposed GAN-EM achieves state-of-the-art clustering and semi-supervised classification results on MNIST, SVHN, and CelebA, as well as comparable quality of generated images to other recently developed generative models.
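
The alternation can be sketched as follows. This is a toy version in the spirit of the description above: the E-net infers soft class responsibilities for real images, and the M-step trains a class-conditional GAN whose per-class discriminator is weighted by those responsibilities. The network interfaces (per-class logits, a 100-dim latent) and loss forms are assumptions, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def gan_em_iteration(e_net, generator, discriminator, images, num_classes):
        """One toy EM iteration: E-step soft assignments, M-step soft-label GAN losses."""
        batch = images.size(0)

        # E-step: responsibilities q(k | x) for the latent class of each real image.
        with torch.no_grad():
            resp = torch.softmax(e_net(images), dim=1)                 # (batch, K)

        # M-step: sample class-conditional fakes; the discriminator outputs one
        # real/fake logit per class, so real images count toward class k with weight resp[:, k].
        y = torch.randint(num_classes, (batch,))
        fake = generator(torch.randn(batch, 100), y)                   # 100-dim latent, assumed
        real_logits = discriminator(images)                            # (batch, K), assumed
        fake_logits = discriminator(fake.detach())

        d_loss = -(resp * F.logsigmoid(real_logits)).sum(dim=1).mean() \
                 - F.logsigmoid(-fake_logits[torch.arange(batch), y]).mean()
        g_loss = -F.logsigmoid(discriminator(fake)[torch.arange(batch), y]).mean()
        return d_loss, g_loss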
