Junwen Chen

ATM: Action Temporality Modeling for Video Question Answering

Sep 05, 2023
Junwen Chen, Jie Zhu, Yu Kong

Despite significant progress in video question answering (VideoQA), existing methods fall short on questions that require causal/temporal reasoning across frames. This can be attributed to imprecise motion representations. We introduce Action Temporality Modeling (ATM) for temporality reasoning, which is distinctive in three ways: (1) it rethinks optical flow and finds that optical flow is effective in capturing long-horizon temporality reasoning; (2) it trains the visual-text embedding with contrastive learning in an action-centric manner, leading to better action representations in both the vision and text modalities; and (3) it prevents the model from answering the question given a shuffled video during fine-tuning, avoiding spurious correlations between appearance and motion and hence ensuring faithful temporality reasoning. In the experiments, we show that ATM outperforms previous approaches in accuracy on multiple VideoQA benchmarks and exhibits better true temporality reasoning ability.
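
The shuffled-video constraint in (3) can be pictured as a small auxiliary loss added during fine-tuning. Below is a minimal PyTorch-style sketch of that idea, assuming a generic `model(frames, question)` answer classifier; the function name and the weight `alpha` are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def temporality_finetune_loss(model, frames, question, answer, alpha=0.5):
    """frames: (B, T, C, H, W); answer: (B,) ground-truth answer indices."""
    logits = model(frames, question)                      # normal VideoQA prediction
    qa_loss = F.cross_entropy(logits, answer)

    # Shuffle the frame order (same permutation across the batch for simplicity).
    perm = torch.randperm(frames.size(1), device=frames.device)
    shuffled_logits = model(frames[:, perm], question)

    # Penalize confident answers on the shuffled input: match a uniform distribution,
    # so the model cannot answer temporal questions from appearance alone.
    uniform = torch.full_like(shuffled_logits, 1.0 / shuffled_logits.size(-1))
    shuffle_penalty = F.kl_div(
        F.log_softmax(shuffled_logits, dim=-1), uniform, reduction="batchmean"
    )
    return qa_loss + alpha * shuffle_penalty
```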

Defending Adversarial Patches via Joint Region Localizing and Inpainting

Jul 26, 2023
Junwen Chen, Xingxing Wei

Deep neural networks are successfully used in various applications but are vulnerable to adversarial examples. With the development of adversarial patches, the feasibility of attacks in physical scenes increases, and defenses against patch attacks are urgently needed. However, defending against such adversarial patch attacks remains an unsolved problem. In this paper, we analyse the properties of adversarial patches and find that: on the one hand, adversarial patches lead to appearance or contextual inconsistency in the target objects; on the other hand, the patch region shows abnormal changes in the high-level feature maps of the objects extracted by a backbone network. Considering these two points, we propose a novel defense method based on a ``localizing and inpainting" mechanism to pre-process the input examples. Specifically, we design a unified framework in which the ``localizing" sub-network uses a two-branch structure reflecting the two aspects above to accurately detect the adversarial patch region in the image, while the ``inpainting" sub-network uses the surrounding contextual cues to recover the original content covered by the adversarial patch. The quality of inpainted images is also evaluated by measuring appearance consistency and the effects of adversarial attacks. The two sub-networks are then jointly trained in an iterative optimization manner, so that the ``localizing" and ``inpainting" modules interact closely and learn a better solution. A series of experiments on traffic sign classification and detection tasks is conducted to defend against various adversarial patch attacks.
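
As a rough illustration of the localize-then-inpaint pre-processing, the sketch below assumes two hypothetical sub-networks, `localizer` (outputs a per-pixel patch probability map) and `inpainter` (fills the masked region from surrounding context); the iteration count and threshold are placeholders rather than the paper's settings.

```python
import torch

@torch.no_grad()
def defend(image, localizer, inpainter, iters=2, thresh=0.5):
    """image: (B, 3, H, W) in [0, 1]; returns a purified image for the downstream model."""
    x = image
    for _ in range(iters):
        mask = (localizer(x) > thresh).float()      # (B, 1, H, W), 1 = suspected patch region
        filled = inpainter(x * (1 - mask), mask)    # recover content under the mask from context
        x = x * (1 - mask) + filled * mask          # replace only the localized region
    return x
```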

Focusing on what to decode and what to train: Efficient Training with HOI Split Decoders and Specific Target Guided DeNoising

Jul 05, 2023
Junwen Chen, Yingcheng Wang, Keiji Yanai

Recent one-stage transformer-based methods achieve notable gains in the Human-Object Interaction (HOI) detection task by building on the detection capability of DETR. However, current methods redirect the detection target of the object decoder, and the box target is not explicitly separated from the query embeddings, which leads to long and difficult training. Furthermore, matching predicted HOI instances with the ground truth is more challenging than in object detection, so simply adapting training strategies from object detection makes training even harder. To resolve the ambiguity between human and object detection and to share the prediction burden, we propose a novel one-stage framework (SOV), which consists of a subject decoder, an object decoder, and a verb decoder. Moreover, we propose a novel Specific Target Guided (STG) denoising strategy, which leverages learnable object and verb label embeddings to guide the training and accelerate convergence. In addition, at inference time, label-specific information is fed directly into the decoders by initializing the query embeddings from the learnable label embeddings. Without additional features or prior language knowledge, our method (SOV-STG) achieves higher accuracy than the state-of-the-art method in one-third of the training epochs. The code is available at \url{https://github.com/cjw2021/SOV-STG}.
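
One way to picture the query initialization described above is to project learnable object and verb label embeddings into a fixed set of decoder queries. The sketch below is an interpretation of the abstract with hypothetical module names, not the released SOV-STG code.

```python
import torch.nn as nn

class LabelGuidedQueries(nn.Module):
    def __init__(self, num_obj_classes, num_verb_classes, num_queries, dim):
        super().__init__()
        self.obj_label_embed = nn.Embedding(num_obj_classes, dim)
        self.verb_label_embed = nn.Embedding(num_verb_classes, dim)
        # Project all label embeddings into a fixed number of decoder queries.
        self.obj_proj = nn.Linear(num_obj_classes, num_queries)
        self.verb_proj = nn.Linear(num_verb_classes, num_queries)

    def forward(self):
        # Returns (num_queries, dim) queries carrying label-specific information.
        obj_queries = self.obj_proj(self.obj_label_embed.weight.t()).t()
        verb_queries = self.verb_proj(self.verb_label_embed.weight.t()).t()
        return obj_queries, verb_queries
```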

GateHUB: Gated History Unit with Background Suppression for Online Action Detection

Jun 09, 2022
Junwen Chen, Gaurav Mittal, Ye Yu, Yu Kong, Mei Chen

Online action detection is the task of predicting the action as soon as it happens in a streaming video. A major challenge is that the model does not have access to the future and must rely solely on the history, i.e., the frames observed so far, to make predictions. It is therefore important to accentuate the parts of the history that are more informative for predicting the current frame. We present GateHUB, Gated History Unit with Background Suppression, which comprises a novel position-guided gated cross-attention mechanism to enhance or suppress parts of the history according to how informative they are for current frame prediction. GateHUB further proposes Future-augmented History (FaH) to make history features more informative by using subsequently observed frames when available. In a single unified framework, GateHUB integrates the transformer's ability to model long-range temporal dependencies and the recurrent model's capacity to selectively encode relevant information. GateHUB also introduces a background suppression objective to further mitigate false-positive background frames that closely resemble action frames. Extensive validation on three benchmark datasets, THUMOS, TVSeries, and HDD, demonstrates that GateHUB significantly outperforms all existing methods and is also more efficient than the existing best work. Furthermore, a flow-free version of GateHUB achieves higher or comparable accuracy at a 2.8x higher frame rate than all existing methods that require both RGB and optical flow information for prediction.
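
The position-guided gated cross-attention can be sketched roughly as a cross-attention layer over the history followed by a learned gate on the attended features. This is a simplified reading of the abstract (module names and the gate form are assumptions), not the GateHUB implementation.

```python
import torch
import torch.nn as nn

class GatedHistoryAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, current, history, history_pos):
        """current: (B, 1, D) current-frame query; history, history_pos: (B, T, D)."""
        # Position information guides which history frames are attended.
        attended, _ = self.attn(query=current, key=history + history_pos, value=history)
        # Gate decides how much of the attended history to enhance or suppress.
        g = self.gate(torch.cat([attended, current], dim=-1))
        return current + g * attended
```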

* CVPR 2022 

Transformer-based Cross-Modal Recipe Embeddings with Large Batch Training

May 10, 2022
Jing Yang, Junwen Chen, Keiji Yanai

In this paper, we present a cross-modal recipe retrieval framework, Transformer-based Network for Large Batch Training (TNLBT), which is inspired by ACME~(Adversarial Cross-Modal Embedding) and H-T~(Hierarchical Transformer). TNLBT performs retrieval while also generating images from recipe embeddings. We apply a Hierarchical Transformer-based recipe text encoder, a Vision Transformer~(ViT)-based recipe image encoder, and an adversarial network architecture to enable better cross-modal embedding learning for recipe texts and images. In addition, we use self-supervised learning to exploit the rich information in recipe texts that have no corresponding images. Since contrastive learning benefits from a larger batch size according to recent literature on self-supervised learning, we adopt a large batch size during training and validate its effectiveness. In the experiments, the proposed framework significantly outperforms the current state-of-the-art frameworks in both cross-modal recipe retrieval and image generation on the Recipe1M benchmark. This is the first work to confirm the effectiveness of large batch training for cross-modal recipe embeddings.
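
The large-batch contrastive objective can be illustrated with a standard symmetric InfoNCE loss over in-batch recipe/image pairs, where every other item in the batch serves as a negative; the temperature and normalization below are illustrative choices, not necessarily those used in TNLBT.

```python
import torch
import torch.nn.functional as F

def recipe_image_contrastive_loss(text_emb, img_emb, temperature=0.07):
    """text_emb, img_emb: (B, D) embeddings of matching recipe/image pairs."""
    text_emb = F.normalize(text_emb, dim=-1)
    img_emb = F.normalize(img_emb, dim=-1)
    logits = text_emb @ img_emb.t() / temperature         # (B, B) similarity matrix
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # A larger batch size B directly provides more in-batch negatives per pair.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```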

* 13 pages, 8 figures 

QAHOI: Query-Based Anchors for Human-Object Interaction Detection

Dec 16, 2021
Junwen Chen, Keiji Yanai

Human-object interaction (HOI) detection, as a downstream task of object detection, requires localizing pairs of humans and objects and extracting the semantic relationships between them from an image. Recently, one-stage approaches have become a new trend for this task due to their high efficiency. However, these approaches focus on detecting possible interaction points or filtering human-object pairs, ignoring the variability in the location and size of objects across spatial scales. To address this problem, we propose a transformer-based method, QAHOI (Query-Based Anchors for Human-Object Interaction detection), which leverages a multi-scale architecture to extract features from different spatial scales and uses query-based anchors to predict all the elements of an HOI instance. We further find that a powerful backbone significantly increases accuracy for QAHOI, and QAHOI with a transformer-based backbone outperforms recent state-of-the-art methods by large margins on the HICO-DET benchmark. The source code is available at https://github.com/cjw2021/QAHOI.
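
A query-based anchor head that predicts all elements of an HOI instance might look roughly like the sketch below, where each query carries an anchor point and regresses the human and object boxes as offsets from it; the head structure and names are assumptions for illustration, not the QAHOI code.

```python
import torch
import torch.nn as nn

class HOIAnchorHead(nn.Module):
    def __init__(self, dim, num_obj_classes, num_verb_classes):
        super().__init__()
        self.human_box = nn.Linear(dim, 4)    # (dx, dy, w, h) relative to the anchor point
        self.object_box = nn.Linear(dim, 4)
        self.obj_cls = nn.Linear(dim, num_obj_classes)
        self.verb_cls = nn.Linear(dim, num_verb_classes)

    def forward(self, query_feat, anchor_xy):
        """query_feat: (B, Q, D); anchor_xy: (B, Q, 2) normalized anchor coordinates."""
        h, o = self.human_box(query_feat), self.object_box(query_feat)
        # Box centers are offsets added to the shared anchor point; sizes pass through a sigmoid.
        h_box = torch.cat([anchor_xy + h[..., :2], h[..., 2:].sigmoid()], dim=-1)
        o_box = torch.cat([anchor_xy + o[..., :2], o[..., 2:].sigmoid()], dim=-1)
        return h_box, o_box, self.obj_cls(query_feat), self.verb_cls(query_feat)
```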

Group Activity Prediction with Sequential Relational Anticipation Model

Aug 06, 2020
Junwen Chen, Wentao Bao, Yu Kong

In this paper, we propose a novel approach to predicting group activities given only the beginning frames of incomplete activity executions. Existing action prediction approaches learn to enhance the representation power of the partial observation. For group activity prediction, however, the evolution of people's activities and positions over time is an important cue. To this end, we propose a Sequential Relational Anticipation Model (SRAM) that summarizes the relational dynamics in the partial observation and progressively anticipates group representations with rich discriminative information. Our model explicitly anticipates both activity features and positions with two graph auto-encoders, aiming to learn a discriminative group representation for group activity prediction. Experimental results on two widely used datasets demonstrate that our approach significantly outperforms state-of-the-art activity prediction methods.
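
At a very high level, the anticipation step can be pictured as encoding the partially observed features and positions and then predicting anticipated features, positions, and the activity label. The sketch below abstracts away the graph structure over people and uses hypothetical names throughout; it is not the SRAM architecture.

```python
import torch
import torch.nn as nn

class AnticipationSketch(nn.Module):
    def __init__(self, feat_dim, hidden_dim, num_classes):
        super().__init__()
        self.encoder = nn.GRU(feat_dim + 2, hidden_dim, batch_first=True)
        self.feat_head = nn.Linear(hidden_dim, feat_dim)   # anticipated activity features
        self.pos_head = nn.Linear(hidden_dim, 2)           # anticipated (x, y) positions
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, feats, positions):
        """feats: (B, T_obs, F); positions: (B, T_obs, 2) from the observed beginning frames."""
        _, h = self.encoder(torch.cat([feats, positions], dim=-1))
        summary = h[-1]                                     # (B, hidden_dim) summary of the partial observation
        return self.feat_head(summary), self.pos_head(summary), self.classifier(summary)
```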

* This paper is accepted to ECCV2020 

ContourRend: A Segmentation Method for Improving Contours by Rendering

Jul 15, 2020
Junwen Chen, Yi Lu, Yaran Chen, Dongbin Zhao, Zhonghua Pang

A good object segmentation should contain clear contours and complete regions. However, mask-based segmentation cannot handle contour features well on a coarse prediction grid, which causes blurry edges, while contour-based segmentation provides contours directly but misses their details. In order to obtain fine contours, we propose a segmentation method named ContourRend, which adopts a contour renderer to refine segmentation contours, and implement it on a segmentation model based on a graph convolutional network (GCN). For the single-object segmentation task on the Cityscapes dataset, the GCN-based segmentation contour is used to generate the contour of a single object; our contour renderer then focuses on the pixels around the contour and predicts their categories at high resolution. By rendering the contour result, our method reaches 72.41% mean intersection over union (IoU) and surpasses the Polygon-GCN baseline by 1.22%.
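
The rendering step can be sketched as sampling pixels in a small band around the predicted coarse contour and re-classifying them at full resolution with a per-pixel point head. The band width and `point_head` (e.g., a small MLP mapping C channel features to a foreground logit) are assumptions for illustration.

```python
import torch

def render_contour(coarse_mask, contour_points, feature_map, point_head, band=3):
    """coarse_mask: (H, W) float in {0, 1}; contour_points: (N, 2) (x, y) pixels; feature_map: (C, H, W)."""
    C, H, W = feature_map.shape
    refined = coarse_mask.clone()
    for px, py in contour_points.long().tolist():
        # Re-classify a small high-resolution window around each contour point.
        x0, x1 = max(0, px - band), min(W, px + band + 1)
        y0, y1 = max(0, py - band), min(H, py + band + 1)
        window = feature_map[:, y0:y1, x0:x1].flatten(1).t()        # (h*w, C) per-pixel features
        probs = point_head(window).sigmoid().view(y1 - y0, x1 - x0) # foreground probability per pixel
        refined[y0:y1, x0:x1] = probs
    return refined
```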

Adversarial Multi-Binary Neural Network for Multi-class Classification

Mar 25, 2020
Haiyang Xu, Junwen Chen, Kun Han, Xiangang Li

Multi-class text classification is one of the key problems in machine learning and natural language processing. Emerging neural networks deal with the problem using a multi-output softmax layer and achieve substantial progress, but they do not explicitly learn the correlation among classes. In this paper, we use a multi-task framework to address multi-class classification, where a multi-class classifier and multiple binary classifiers are trained together. Moreover, we employ adversarial training to separate the class-specific features from the class-agnostic features, which yields better feature representations. We conduct experiments on two large-scale multi-class text classification tasks and demonstrate that the proposed architecture outperforms baseline approaches.
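
A rough sketch of the multi-task objective described above: a C-way head and C one-vs-rest binary heads operate on class-specific features, while a discriminator on gradient-reversed shared features pushes them to stay class-agnostic. Head names, the unweighted loss combination, and the gradient-reversal formulation are illustrative assumptions, not the paper's exact training recipe.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, reversed gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -grad

def total_loss(shared_feat, specific_feat, multi_head, binary_heads, disc_head, labels):
    """Multi-class head + one-vs-rest binary heads on class-specific features,
    plus a discriminator on gradient-reversed shared features (adversarial term)."""
    num_classes = len(binary_heads)
    multi = F.cross_entropy(multi_head(specific_feat), labels)
    binary = sum(
        F.binary_cross_entropy_with_logits(
            binary_heads[c](specific_feat).squeeze(-1), (labels == c).float())
        for c in range(num_classes)
    ) / num_classes
    adv = F.cross_entropy(disc_head(GradReverse.apply(shared_feat)), labels)
    return multi + binary + adv
```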

Learning Syntactic and Dynamic Selective Encoding for Document Summarization

Mar 25, 2020
Haiyang Xu, Yahao He, Kun Han, Junwen Chen, Xiangang Li

Text summarization aims to generate a headline or a short summary containing the major information of the source text. Recent studies employ the sequence-to-sequence framework to encode the input with a neural network and generate an abstractive summary. However, most studies feed the encoder with semantic word embeddings but ignore the syntactic information of the text. Further, although previous studies proposed a selective gate to control the information flow from the encoder to the decoder, the gate is static during decoding and cannot differentiate the information based on the decoder state. In this paper, we propose a novel neural architecture for document summarization with the following contributions: first, we incorporate syntactic information, such as constituency parse trees, into the encoding sequence to learn both the semantic and syntactic information of the document, resulting in more accurate summaries; second, we propose a dynamic gate network that selects salient information based on the context of the decoder state, which is essential for document summarization. The proposed model has been evaluated on the CNN/Daily Mail summarization dataset, and the experimental results show that it outperforms baseline approaches.
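
The contrast between a static selective gate and the proposed dynamic gate can be pictured as a gate recomputed at each decoding step from the current decoder state, so different steps can select different parts of the encoded document. Dimensions and module names in the sketch below are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DynamicGate(nn.Module):
    def __init__(self, enc_dim, dec_dim):
        super().__init__()
        self.proj = nn.Linear(enc_dim + dec_dim, enc_dim)

    def forward(self, enc_states, dec_state):
        """enc_states: (B, T, enc_dim); dec_state: (B, dec_dim) at the current decoding step."""
        dec = dec_state.unsqueeze(1).expand(-1, enc_states.size(1), -1)
        gate = torch.sigmoid(self.proj(torch.cat([enc_states, dec], dim=-1)))
        return gate * enc_states   # gated encoder memory used by attention at this step
```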

* IJCNN 2019 