Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Suhwan Cho

Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation

Nov 29, 2023

Minhyeok Lee, Dogyoon Lee, Jungho Lee, Suhwan Cho, Heeseung Choi, Ig-Jae Kim, Sangyoun Lee

Figure 1 for Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation

Figure 2 for Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation

Figure 3 for Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation

Figure 4 for Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation

Abstract:Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level. Various recent RIS models have achieved state-of-the-art performance by generating contextual tokens to model multimodal features from pretrained encoders and effectively fusing them using transformer-based cross-modal attention. While these methods match language features with image features to effectively identify likely target objects, they often struggle to correctly understand contextual information in complex and ambiguous sentences and scenes. To address this issue, we propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE). The proposed model learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level. In other words, this approach involves mutually complementing across the features of images and language, with a focus on enabling the network to understand interconnected deep contextual information between the two modalities. This learning method enhances the robustness of RIS performance in complex sentences and scenes. Our BTMAE achieves state-of-the-art performance on three popular datasets, and we demonstrate the effectiveness of the proposed method through various ablation studies.

Via

Access Paper or Ask Questions

Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation

Sep 26, 2023

Suhwan Cho, Minhyeok Lee, Jungho Lee, MyeongAh Cho, Sangyoun Lee

Figure 1 for Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation

Figure 2 for Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation

Figure 3 for Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation

Figure 4 for Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation

Abstract:Unsupervised video object segmentation (VOS) is a task that aims to detect the most salient object in a video without external guidance about the object. To leverage the property that salient objects usually have distinctive movements compared to the background, recent methods collaboratively use motion cues extracted from optical flow maps with appearance cues extracted from RGB images. However, as optical flow maps are usually very relevant to segmentation masks, the network is easy to be learned overly dependent on the motion cues during network training. As a result, such two-stream approaches are vulnerable to confusing motion cues, making their prediction unstable. To relieve this issue, we design a novel motion-as-option network by treating motion cues as optional. During network training, RGB images are randomly provided to the motion encoder instead of optical flow maps, to implicitly reduce motion dependency of the network. As the learned motion encoder can deal with both RGB images and optical flow maps, two different predictions can be generated depending on which source information is used as motion input. In order to fully exploit this property, we also propose an adaptive output selection algorithm to adopt optimal prediction result at test time. Our proposed approach affords state-of-the-art performance on all public benchmark datasets, even maintaining real-time inference speed.

Via

Access Paper or Ask Questions

Adaptive Graph Convolution Module for Salient Object Detection

Mar 17, 2023

Yongwoo Lee, Minhyeok Lee, Suhwan Cho, Sangyoun Lee

Abstract:Salient object detection (SOD) is a task that involves identifying and segmenting the most visually prominent object in an image. Existing solutions can accomplish this use a multi-scale feature fusion mechanism to detect the global context of an image. However, as there is no consideration of the structures in the image nor the relations between distant pixels, conventional methods cannot deal with complex scenes effectively. In this paper, we propose an adaptive graph convolution module (AGCM) to overcome these limitations. Prototype features are initially extracted from the input image using a learnable region generation layer that spatially groups features in the image. The prototype features are then refined by propagating information between them based on a graph architecture, where each feature is regarded as a node. Experimental results show that the proposed AGCM dramatically improves the SOD performance both quantitatively and quantitatively.

* 4 pages, 3 figures

Via

Access Paper or Ask Questions

Guided Slot Attention for Unsupervised Video Object Segmentation

Mar 15, 2023

Minhyeok Lee, Suhwan Cho, Dogyoon Lee, Chaewon Park, Jungho Lee, Sangyoun Lee

Figure 1 for Guided Slot Attention for Unsupervised Video Object Segmentation

Figure 2 for Guided Slot Attention for Unsupervised Video Object Segmentation

Figure 3 for Guided Slot Attention for Unsupervised Video Object Segmentation

Figure 4 for Guided Slot Attention for Unsupervised Video Object Segmentation

Abstract:Unsupervised video object segmentation aims to segment the most prominent object in a video sequence. However, the existence of complex backgrounds and multiple foreground objects make this task challenging. To address this issue, we propose a guided slot attention network to reinforce spatial structural information and obtain better foreground--background separation. The foreground and background slots, which are initialized with query guidance, are iteratively refined based on interactions with template information. Furthermore, to improve slot--template interaction and effectively fuse global and local features in the target and reference frames, K-nearest neighbors filtering and a feature aggregation transformer are introduced. The proposed model achieves state-of-the-art performance on two popular datasets. Additionally, we demonstrate the robustness of the proposed model in challenging scenes through various comparative experiments.

Via

Access Paper or Ask Questions

TSANET: Temporal and Scale Alignment for Unsupervised Video Object Segmentation

Mar 08, 2023

Seunghoon Lee, Suhwan Cho, Dogyoon Lee, Minhyeok Lee, Sangyoun Lee

Figure 1 for TSANET: Temporal and Scale Alignment for Unsupervised Video Object Segmentation

Figure 2 for TSANET: Temporal and Scale Alignment for Unsupervised Video Object Segmentation

Figure 3 for TSANET: Temporal and Scale Alignment for Unsupervised Video Object Segmentation

Figure 4 for TSANET: Temporal and Scale Alignment for Unsupervised Video Object Segmentation

Abstract:Unsupervised Video Object Segmentation (UVOS) refers to the challenging task of segmenting the prominent object in videos without manual guidance. In other words, the network detects the accurate region of the target object in a sequence of RGB frames without prior knowledge. In recent works, two approaches for UVOS have been discussed that can be divided into: appearance and appearance-motion based methods. Appearance based methods utilize the correlation information of inter-frames to capture target object that commonly appears in a sequence. However, these methods does not consider the motion of target object due to exploit the correlation information between randomly paired frames. Appearance-motion based methods, on the other hand, fuse the appearance features from RGB frames with the motion features from optical flow. Motion cue provides useful information since salient objects typically show distinctive motion in a sequence. However, these approaches have the limitation that the dependency on optical flow is dominant. In this paper, we propose a novel framework for UVOS that can address aforementioned limitations of two approaches in terms of both time and scale. Temporal Alignment Fusion aligns the saliency information of adjacent frames with the target frame to leverage the information of adjacent frames. Scale Alignment Decoder predicts the target object mask precisely by aggregating differently scaled feature maps via continuous mapping with implicit neural representation. We present experimental results on public benchmark datasets, DAVIS 2016 and FBMS, which demonstrate the effectiveness of our method. Furthermore, we outperform the state-of-the-art methods on DAVIS 2016.

Via

Access Paper or Ask Questions

One-Shot Video Inpainting

Feb 28, 2023

Sangjin Lee, Suhwan Cho, Sangyoun Lee

Abstract:Recently, removing objects from videos and filling in the erased regions using deep video inpainting (VI) algorithms has attracted considerable attention. Usually, a video sequence and object segmentation masks for all frames are required as the input for this task. However, in real-world applications, providing segmentation masks for all frames is quite difficult and inefficient. Therefore, we deal with VI in a one-shot manner, which only takes the initial frame's object mask as its input. Although we can achieve that using naive combinations of video object segmentation (VOS) and VI methods, they are sub-optimal and generally cause critical errors. To address that, we propose a unified pipeline for one-shot video inpainting (OSVI). By jointly learning mask prediction and video completion in an end-to-end manner, the results can be optimal for the entire task instead of each separate module. Additionally, unlike the two stage methods that use the predicted masks as ground truth cues, our method is more reliable because the predicted masks can be used as the network's internal guidance. On the synthesized datasets for OSVI, our proposed method outperforms all others both quantitatively and qualitatively.

* AAAI2023 submitted

Via

Access Paper or Ask Questions

Two-stream Decoder Feature Normality Estimating Network for Industrial Anomaly Detection

Feb 20, 2023

Chaewon Park, Minhyeok Lee, Suhwan Cho, Donghyeong Kim, Sangyoun Lee

Figure 1 for Two-stream Decoder Feature Normality Estimating Network for Industrial Anomaly Detection

Figure 2 for Two-stream Decoder Feature Normality Estimating Network for Industrial Anomaly Detection

Figure 3 for Two-stream Decoder Feature Normality Estimating Network for Industrial Anomaly Detection

Figure 4 for Two-stream Decoder Feature Normality Estimating Network for Industrial Anomaly Detection

Abstract:Image reconstruction-based anomaly detection has recently been in the spotlight because of the difficulty of constructing anomaly datasets. These approaches work by learning to model normal features without seeing abnormal samples during training and then discriminating anomalies at test time based on the reconstructive errors. However, these models have limitations in reconstructing the abnormal samples due to their indiscriminate conveyance of features. Moreover, these approaches are not explicitly optimized for distinguishable anomalies. To address these problems, we propose a two-stream decoder network (TSDN), designed to learn both normal and abnormal features. Additionally, we propose a feature normality estimator (FNE) to eliminate abnormal features and prevent high-quality reconstruction of abnormal regions. Evaluation on a standard benchmark demonstrated performance better than state-of-the-art models.

* Accepted to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023

Via

Access Paper or Ask Questions

Leveraging Spatio-Temporal Dependency for Skeleton-Based Action Recognition

Dec 09, 2022

Jungho Lee, Minhyeok Lee, Suhwan Cho, Sungmin Woo, Sangyoun Lee

Figure 1 for Leveraging Spatio-Temporal Dependency for Skeleton-Based Action Recognition

Figure 2 for Leveraging Spatio-Temporal Dependency for Skeleton-Based Action Recognition

Figure 3 for Leveraging Spatio-Temporal Dependency for Skeleton-Based Action Recognition

Figure 4 for Leveraging Spatio-Temporal Dependency for Skeleton-Based Action Recognition

Abstract:Skeleton-based action recognition has attracted considerable attention due to its compact skeletal structure of the human body. Many recent methods have achieved remarkable performance using graph convolutional networks (GCNs) and convolutional neural networks (CNNs), which extract spatial and temporal features, respectively. Although spatial and temporal dependencies in the human skeleton have been explored, spatio-temporal dependency is rarely considered. In this paper, we propose the Inter-Frame Curve Network (IFC-Net) to effectively leverage the spatio-temporal dependency of the human skeleton. Our proposed network consists of two novel elements: 1) The Inter-Frame Curve (IFC) module; and 2) Dilated Graph Convolution (D-GC). The IFC module increases the spatio-temporal receptive field by identifying meaningful node connections between every adjacent frame and generating spatio-temporal curves based on the identified node connections. The D-GC allows the network to have a large spatial receptive field, which specifically focuses on the spatial domain. The kernels of D-GC are computed from the given adjacency matrices of the graph and reflect large receptive field in a way similar to the dilated CNNs. Our IFC-Net combines these two modules and achieves state-of-the-art performance on three skeleton-based action recognition benchmarks: NTU-RGB+D 60, NTU-RGB+D 120, and Northwestern-UCLA.

* 12 pages, 5 figures

Via

Access Paper or Ask Questions

Occluded Person Re-Identification via Relational Adaptive Feature Correction Learning

Dec 09, 2022

Minjung Kim, MyeongAh Cho, Heansung Lee, Suhwan Cho, Sangyoun Lee

Figure 1 for Occluded Person Re-Identification via Relational Adaptive Feature Correction Learning

Figure 2 for Occluded Person Re-Identification via Relational Adaptive Feature Correction Learning

Figure 3 for Occluded Person Re-Identification via Relational Adaptive Feature Correction Learning

Figure 4 for Occluded Person Re-Identification via Relational Adaptive Feature Correction Learning

Abstract:Occluded person re-identification (Re-ID) in images captured by multiple cameras is challenging because the target person is occluded by pedestrians or objects, especially in crowded scenes. In addition to the processes performed during holistic person Re-ID, occluded person Re-ID involves the removal of obstacles and the detection of partially visible body parts. Most existing methods utilize the off-the-shelf pose or parsing networks as pseudo labels, which are prone to error. To address these issues, we propose a novel Occlusion Correction Network (OCNet) that corrects features through relational-weight learning and obtains diverse and representative features without using external networks. In addition, we present a simple concept of a center feature in order to provide an intuitive solution to pedestrian occlusion scenarios. Furthermore, we suggest the idea of Separation Loss (SL) for focusing on different parts between global features and part features. We conduct extensive experiments on five challenging benchmark datasets for occluded and holistic Re-ID tasks to demonstrate that our method achieves superior performance to state-of-the-art methods especially on occluded scene.

* ICASSP 2022

Via

Access Paper or Ask Questions

Domain Alignment and Temporal Aggregation for Unsupervised Video Object Segmentation

Nov 22, 2022

Suhwan Cho, Minhyeok Lee, Seunghoon Lee, Sangyoun Lee

Abstract:Unsupervised video object segmentation aims at detecting and segmenting the most salient object in videos. In recent times, two-stream approaches that collaboratively leverage appearance cues and motion cues have attracted extensive attention thanks to their powerful performance. However, there are two limitations faced by those methods: 1) the domain gap between appearance and motion information is not well considered; and 2) long-term temporal coherence within a video sequence is not exploited. To overcome these limitations, we propose a domain alignment module (DAM) and a temporal aggregation module (TAM). DAM resolves the domain gap between two modalities by forcing the values to be in the same range using a cross-correlation mechanism. TAM captures long-term coherence by extracting and leveraging global cues of a video. On public benchmark datasets, our proposed approach demonstrates its effectiveness, outperforming all existing methods by a substantial margin.

Via

Access Paper or Ask Questions