Dogyoon Lee

Guided Slot Attention for Unsupervised Video Object Segmentation

Mar 15, 2023
Minhyeok Lee, Suhwan Cho, Dogyoon Lee, Chaewon Park, Jungho Lee, Sangyoun Lee

Unsupervised video object segmentation aims to segment the most prominent object in a video sequence. However, complex backgrounds and multiple foreground objects make this task challenging. To address this issue, we propose a guided slot attention network that reinforces spatial structural information and obtains better foreground-background separation. Foreground and background slots, initialized with query guidance, are iteratively refined through interactions with template information. Furthermore, to improve slot-template interaction and to effectively fuse global and local features in the target and reference frames, K-nearest-neighbors filtering and a feature aggregation transformer are introduced. The proposed model achieves state-of-the-art performance on two popular datasets. Additionally, we demonstrate the robustness of the proposed model in challenging scenes through various comparative experiments.
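
To make the slot-refinement idea concrete, here is a minimal PyTorch sketch of an iterative attention update between a small set of foreground/background slots and frame features. It is a generic slot-attention-style loop under assumed shapes, not the paper's guided variant; the class name SlotRefiner and all dimensions are illustrative.

```python
# Minimal sketch of an iterative slot-attention update (generic form, not the
# paper's exact guided variant). Foreground/background slots attend to frame
# features and are refined over several iterations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotRefiner(nn.Module):
    def __init__(self, dim=256, iters=3):
        super().__init__()
        self.iters = iters
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, slots, feats):
        # slots: (B, S, D) initial foreground/background slots (query guidance)
        # feats: (B, N, D) flattened frame/template features
        k, v = self.to_k(feats), self.to_v(feats)
        for _ in range(self.iters):
            q = self.to_q(slots)
            attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, S, N)
            updates = attn @ v                                            # (B, S, D)
            slots = self.gru(updates.flatten(0, 1), slots.flatten(0, 1)).view_as(slots)
        return slots

feats = torch.randn(2, 1024, 256)   # e.g. a 32x32 feature map, flattened
slots = torch.randn(2, 2, 256)      # one foreground + one background slot
print(SlotRefiner()(slots, feats).shape)  # torch.Size([2, 2, 256])
```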

TSANET: Temporal and Scale Alignment for Unsupervised Video Object Segmentation

Mar 08, 2023
Seunghoon Lee, Suhwan Cho, Dogyoon Lee, Minhyeok Lee, Sangyoun Lee

Unsupervised Video Object Segmentation (UVOS) is the challenging task of segmenting the prominent object in videos without manual guidance; the network must detect the exact region of the target object in a sequence of RGB frames without prior knowledge. Recent UVOS work falls into two categories: appearance-based and appearance-motion-based methods. Appearance-based methods exploit inter-frame correlation to capture the target object that appears consistently throughout a sequence. However, because they compute correlation between randomly paired frames, these methods do not consider the motion of the target object. Appearance-motion-based methods, on the other hand, fuse appearance features from RGB frames with motion features from optical flow. The motion cue provides useful information, since salient objects typically exhibit distinctive motion in a sequence, but these approaches are heavily dependent on optical flow. In this paper, we propose a novel UVOS framework that addresses the limitations of both approaches in terms of time and scale. Temporal Alignment Fusion aligns the saliency information of adjacent frames with the target frame to leverage information from neighboring frames. Scale Alignment Decoder predicts the target object mask precisely by aggregating feature maps of different scales via continuous mapping with an implicit neural representation. We present experimental results on the public benchmark datasets DAVIS 2016 and FBMS, which demonstrate the effectiveness of our method, and we outperform the state-of-the-art methods on DAVIS 2016.
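
The continuous-mapping idea behind the Scale Alignment Decoder can be illustrated with a LIIF-style implicit decoder: features are sampled at arbitrary query coordinates and an MLP predicts the mask value there. This is a hedged sketch of the general technique, not the paper's module; ImplicitMaskDecoder and its dimensions are assumptions.

```python
# Minimal sketch of decoding a mask at arbitrary resolution with an implicit
# neural representation (LIIF-style continuous upsampling). It illustrates the
# general idea behind a scale-alignment decoder, not the paper's exact module.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitMaskDecoder(nn.Module):
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        # Input: sampled feature vector + 2D query coordinate.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feat_map, out_h, out_w):
        # feat_map: (B, C, h, w) low-resolution feature map
        b = feat_map.size(0)
        ys = torch.linspace(-1, 1, out_h)
        xs = torch.linspace(-1, 1, out_w)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([gx, gy], dim=-1)                   # (H, W, 2) in [-1, 1]
        grid = coords.unsqueeze(0).expand(b, -1, -1, -1)
        sampled = F.grid_sample(feat_map, grid, mode="bilinear",
                                align_corners=False)              # (B, C, H, W)
        sampled = sampled.permute(0, 2, 3, 1)                     # (B, H, W, C)
        inp = torch.cat([sampled, grid], dim=-1)
        return self.mlp(inp).squeeze(-1)                          # (B, H, W) mask logits

feat = torch.randn(1, 64, 32, 32)
mask = ImplicitMaskDecoder()(feat, 128, 128)
print(mask.shape)  # torch.Size([1, 128, 128])
```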

DP-NeRF: Deblurred Neural Radiance Field with Physical Scene Priors

Dec 02, 2022
Dogyoon Lee, Minhyeok Lee, Chajin Shin, Sangyoun Lee

Neural Radiance Fields (NeRF) have exhibited outstanding three-dimensional (3D) reconstruction quality via novel view synthesis from multi-view images and paired calibrated camera parameters. However, previous NeRF-based systems have been demonstrated under strictly controlled settings, with little attention paid to less ideal scenarios involving noise such as exposure variation, illumination changes, and blur. In particular, although blur frequently occurs in real captures, NeRF models that can handle blurred images have received little attention. The few studies that have investigated NeRF for blurred images do not consider geometric and appearance consistency in 3D space, one of the most important factors in 3D reconstruction, which leads to inconsistency and degraded perceptual quality in the constructed scene. Hence, this paper proposes DP-NeRF, a novel clean NeRF framework for blurred images that is constrained by two physical priors derived from the actual blurring process during image acquisition. DP-NeRF introduces a rigid blurring kernel that imposes 3D consistency based on these physical priors, and an adaptive weight proposal that refines the color composition error by considering the relationship between depth and blur. We present extensive experimental results on synthetic and real scenes with two types of blur: camera motion blur and defocus blur. The results demonstrate that DP-NeRF successfully improves the perceptual quality of the constructed NeRF while ensuring 3D geometric and appearance consistency. We further demonstrate the effectiveness of our model through a comprehensive ablation analysis.
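
A rough sketch of how a blurred pixel can be modeled as a weighted composite of colors rendered along rigidly transformed rays is given below. The rigid transform parameters, the render callable, and the weighting are placeholders chosen for illustration; this is not DP-NeRF's actual implementation.

```python
# Minimal sketch of modeling a blurred pixel as a weighted composite of colors
# rendered along several rigidly transformed rays. The rigid transforms and the
# render() routine here are placeholders, not DP-NeRF's released code.
import torch

def so3_exp(w):
    # Rodrigues' formula: axis-angle (3,) -> rotation matrix (3, 3).
    theta = w.norm() + 1e-8
    k = w / theta
    K = torch.zeros(3, 3)
    K[0, 1], K[0, 2] = -k[2], k[1]
    K[1, 0], K[1, 2] = k[2], -k[0]
    K[2, 0], K[2, 1] = -k[1], k[0]
    return torch.eye(3) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

def blurred_color(ray_o, ray_d, rigid_params, weights, render):
    # ray_o, ray_d: (3,) origin and direction of the sharp ray
    # rigid_params: list of (rotation axis-angle, translation) per blur sample
    # weights: (num_samples,) raw composition weights (softmax-normalized below)
    # render: callable (origin, direction) -> RGB color (3,), e.g. a NeRF renderer
    colors = []
    for (w, t) in rigid_params:
        R = so3_exp(w)
        colors.append(render(R @ ray_o + t, R @ ray_d))
    colors = torch.stack(colors)                       # (num_samples, 3)
    weights = torch.softmax(weights, dim=0)
    return (weights[:, None] * colors).sum(dim=0)      # composite blurred RGB

# Toy usage with a dummy renderer that just maps the ray direction to a color.
dummy_render = lambda o, d: torch.sigmoid(d)
params = [(torch.randn(3) * 0.01, torch.randn(3) * 0.01) for _ in range(5)]
c = blurred_color(torch.zeros(3), torch.tensor([0., 0., 1.]), params,
                  torch.randn(5), dummy_render)
print(c)  # blurred pixel color, shape (3,)
```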

Global-Local Aggregation with Deformable Point Sampling for Camouflaged Object Detection

Nov 22, 2022
Minhyeok Lee, Suhwan Cho, Chaewon Park, Dogyoon Lee, Jungho Lee, Sangyoun Lee

The camouflaged object detection (COD) task aims to find and segment objects whose color or texture is very similar to that of the background. Despite its difficulty, COD is attracting attention in medical, lifesaving, and anti-military fields. To overcome the difficulties of COD, we propose a novel global-local aggregation architecture with a deformable point sampling method. Specifically, we propose a global-local aggregation transformer that integrates an object's global information with local information about the background and boundary, which is important for COD. The proposed transformer obtains global information from the feature channels and extracts important local information from subdivided patches using the deformable point sampling method. The model thus integrates global and local information about camouflaged objects effectively and shows that the boundary information that matters in COD can be exploited efficiently. Our method is evaluated on three popular datasets and achieves state-of-the-art performance. We demonstrate the effectiveness of the proposed method through comparative experiments.
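
As an illustration of deformable point sampling in general, the sketch below predicts small 2D offsets for a regular grid of sampling points and gathers features at the shifted locations with bilinear interpolation. The module name and all hyperparameters are assumptions, not the paper's design.

```python
# Minimal sketch of deformable point sampling: a small conv head predicts 2D
# offsets for a regular grid of sampling points, and local features are gathered
# at the offset locations with bilinear sampling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformablePointSampler(nn.Module):
    def __init__(self, channels=64, points_per_side=8):
        super().__init__()
        self.p = points_per_side
        # Predict (dx, dy) offsets for each of the p*p sampling points.
        self.offset_head = nn.Conv2d(channels, 2, kernel_size=3, padding=1)

    def forward(self, feat):
        # feat: (B, C, H, W)
        b, _, h, w = feat.shape
        ys = torch.linspace(-1, 1, self.p, device=feat.device)
        xs = torch.linspace(-1, 1, self.p, device=feat.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack([gx, gy], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)  # (B,p,p,2)
        # Predict offsets on a downsampled grid matching the sampling points.
        offsets = self.offset_head(F.adaptive_avg_pool2d(feat, self.p))          # (B,2,p,p)
        offsets = offsets.permute(0, 2, 3, 1).tanh() * (2.0 / self.p)            # small shifts
        grid = (base + offsets).clamp(-1, 1)
        sampled = F.grid_sample(feat, grid, mode="bilinear", align_corners=False)
        return sampled                                                           # (B,C,p,p)

feat = torch.randn(2, 64, 44, 44)
print(DeformablePointSampler()(feat).shape)  # torch.Size([2, 64, 8, 8])
```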

Hierarchically Decomposed Graph Convolutional Networks for Skeleton-Based Action Recognition

Aug 23, 2022
Jungho Lee, Minhyeok Lee, Dogyoon Lee, Sangyoun Lee

Graph convolutional networks (GCNs) are the most commonly used method for skeleton-based action recognition and have achieved remarkable performance. Generating adjacency matrices with semantically meaningful edges is particularly important for this task, but extracting such edges is a challenging problem. To solve this, we propose a hierarchically decomposed graph convolutional network (HD-GCN) architecture with a novel hierarchically decomposed graph (HD-Graph). The proposed HD-GCN effectively decomposes every joint node into several sets to extract major adjacent and distant edges, and uses them to construct an HD-Graph containing those edges within the same semantic spaces of a human skeleton. In addition, we introduce an attention-guided hierarchy aggregation (A-HA) module to highlight the dominant hierarchical edge sets of the HD-Graph. Furthermore, we apply a new two-stream-three-graph ensemble method that uses only the joint and bone streams without any motion stream. The proposed model is evaluated and achieves state-of-the-art performance on three large, popular datasets: NTU-RGB+D 60, NTU-RGB+D 120, and Northwestern-UCLA. Finally, we demonstrate the effectiveness of our model with various comparative experiments.
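
To illustrate how multiple decomposed edge sets can be consumed by a graph convolution, here is a generic multi-adjacency GCN layer: each adjacency matrix encodes one edge set and their outputs are summed. This is a textbook-style sketch, not HD-GCN's actual block; the hierarchical adjacency construction itself is left out.

```python
# Minimal sketch of a graph convolution over several decomposed edge sets: each
# adjacency matrix captures one subset of joint relations (e.g. nearby vs.
# distant edges), and their outputs are summed.
import torch
import torch.nn as nn

class MultiAdjGraphConv(nn.Module):
    def __init__(self, in_ch, out_ch, adjacencies):
        super().__init__()
        # adjacencies: (K, V, V) stack of normalized adjacency matrices
        self.register_buffer("A", adjacencies)
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=1) for _ in range(adjacencies.size(0))
        )

    def forward(self, x):
        # x: (B, C, T, V) skeleton features over T frames and V joints
        out = 0
        for conv, A in zip(self.convs, self.A):
            out = out + torch.einsum("bctv,vw->bctw", conv(x), A)
        return out

V = 25  # e.g. NTU-RGB+D joint count
A = torch.stack([torch.eye(V), torch.rand(V, V)])  # identity + one random edge set
layer = MultiAdjGraphConv(3, 64, A)
x = torch.randn(4, 3, 32, V)
print(layer(x).shape)  # torch.Size([4, 64, 32, 25])
```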

* 9 pages, 5 figures 

Expanded Adaptive Scaling Normalization for End to End Image Compression

Aug 05, 2022
Chajin Shin, Hyeongmin Lee, Hanbin Son, Sangjin Lee, Dogyoon Lee, Sangyoun Lee

Recently, learning-based image compression methods built on convolutional neural layers have developed rapidly. Rescaling modules such as batch normalization, which are often used in convolutional neural networks, do not operate adaptively for varying inputs. Therefore, Generalized Divisive Normalization (GDN) has been widely used in image compression to rescale input features adaptively across both the spatial and channel axes. However, the representation power, or degree of freedom, of GDN is severely limited, and GDN cannot account for the spatial correlation of an image. To handle these limitations, we construct an expanded form of the adaptive scaling module, named Expanded Adaptive Scaling Normalization (EASN). First, we exploit the swish function to increase the representation ability. Then, we enlarge the receptive field so that the adaptive rescaling module considers spatial correlation. Furthermore, we introduce an input mapping function to give the module a higher degree of freedom. We show how EASN works in an image compression network through visualizations of the feature maps, and we conduct extensive experiments showing that EASN increases rate-distortion performance remarkably and even outperforms VVC intra coding at high bit rates.
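
The rescaling idea can be sketched as follows: a scale is predicted from the input itself so the module adapts per input, the swish (SiLU) activation shapes the scale, and a 3x3 convolution gives the scale a spatial receptive field. This is an illustrative layer under assumed channel counts, not the exact EASN.

```python
# Minimal sketch of an adaptive rescaling module in the spirit described above:
# the scale depends on the input (adaptive), is shaped by swish, and sees a
# spatial neighborhood through a 3x3 convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSpatialScaling(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 3x3 conv so the scale depends on a neighborhood, not a single pixel.
        self.scale_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        scale = F.silu(self.scale_conv(x))   # swish(z) = z * sigmoid(z)
        return x * scale

x = torch.randn(1, 192, 16, 16)
print(AdaptiveSpatialScaling(192)(x).shape)  # torch.Size([1, 192, 16, 16])
```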

* ECCV2022 Accepted 

CKConv: Learning Feature Voxelization for Point Cloud Analysis

Jul 27, 2021
Sungmin Woo, Dogyoon Lee, Junhyeop Lee, Sangwon Hwang, Woojin Kim, Sangyoun Lee

Despite the remarkable success of deep learning, the optimal convolution operation on point clouds remains an open question due to their irregular data structure. In this paper, we present Cubic Kernel Convolution (CKConv), which learns to voxelize the features of local points by exploiting both continuous and discrete convolutions. Our continuous convolution uniquely employs a 3D cubic form of kernel weight representation that splits a feature into voxels in embedding space. By consecutively applying discrete 3D convolutions to the voxelized features in a spatial manner, the preceding continuous convolution is forced to learn a spatial feature mapping, i.e., feature voxelization. In this way, geometric information can be captured in detail by encoding with subdivided features, and our 3D convolutions on these fixed, structured data do not suffer from discretization artifacts thanks to the voxelization in embedding space. Furthermore, we propose a spatial attention module, Local Set Attention (LSA), to provide comprehensive structure awareness within the local point set and hence produce representative features. By learning feature voxelization with LSA, CKConv can extract enriched features for effective point cloud analysis. We show that CKConv is broadly applicable to point cloud processing tasks, including object classification, object part segmentation, and scene semantic segmentation, with state-of-the-art results.
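
A rough sketch of the overall data flow, under the assumption that the learned voxelization can be approximated by a simple linear map for illustration: each local feature set is projected onto a small 3D grid and then processed with a standard 3D convolution. This only shows the pipeline shape, not the paper's kernel design.

```python
# Rough sketch: local point features are mapped onto a small 3D grid
# ("voxelized" in embedding space) and then processed with a 3D convolution.
import torch
import torch.nn as nn

class VoxelizeThenConv(nn.Module):
    def __init__(self, in_dim=32, grid=3, out_dim=64):
        super().__init__()
        self.grid = grid
        # Map each point feature to a (grid^3)-cell volume of scalar responses.
        self.to_voxels = nn.Linear(in_dim, grid ** 3)
        self.conv3d = nn.Conv3d(1, out_dim, kernel_size=grid)  # consumes the whole cube

    def forward(self, local_feats):
        # local_feats: (B, N, K, C) features of K neighbors for each of N center points
        b, n, k, c = local_feats.shape
        vox = self.to_voxels(local_feats).mean(dim=2)            # (B, N, grid^3), pooled over neighbors
        vox = vox.view(b * n, 1, self.grid, self.grid, self.grid)
        out = self.conv3d(vox).view(b, n, -1)                    # (B, N, out_dim)
        return out

feats = torch.randn(2, 128, 16, 32)  # 128 centers, 16 neighbors, 32-dim features
print(VoxelizeThenConv()(feats).shape)  # torch.Size([2, 128, 64])
```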

Robust Lane Detection via Expanded Self Attention

Feb 14, 2021
Minhyeok Lee, Junhyeop Lee, Dogyoon Lee, Woojin Kim, Sangwon Hwang, Sangyoun Lee

Image-based lane detection is one of the key technologies in autonomous vehicles. Modern deep learning methods achieve high performance in lane detection, but it is still difficult to detect lanes accurately in challenging situations such as congested roads and extreme lighting conditions. To be robust in these situations, it is important to extract global contextual information even from limited visual cues. In this paper, we propose a simple but powerful self-attention mechanism optimized for lane detection, called the Expanded Self Attention (ESA) module. Inspired by the simple geometric structure of lanes, the proposed method predicts the confidence of a lane along the vertical and horizontal directions of an image. Predicting this confidence enables the network to estimate occluded locations by extracting global contextual information. The ESA module can be easily implemented and applied to any encoder-decoder-based model without increasing the inference time. The performance of our method is evaluated on three popular lane detection benchmarks (TuSimple, CULane, and BDD100K). We achieve state-of-the-art performance on CULane and BDD100K and a distinct improvement on the TuSimple dataset. The experimental results show that our approach is robust to occlusion and extreme lighting conditions.
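
The axis-wise confidence idea can be sketched by collapsing the feature map along one axis at a time and predicting a per-row and per-column confidence, as below. This is an illustrative head with assumed names and shapes, not the exact ESA module.

```python
# Minimal sketch of predicting lane confidence along the vertical and horizontal
# axes: the feature map is averaged along one axis at a time and a 1x1 1D
# convolution produces per-row and per-column confidences.
import torch
import torch.nn as nn

class AxisConfidence(nn.Module):
    def __init__(self, channels, num_lanes):
        super().__init__()
        self.row_head = nn.Conv1d(channels, num_lanes, kernel_size=1)
        self.col_head = nn.Conv1d(channels, num_lanes, kernel_size=1)

    def forward(self, feat):
        # feat: (B, C, H, W)
        row_feat = feat.mean(dim=3)                       # collapse width  -> (B, C, H)
        col_feat = feat.mean(dim=2)                       # collapse height -> (B, C, W)
        row_conf = torch.sigmoid(self.row_head(row_feat)) # (B, lanes, H): vertical confidence
        col_conf = torch.sigmoid(self.col_head(col_feat)) # (B, lanes, W): horizontal confidence
        return row_conf, col_conf

feat = torch.randn(2, 128, 36, 100)
r, c = AxisConfidence(128, num_lanes=4)(feat)
print(r.shape, c.shape)  # torch.Size([2, 4, 36]) torch.Size([2, 4, 100])
```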

* 10 pages, 8 figures, 4 tables 

Regularization Strategy for Point Cloud via Rigidly Mixed Sample

Feb 03, 2021
Dogyoon Lee, Jaeha Lee, Junhyeop Lee, Hyeongmin Lee, Minhyeok Lee, Sungmin Woo, Sangyoun Lee

Data augmentation is an effective regularization strategy for alleviating overfitting, an inherent drawback of deep neural networks. However, data augmentation is rarely considered for point cloud processing, even though many augmentation methods have been proposed for image data. Regularization is in fact essential for point clouds, because the small size of available datasets makes a lack of generality more likely. This paper proposes Rigid Subset Mix (RSMix), a novel data augmentation method for point clouds that generates a virtual mixed sample by replacing part of one sample with a shape-preserved subset of another. RSMix preserves the structural information of each point cloud sample by extracting subsets without deformation, using a neighboring function designed around the unique properties of point clouds: their unordered structure and lack of a regular grid. Experiments verify that RSMix successfully regularizes deep neural networks, with remarkable improvements in shape classification. We also analyze various combinations of data augmentations, including RSMix, under single- and multi-view evaluations, based on extensive ablation studies.
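
A minimal NumPy sketch of the mixing step, assuming a fixed ball radius and a simple resampling policy: a ball of points around a random query is removed from one sample and replaced by a rigidly translated ball cut from another. The radius handling and label mixing are simplified relative to the paper.

```python
# Minimal sketch of a rigid mix for two point clouds: a ball around a random
# query is cut out of sample A and replaced by a shape-preserved ball cut from
# sample B, translated to the removed region's center (no deformation).
import numpy as np

def rigid_subset_mix(pc_a, pc_b, radius=0.3, rng=None):
    # pc_a, pc_b: (N, 3) point clouds normalized to the unit sphere
    rng = rng or np.random.default_rng()
    qa = pc_a[rng.integers(len(pc_a))]            # query point in A (region to remove)
    qb = pc_b[rng.integers(len(pc_b))]            # query point in B (subset to insert)
    keep_a = pc_a[np.linalg.norm(pc_a - qa, axis=1) > radius]
    subset_b = pc_b[np.linalg.norm(pc_b - qb, axis=1) <= radius]
    subset_b = subset_b - qb + qa                 # move B's subset rigidly
    mixed = np.concatenate([keep_a, subset_b], axis=0)
    lam = len(subset_b) / len(mixed)              # mixing ratio for the label, CutMix-style
    # Resample to the original number of points.
    idx = rng.choice(len(mixed), size=len(pc_a), replace=len(mixed) < len(pc_a))
    return mixed[idx], lam

a = np.random.randn(1024, 3) / 3
b = np.random.randn(1024, 3) / 3
mixed, lam = rigid_subset_mix(a, b)
print(mixed.shape, round(lam, 3))  # (1024, 3) and the label-mixing ratio
```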

* 10 pages, 5 figures, 7 tables 

False Positive Removal for 3D Vehicle Detection with Penetrated Point Classifier

May 28, 2020
Sungmin Woo, Sangwon Hwang, Woojin Kim, Junhyeop Lee, Dogyoon Lee, Sangyoun Lee

Recently, researchers have been leveraging LiDAR point clouds for higher accuracy in 3D vehicle detection. Most state-of-the-art methods are deep learning based, but they are easily affected by the number of points generated on an object. This vulnerability leads to numerous false-positive boxes at high recall positions, where objects are occasionally predicted from only a few points. To address the issue, we introduce the Penetrated Point Classifier (PPC), based on the underlying property of LiDAR that points cannot be generated behind vehicles. It determines whether a point exists behind the vehicle of the predicted box, and if one does, the box is identified as a false positive. Our straightforward yet unprecedented approach is evaluated on the KITTI dataset and improves the performance of PointRCNN, one of the state-of-the-art methods. The experimental results show that precision at the highest recall position is dramatically increased, by 15.46 percentage points and 14.63 percentage points on the moderate and hard difficulties of the car class, respectively.
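
The penetration test can be sketched in bird's-eye view: if the ray from the sensor to a point passes through a predicted box's footprint and the point lies farther away than the box, that point sits behind the box. The sketch below uses an axis-aligned box for simplicity; real detector boxes are rotated, and the margin value is an assumption.

```python
# Minimal sketch of the penetration test: LiDAR points cannot appear behind a
# real vehicle, so a point whose ray passes through the predicted box's BEV
# footprint and lies beyond it marks the box as suspect.
import numpy as np

def ray_hits_box_bev(p, box_min, box_max):
    # Does the 2D ray from the sensor (origin) towards p cross the BEV rectangle?
    d = p / (np.linalg.norm(p) + 1e-9)
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = box_min / d          # slab entry/exit distances per axis
        t2 = box_max / d
    tmin = np.minimum(t1, t2).max()
    tmax = np.maximum(t1, t2).min()
    return tmax >= max(tmin, 0.0), tmax   # (hit?, distance where the ray exits the box)

def has_penetrated_point(points_bev, box_min, box_max, margin=0.5):
    # points_bev: (N, 2) LiDAR points in bird's-eye view, sensor at the origin
    for p in points_bev:
        hit, exit_dist = ray_hits_box_bev(p, box_min, box_max)
        if hit and np.linalg.norm(p) > exit_dist + margin:
            return True                    # a point exists behind the box -> false positive
    return False

pts = np.array([[5.0, 0.1], [12.0, 0.3]])          # the second point lies behind the box
print(has_penetrated_point(pts, np.array([4.0, -1.0]), np.array([8.0, 1.0])))  # True
```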

* Accepted by ICIP 2020 