Robby T. Tan

Structure Representation Network and Uncertainty Feedback Learning for Dense Non-Uniform Fog Removal

Oct 06, 2022

Yeying Jin, Wending Yan, Wenhan Yang, Robby T. Tan

Abstract: Few existing image defogging or dehazing methods consider dense and non-uniform particle distributions, which usually occur in smoke, dust and fog. Dealing with these dense and/or non-uniform distributions can be intractable, since fog's attenuation and airlight (or veiling effect) significantly weaken the background scene information in the input image. To address this problem, we introduce a structure-representation network with uncertainty feedback learning. Specifically, we extract the feature representations from a pre-trained Vision Transformer (DINO-ViT) module to recover the background information. To guide our network to focus on non-uniform fog areas and then remove the fog accordingly, we introduce uncertainty feedback learning, which produces uncertainty maps that have higher uncertainty in denser fog regions and can be regarded as attention maps that represent the fog's density and uneven distribution. Based on the uncertainty map, our feedback network refines our defogged output iteratively. Moreover, to handle the intractability of estimating the atmospheric light colors, we exploit the grayscale version of our input image, since it is less affected by varying light colors that are possibly present in the input image. The experimental results demonstrate the effectiveness of our method both quantitatively and qualitatively compared to the state-of-the-art methods in handling dense and non-uniform fog or smoke.

* Accepted to ACCV2022, data in https://github.com/jinyeying/FogRemoval
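
The grayscale step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which conversion is used; the ITU-R BT.601 luma weights below are a common choice and are an assumption here, as are the function names `to_gray` and `gray_image`.

```python
# BT.601 luma weights: an assumed choice, not taken from the paper.
LUMA_WEIGHTS = (0.299, 0.587, 0.114)


def to_gray(pixel):
    """Convert one (R, G, B) pixel to a single luma value."""
    r, g, b = pixel
    wr, wg, wb = LUMA_WEIGHTS
    return wr * r + wg * g + wb * b


def gray_image(image):
    """Convert an image given as rows of (R, G, B) tuples to grayscale."""
    return [[to_gray(p) for p in row] for row in image]
```

Because the three channels collapse into one weighted sum, a color cast in the atmospheric light changes the result less than it changes the individual channels, which is the intuition the abstract appeals to.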

Bottom-Up 2D Pose Estimation via Dual Anatomical Centers for Small-Scale Persons

Aug 25, 2022

Yu Cheng, Yihao Ai, Bo Wang, Xinchao Wang, Robby T. Tan

Abstract: In multi-person 2D pose estimation, bottom-up methods simultaneously predict poses for all persons and, unlike top-down methods, do not rely on human detection. However, the accuracy of SOTA bottom-up methods is still inferior to that of existing top-down methods. This is because the predicted human poses are regressed based on an inconsistent human bounding box center and lack human-scale normalization, which leads to inaccurate poses and missed small-scale persons. To push the envelope of bottom-up pose estimation, we first propose multi-scale training to enhance the network's ability to handle scale variation with single-scale testing, particularly for small-scale persons. Second, we introduce dual anatomical centers (i.e., head and body), with which we can predict the human poses more accurately and reliably, especially for small-scale persons. Moreover, existing bottom-up methods use multi-scale testing to boost the accuracy of pose estimation at the price of multiple additional forward passes, which weakens the efficiency of bottom-up methods, their core strength compared to top-down methods. By contrast, our multi-scale training enables the model to predict high-quality poses in a single forward pass (i.e., single-scale testing). Our method achieves a 38.4% improvement in bounding box precision and a 39.1% improvement in bounding box recall over the state of the art (SOTA) on the challenging small-scale persons subset of COCO. For the human pose AP evaluation, we achieve a new SOTA (71.0 AP) on the COCO test-dev set with single-scale testing. We also achieve the top performance (40.3 AP) on the OCHuman dataset in cross-dataset evaluation.

* 29 pages, 12 figures and 6 tables

Unsupervised Night Image Enhancement: When Layer Decomposition Meets Light-Effects Suppression

Jul 21, 2022

Yeying Jin, Wenhan Yang, Robby T. Tan

Abstract: Night images suffer not only from low light, but also from uneven distributions of light. Most existing night visibility enhancement methods focus mainly on enhancing low-light regions. This inevitably leads to over-enhancement and saturation in bright regions, such as those regions affected by light effects (glare, floodlight, etc.). To address this problem, we need to suppress the light effects in bright regions while, at the same time, boosting the intensity of dark regions. With this idea in mind, we introduce an unsupervised method that integrates a layer decomposition network and a light-effects suppression network. Given a single night image as input, our decomposition network learns to decompose shading, reflectance and light-effects layers, guided by unsupervised layer-specific prior losses. Our light-effects suppression network further suppresses the light effects and, at the same time, enhances the illumination in dark regions. This light-effects suppression network exploits the estimated light-effects layer as guidance to focus on the light-effects regions. To recover the background details and reduce hallucination/artefacts, we propose structure and high-frequency consistency losses. Our quantitative and qualitative evaluations on real images show that our method outperforms state-of-the-art methods in suppressing night light effects and boosting the intensity of dark regions.

* Accepted to ECCV2022

DC-ShadowNet: Single-Image Hard and Soft Shadow Removal Using Unsupervised Domain-Classifier Guided Network

Jul 21, 2022

Yeying Jin, Aashish Sharma, Robby T. Tan

Abstract: Shadow removal from a single image is generally still an open problem. Most existing learning-based methods use supervised learning and require a large number of paired images (shadow and corresponding non-shadow images) for training. A recent unsupervised method, Mask-ShadowGAN, addresses this limitation. However, it requires a binary mask to represent shadow regions, making it inapplicable to soft shadows. To address this problem, we propose an unsupervised domain-classifier guided shadow removal network, DC-ShadowNet. Specifically, we propose to integrate a shadow/shadow-free domain classifier into a generator and its discriminator, enabling them to focus on shadow regions. To train our network, we introduce novel losses based on physics-based shadow-free chromaticity, shadow-robust perceptual features, and boundary smoothness. Moreover, we show that our unsupervised network can be used for test-time training that further improves the results. Our experiments show that all these novel components allow our method to handle soft shadows, and also to perform better on hard shadows, both quantitatively and qualitatively, than the existing state-of-the-art shadow removal methods.

* Accepted to ICCV2021, https://github.com/jinyeying/DC-ShadowNet-Hard-and-Soft-Shadow-Removal

Feature-Aligned Video Raindrop Removal with Temporal Constraints

May 29, 2022

Wending Yan, Lu Xu, Wenhan Yang, Robby T. Tan

Abstract: Existing adherent raindrop removal methods focus on the detection of the raindrop locations, and then use inpainting techniques or generative networks to recover the background behind raindrops. Yet, as adherent raindrops are diverse in size and appearance, the detection is challenging for both single images and videos. Moreover, unlike rain streaks, adherent raindrops tend to cover the same area in several frames. Addressing these problems, our method employs a two-stage video-based raindrop removal method. The first stage is the single image module, which generates initial clean results. The second stage is the multiple frame module, which further refines the initial results using temporal constraints, namely, by utilizing multiple input frames in our process and applying temporal consistency between adjacent output frames. Our single image module employs a raindrop removal network to generate initial raindrop removal results, and creates a mask representing the differences between the input and initial output. Once the masks and initial results for consecutive frames are obtained, our multiple-frame module aligns the frames at both the image and feature levels and then obtains the clean background. Our method initially employs optical flow to align the frames, and then utilizes deformable convolution layers to further achieve feature-level frame alignment. To remove small raindrops and recover correct backgrounds, a target frame is predicted from adjacent frames. A series of unsupervised losses are proposed so that our second stage, which is the video raindrop removal module, can self-learn from video data without ground truths. Experimental results on real videos demonstrate the state-of-the-art performance of our method both quantitatively and qualitatively.
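
The mask "representing the differences between the input and initial output" can be pictured with a small sketch. The abstract does not say whether the mask is binary or how it is thresholded; the binary form, the `threshold` parameter, and the name `difference_mask` are assumptions made only for illustration.

```python
def difference_mask(input_img, output_img, threshold=0.1):
    """Mark pixels where the input and the initial output disagree.

    Both images are rows of grayscale intensities in [0, 1].
    The threshold is a hypothetical parameter, not taken from the paper.
    """
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(row_in, row_out)]
        for row_in, row_out in zip(input_img, output_img)
    ]


# A pixel that the removal network changed strongly is flagged as a raindrop candidate.
mask = difference_mask([[0.2, 0.9]], [[0.2, 0.4]])
```

Pixels flagged in such a mask are the ones the first stage modified, so they point the second stage at likely raindrop locations.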

Dual networks based 3D Multi-Person Pose Estimation from Monocular Video

May 06, 2022

Yu Cheng, Bo Wang, Robby T. Tan

Abstract: Monocular 3D human pose estimation has made progress in recent years. Most of the methods focus on single persons and estimate the poses in person-centric coordinates, i.e., the coordinates based on the center of the target person. Hence, these methods are inapplicable to multi-person 3D pose estimation, where the absolute coordinates (e.g., the camera coordinates) are required. Moreover, multi-person pose estimation is more challenging than single-person pose estimation, due to inter-person occlusion and close human interactions. Existing top-down multi-person methods rely on human detection, and thus suffer from detection errors and cannot produce reliable pose estimation in multi-person scenes. Meanwhile, existing bottom-up methods that do not use human detection are not affected by detection errors, but since they process all persons in a scene at once, they are prone to errors, particularly for persons at small scales. To address all these challenges, we propose the integration of top-down and bottom-up approaches to exploit their strengths. Our top-down network estimates human joints from all persons instead of one in an image patch, making it robust to possible erroneous bounding boxes. Our bottom-up network incorporates human-detection based normalized heatmaps, allowing the network to be more robust in handling scale variations. Finally, the estimated 3D poses from the top-down and bottom-up networks are fed into our integration network for final 3D poses. To address the common gaps between training and testing data, we perform optimization at test time, refining the estimated 3D human poses using high-order temporal constraints, re-projection loss, and bone length regularizations. Our evaluations demonstrate the effectiveness of the proposed method. Code and models are available: https://github.com/3dpose/3D-Multi-Person-Pose.

* Accepted by TPAMI 2022. arXiv admin note: substantial text overlap with arXiv:2104.01797
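
One way to read the "bone length regularizations" mentioned in the abstract is as a penalty on how much each bone's length varies across frames. The sketch below follows that reading; the variance-based form and the names `bone_length` and `bone_length_penalty` are assumptions, not the paper's definition.

```python
import math


def bone_length(pose, bone):
    """Euclidean length of one bone, given as a pair of joint indices."""
    a, b = bone
    return math.dist(pose[a], pose[b])


def bone_length_penalty(poses, bones):
    """Mean over bones of the variance of that bone's length across frames.

    poses: list of frames, each a list of (x, y, z) joints.
    bones: list of (joint_index, joint_index) pairs.
    """
    total = 0.0
    for bone in bones:
        lengths = [bone_length(p, bone) for p in poses]
        mean = sum(lengths) / len(lengths)
        total += sum((length - mean) ** 2 for length in lengths) / len(lengths)
    return total / len(bones)
```

A skeleton that only translates between frames gives a penalty of zero, while a bone that stretches or shrinks over time is penalized, which matches the physical fact that bone lengths of one person stay constant.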

Human Object Interaction Detection using Two-Direction Spatial Enhancement and Exclusive Object Prior

May 07, 2021

Lu Liu, Robby T. Tan

Abstract: Human-Object Interaction (HOI) detection aims to detect visual relations between humans and objects in images. One significant problem of HOI detection is that non-interactive human-object pairs can be easily mis-grouped and misclassified as an action, especially when humans are close and performing similar actions in the scene. To address the mis-grouping problem, we propose a spatial enhancement approach to enforce fine-level spatial constraints in two directions: from human body parts to the object center, and from object parts to the human center. At inference, we propose a human-object regrouping approach that considers the object-exclusive property of an action, where the target object should not be shared by more than one human. By suppressing non-interactive pairs, our approach can decrease the false positives. Experiments on the V-COCO and HICO-DET datasets demonstrate that our approach is more robust than the existing methods in the presence of multiple humans and objects in the scene.
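
The object-exclusive idea at inference can be sketched as keeping, for each object and action, only the best-scoring human. This is a simplified illustration: it treats every action as exclusive and uses a plain argmax, both of which are assumptions, and `regroup_exclusive` is a hypothetical name rather than the authors' procedure.

```python
def regroup_exclusive(pairs):
    """Keep only the highest-scoring human for each (object, action).

    pairs: list of (human_id, object_id, action, score) tuples.
    Returns the surviving pairs sorted by descending score.
    """
    best = {}
    for human, obj, action, score in pairs:
        key = (obj, action)
        if key not in best or score > best[key][3]:
            best[key] = (human, obj, action, score)
    return sorted(best.values(), key=lambda p: -p[3])


detections = [
    ("h1", "cup", "drink", 0.9),
    ("h2", "cup", "drink", 0.4),   # same cup, lower score: suppressed
    ("h2", "ball", "kick", 0.7),
]
kept = regroup_exclusive(detections)
```

Suppressing the weaker claim on a shared object removes one false positive without touching pairs that involve different objects.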

Monocular 3D Multi-Person Pose Estimation by Integrating Top-Down and Bottom-Up Networks

Apr 07, 2021

Yu Cheng, Bo Wang, Bo Yang, Robby T. Tan

Abstract: In monocular video 3D multi-person pose estimation, inter-person occlusion and close interactions can cause human detection to be erroneous and human-joints grouping to be unreliable. Existing top-down methods rely on human detection and thus suffer from these problems. Existing bottom-up methods do not use human detection, but they process all persons at once at the same scale, causing them to be sensitive to multi-person scale variations. To address these challenges, we propose the integration of top-down and bottom-up approaches to exploit their strengths. Our top-down network estimates human joints from all persons instead of one in an image patch, making it robust to possible erroneous bounding boxes. Our bottom-up network incorporates human-detection based normalized heatmaps, allowing the network to be more robust in handling scale variations. Finally, the estimated 3D poses from the top-down and bottom-up networks are fed into our integration network for final 3D poses. Besides the integration of top-down and bottom-up networks, unlike existing pose discriminators that are designed solely for a single person, and consequently cannot assess natural inter-person interactions, we propose a two-person pose discriminator that enforces natural two-person interactions. Lastly, we also apply a semi-supervised method to overcome the 3D ground-truth data scarcity. Our quantitative and qualitative evaluations show the effectiveness of our method compared to the state-of-the-art baselines.

* Accepted to CVPR 2021. Code is available at: https://github.com/3dpose/3D-Multi-Person-Pose

Graph and Temporal Convolutional Networks for 3D Multi-person Pose Estimation in Monocular Videos

Dec 22, 2020

Yu Cheng, Bo Wang, Bo Yang, Robby T. Tan

Abstract: Despite the recent progress, 3D multi-person pose estimation from monocular videos is still challenging due to the commonly encountered problem of missing information caused by occlusion, partially out-of-frame target persons, and inaccurate person detection. To tackle this problem, we propose a novel framework integrating graph convolutional networks (GCNs) and temporal convolutional networks (TCNs) to robustly estimate camera-centric multi-person 3D poses without requiring camera parameters. In particular, we introduce a human-joint GCN, which, unlike existing GCNs, is based on a directed graph that employs the 2D pose estimator's confidence scores to improve the pose estimation results. We also introduce a human-bone GCN, which models the bone connections and provides more information beyond human joints. The two GCNs work together to estimate the spatial frame-wise 3D poses and can make use of both visible joint and bone information in the target frame to estimate the occluded or missing human-part information. To further refine the 3D pose estimation, we use our temporal convolutional networks (TCNs) to enforce the temporal and human-dynamics constraints. We use a joint-TCN to estimate person-centric 3D poses across frames, and propose a velocity-TCN to estimate the speed of 3D joints to ensure the consistency of the 3D pose estimation in consecutive frames. Finally, to estimate the 3D human poses for multiple persons, we propose a root-TCN that estimates camera-centric 3D poses without requiring camera parameters. Quantitative and qualitative evaluations demonstrate the effectiveness of the proposed method.

* 10 pages, 3 figures, Accepted to AAAI 2021

Multi-Scale Networks for 3D Human Pose Estimation with Inference Stage Optimization

Oct 16, 2020

Cheng Yu, Bo Wang, Bo Yang, Robby T. Tan

Abstract: Estimating 3D human poses from a monocular video is still a challenging task. The performance of many existing methods drops when the target person is occluded by other objects, or the motion is too fast/slow relative to the scale and speed of the training data. Moreover, many of these methods are not designed or trained explicitly under severe occlusion, which compromises their performance in handling occlusion. Addressing these problems, we introduce a spatio-temporal network for robust 3D human pose estimation. As humans in videos may appear at different scales and have various motion speeds, we apply multi-scale spatial features for 2D joints or keypoints prediction in each individual frame, and multi-stride temporal convolutional networks (TCNs) to estimate 3D joints or keypoints. Furthermore, we design a spatio-temporal discriminator based on body structures as well as limb motions to assess whether the predicted pose forms a valid pose and a valid movement. During training, we explicitly mask out some keypoints to simulate various occlusion cases, from minor to severe occlusion, so that our network can learn better and become robust to various degrees of occlusion. As there are limited 3D ground-truth data, we further utilize 2D video data to inject a semi-supervised learning capability into our network. Moreover, we observe that there is a discrepancy between 3D pose prediction and 2D pose estimation due to different pose variations between video and image training datasets. We therefore propose a confidence-based inference stage optimization to adaptively enforce 3D pose projection to match 2D pose estimation to further improve final pose prediction accuracy. Experiments on public datasets validate the effectiveness of our method, and our ablation studies show the strengths of our network's individual submodules.

* 14 pages, 13 figures. arXiv admin note: substantial text overlap with arXiv:2004.11822
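
The confidence-based inference stage optimization needs some measure of how well the projected 3D pose agrees with the 2D estimate. A minimal sketch of one such measure is shown below, assuming a pinhole camera with a known focal length and principal point and a confidence-weighted squared error; these choices and the names `project` and `weighted_reprojection_error` are illustrative assumptions, not the paper's formulation.

```python
def project(joint, focal, center):
    """Pinhole projection of one (x, y, z) joint to image coordinates."""
    x, y, z = joint
    return (focal * x / z + center[0], focal * y / z + center[1])


def weighted_reprojection_error(joints_3d, joints_2d, confidences, focal, center):
    """Confidence-weighted mean squared distance between projected and detected joints."""
    total = 0.0
    weight_sum = 0.0
    for j3, j2, c in zip(joints_3d, joints_2d, confidences):
        u, v = project(j3, focal, center)
        total += c * ((u - j2[0]) ** 2 + (v - j2[1]) ** 2)
        weight_sum += c
    # With no confident detections there is nothing to enforce.
    return total / weight_sum if weight_sum > 0 else 0.0
```

Weighting by confidence means a poorly detected 2D joint contributes little to the error, so the 3D pose is pulled toward the 2D estimate only where that estimate is trustworthy.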
