Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Robby T. Tan

HEAP: Unsupervised Object Discovery and Localization with Contrastive Grouping

Jan 04, 2024

Xin Zhang, Jinheng Xie, Yuan Yuan, Michael Bi Mi, Robby T. Tan

Abstract:Unsupervised object discovery and localization aims to detect or segment objects in an image without any supervision. Recent efforts have demonstrated a notable potential to identify salient foreground objects by utilizing self-supervised transformer features. However, their scopes only build upon patch-level features within an image, neglecting region/image-level and cross-image relationships at a broader scale. Moreover, these methods cannot differentiate various semantics from multiple instances. To address these problems, we introduce Hierarchical mErging framework via contrAstive grouPing (HEAP). Specifically, a novel lightweight head with cross-attention mechanism is designed to adaptively group intra-image patches into semantically coherent regions based on correlation among self-supervised features. Further, to ensure the distinguishability among various regions, we introduce a region-level contrastive clustering loss to pull closer similar regions across images. Also, an image-level contrastive loss is present to push foreground and background representations apart, with which foreground objects and background are accordingly discovered. HEAP facilitates efficient hierarchical image decomposition, which contributes to more accurate object discovery while also enabling differentiation among objects of various classes. Extensive experimental results on semantic segmentation retrieval, unsupervised object discovery, and saliency detection tasks demonstrate that HEAP achieves state-of-the-art performance.

* Accepted by AAAI24

Via

Access Paper or Ask Questions

Few-Shot Learning from Augmented Label-Uncertain Queries in Bongard-HOI

Dec 17, 2023

Qinqian Lei, Bo Wang, Robby T. Tan

Abstract:Detecting human-object interactions (HOI) in a few-shot setting remains a challenge. Existing meta-learning methods struggle to extract representative features for classification due to the limited data, while existing few-shot HOI models rely on HOI text labels for classification. Moreover, some query images may display visual similarity to those outside their class, such as similar backgrounds between different HOI classes. This makes learning more challenging, especially with limited samples. Bongard-HOI (Jiang et al. 2022) epitomizes this HOI few-shot problem, making it the benchmark we focus on in this paper. In our proposed method, we introduce novel label-uncertain query augmentation techniques to enhance the diversity of the query inputs, aiming to distinguish the positive HOI class from the negative ones. As these augmented inputs may or may not have the same class label as the original inputs, their class label is unknown. Those belonging to a different class become hard samples due to their visual similarity to the original ones. Additionally, we introduce a novel pseudo-label generation technique that enables a mean teacher model to learn from the augmented label-uncertain inputs. We propose to augment the negative support set for the student model to enrich the semantic information, fostering diversity that challenges and enhances the student's learning. Experimental results demonstrate that our method sets a new state-of-the-art (SOTA) performance by achieving 68.74% accuracy on the Bongard-HOI benchmark, a significant improvement over the existing SOTA of 66.59%. In our evaluation on HICO-FS, a more general few-shot recognition dataset, our method achieves 73.27% accuracy, outperforming the previous SOTA of 71.20% in the 5-way 5-shot task.

* 9 pages, 4 figures

Via

Access Paper or Ask Questions

ORTexME: Occlusion-Robust Human Shape and Pose via Temporal Average Texture and Mesh Encoding

Sep 21, 2023

Yu Cheng, Bo Wang, Robby T. Tan

Figure 1 for ORTexME: Occlusion-Robust Human Shape and Pose via Temporal Average Texture and Mesh Encoding

Figure 2 for ORTexME: Occlusion-Robust Human Shape and Pose via Temporal Average Texture and Mesh Encoding

Figure 3 for ORTexME: Occlusion-Robust Human Shape and Pose via Temporal Average Texture and Mesh Encoding

Figure 4 for ORTexME: Occlusion-Robust Human Shape and Pose via Temporal Average Texture and Mesh Encoding

Abstract:In 3D human shape and pose estimation from a monocular video, models trained with limited labeled data cannot generalize well to videos with occlusion, which is common in the wild videos. The recent human neural rendering approaches focusing on novel view synthesis initialized by the off-the-shelf human shape and pose methods have the potential to correct the initial human shape. However, the existing methods have some drawbacks such as, erroneous in handling occlusion, sensitive to inaccurate human segmentation, and ineffective loss computation due to the non-regularized opacity field. To address these problems, we introduce ORTexME, an occlusion-robust temporal method that utilizes temporal information from the input video to better regularize the occluded body parts. While our ORTexME is based on NeRF, to determine the reliable regions for the NeRF ray sampling, we utilize our novel average texture learning approach to learn the average appearance of a person, and to infer a mask based on the average texture. In addition, to guide the opacity-field updates in NeRF to suppress blur and noise, we propose the use of human body mesh. The quantitative evaluation demonstrates that our method achieves significant improvement on the challenging multi-person 3DPW dataset, where our method achieves 1.8 P-MPJPE error reduction. The SOTA rendering-based methods fail and enlarge the error up to 5.6 on the same dataset.

* 8 pages, 8 figures

Via

Access Paper or Ask Questions

Auxiliary Tasks Benefit 3D Skeleton-based Human Motion Prediction

Sep 02, 2023

Chenxin Xu, Robby T. Tan, Yuhong Tan, Siheng Chen, Xinchao Wang, Yanfeng Wang

Figure 1 for Auxiliary Tasks Benefit 3D Skeleton-based Human Motion Prediction

Figure 2 for Auxiliary Tasks Benefit 3D Skeleton-based Human Motion Prediction

Figure 3 for Auxiliary Tasks Benefit 3D Skeleton-based Human Motion Prediction

Figure 4 for Auxiliary Tasks Benefit 3D Skeleton-based Human Motion Prediction

Abstract:Exploring spatial-temporal dependencies from observed motions is one of the core challenges of human motion prediction. Previous methods mainly focus on dedicated network structures to model the spatial and temporal dependencies. This paper considers a new direction by introducing a model learning framework with auxiliary tasks. In our auxiliary tasks, partial body joints' coordinates are corrupted by either masking or adding noise and the goal is to recover corrupted coordinates depending on the rest coordinates. To work with auxiliary tasks, we propose a novel auxiliary-adapted transformer, which can handle incomplete, corrupted motion data and achieve coordinate recovery via capturing spatial-temporal dependencies. Through auxiliary tasks, the auxiliary-adapted transformer is promoted to capture more comprehensive spatial-temporal dependencies among body joints' coordinates, leading to better feature learning. Extensive experimental results have shown that our method outperforms state-of-the-art methods by remarkable margins of 7.2%, 3.7%, and 9.4% in terms of 3D mean per joint position error (MPJPE) on the Human3.6M, CMU Mocap, and 3DPW datasets, respectively. We also demonstrate that our method is more robust under data missing cases and noisy data cases. Code is available at https://github.com/MediaBrain-SJTU/AuxFormer.

* Accpeted to ICCV2023

Via

Access Paper or Ask Questions

Enhancing Visibility in Nighttime Haze Images Using Guided APSF and Gradient Adaptive Convolution

Aug 05, 2023

Yeying Jin, Beibei Lin, Wending Yan, Wei Ye, Yuan Yuan, Robby T. Tan

Abstract:Visibility in hazy nighttime scenes is frequently reduced by multiple factors, including low light, intense glow, light scattering, and the presence of multicolored light sources. Existing nighttime dehazing methods often struggle with handling glow or low-light conditions, resulting in either excessively dark visuals or unsuppressed glow outputs. In this paper, we enhance the visibility from a single nighttime haze image by suppressing glow and enhancing low-light regions. To handle glow effects, our framework learns from the rendered glow pairs. Specifically, a light source aware network is proposed to detect light sources of night images, followed by the APSF (Angular Point Spread Function)-guided glow rendering. Our framework is then trained on the rendered images, resulting in glow suppression. Moreover, we utilize gradient-adaptive convolution, to capture edges and textures in hazy scenes. By leveraging extracted edges and textures, we enhance the contrast of the scene without losing important structural details. To boost low-light intensity, our network learns an attention map, then adjusted by gamma correction. This attention has high values on low-light regions and low values on haze and glow regions. Extensive evaluation on real nighttime haze images, demonstrates the effectiveness of our method. Our experiments demonstrate that our method achieves a PSNR of 30.38dB, outperforming state-of-the-art methods by 13$\%$ on GTA5 nighttime haze dataset. Our data and code is available at: \url{https://github.com/jinyeying/nighttime_dehaze}.

* Published in ACMMM2023
* Accepted to ACMMM2023, https://github.com/jinyeying/nighttime_dehaze

Via

Access Paper or Ask Questions

Dynamic Transformers Provide a False Sense of Efficiency

May 20, 2023

Yiming Chen, Simin Chen, Zexin Li, Wei Yang, Cong Liu, Robby T. Tan, Haizhou Li

Figure 1 for Dynamic Transformers Provide a False Sense of Efficiency

Figure 2 for Dynamic Transformers Provide a False Sense of Efficiency

Figure 3 for Dynamic Transformers Provide a False Sense of Efficiency

Figure 4 for Dynamic Transformers Provide a False Sense of Efficiency

Abstract:Despite much success in natural language processing (NLP), pre-trained language models typically lead to a high computational cost during inference. Multi-exit is a mainstream approach to address this issue by making a trade-off between efficiency and accuracy, where the saving of computation comes from an early exit. However, whether such saving from early-exiting is robust remains unknown. Motivated by this, we first show that directly adapting existing adversarial attack approaches targeting model accuracy cannot significantly reduce inference efficiency. To this end, we propose a simple yet effective attacking framework, SAME, a novel slowdown attack framework on multi-exit models, which is specially tailored to reduce the efficiency of the multi-exit models. By leveraging the multi-exit models' design characteristics, we utilize all internal predictions to guide the adversarial sample generation instead of merely considering the final prediction. Experiments on the GLUE benchmark show that SAME can effectively diminish the efficiency gain of various multi-exit models by 80% on average, convincingly validating its effectiveness and generalization ability.

* Accepted by ACL2023

Via

Access Paper or Ask Questions

DSFNet: Dual Space Fusion Network for Occlusion-Robust 3D Dense Face Alignment

May 19, 2023

Heyuan Li, Bo Wang, Yu Cheng, Mohan Kankanhalli, Robby T. Tan

Abstract:Sensitivity to severe occlusion and large view angles limits the usage scenarios of the existing monocular 3D dense face alignment methods. The state-of-the-art 3DMM-based method, directly regresses the model's coefficients, underutilizing the low-level 2D spatial and semantic information, which can actually offer cues for face shape and orientation. In this work, we demonstrate how modeling 3D facial geometry in image and model space jointly can solve the occlusion and view angle problems. Instead of predicting the whole face directly, we regress image space features in the visible facial region by dense prediction first. Subsequently, we predict our model's coefficients based on the regressed feature of the visible regions, leveraging the prior knowledge of whole face geometry from the morphable models to complete the invisible regions. We further propose a fusion network that combines the advantages of both the image and model space predictions to achieve high robustness and accuracy in unconstrained scenarios. Thanks to the proposed fusion module, our method is robust not only to occlusion and large pitch and roll view angles, which is the benefit of our image space approach, but also to noise and large yaw angles, which is the benefit of our model space method. Comprehensive evaluations demonstrate the superior performance of our method compared with the state-of-the-art methods. On the 3D dense face alignment task, we achieve 3.80% NME on the AFLW2000-3D dataset, which outperforms the state-of-the-art method by 5.5%. Code is available at https://github.com/lhyfst/DSFNet.

* Accepted into CVPR'23

Via

Access Paper or Ask Questions

Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert

Mar 29, 2023

Jiadong Wang, Xinyuan Qian, Malu Zhang, Robby T. Tan, Haizhou Li

Figure 1 for Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert

Figure 2 for Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert

Figure 3 for Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert

Figure 4 for Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert

Abstract:Talking face generation, also known as speech-to-lip generation, reconstructs facial motions concerning lips given coherent speech input. The previous studies revealed the importance of lip-speech synchronization and visual quality. Despite much progress, they hardly focus on the content of lip movements i.e., the visual intelligibility of the spoken words, which is an important aspect of generation quality. To address the problem, we propose using a lip-reading expert to improve the intelligibility of the generated lip regions by penalizing the incorrect generation results. Moreover, to compensate for data scarcity, we train the lip-reading expert in an audio-visual self-supervised manner. With a lip-reading expert, we propose a novel contrastive learning to enhance lip-speech synchronization, and a transformer to encode audio synchronically with video, while considering global temporal dependency of audio. For evaluation, we propose a new strategy with two different lip-reading experts to measure intelligibility of the generated videos. Rigorous experiments show that our proposal is superior to other State-of-the-art (SOTA) methods, such as Wav2Lip, in reading intelligibility i.e., over 38% Word Error Rate (WER) on LRS2 dataset and 27.8% accuracy on LRW dataset. We also achieve the SOTA performance in lip-speech synchronization and comparable performances in visual quality.

* accepted by CVPR 2023

Via

Access Paper or Ask Questions

EqMotion: Equivariant Multi-agent Motion Prediction with Invariant Interaction Reasoning

Mar 27, 2023

Chenxin Xu, Robby T. Tan, Yuhong Tan, Siheng Chen, Yu Guang Wang, Xinchao Wang, Yanfeng Wang

Figure 1 for EqMotion: Equivariant Multi-agent Motion Prediction with Invariant Interaction Reasoning

Figure 2 for EqMotion: Equivariant Multi-agent Motion Prediction with Invariant Interaction Reasoning

Figure 3 for EqMotion: Equivariant Multi-agent Motion Prediction with Invariant Interaction Reasoning

Figure 4 for EqMotion: Equivariant Multi-agent Motion Prediction with Invariant Interaction Reasoning

Abstract:Learning to predict agent motions with relationship reasoning is important for many applications. In motion prediction tasks, maintaining motion equivariance under Euclidean geometric transformations and invariance of agent interaction is a critical and fundamental principle. However, such equivariance and invariance properties are overlooked by most existing methods. To fill this gap, we propose EqMotion, an efficient equivariant motion prediction model with invariant interaction reasoning. To achieve motion equivariance, we propose an equivariant geometric feature learning module to learn a Euclidean transformable feature through dedicated designs of equivariant operations. To reason agent's interactions, we propose an invariant interaction reasoning module to achieve a more stable interaction modeling. To further promote more comprehensive motion features, we propose an invariant pattern feature learning module to learn an invariant pattern feature, which cooperates with the equivariant geometric feature to enhance network expressiveness. We conduct experiments for the proposed model on four distinct scenarios: particle dynamics, molecule dynamics, human skeleton motion prediction and pedestrian trajectory prediction. Experimental results show that our method is not only generally applicable, but also achieves state-of-the-art prediction performances on all the four tasks, improving by 24.0/30.1/8.6/9.2%. Code is available at https://github.com/MediaBrain-SJTU/EqMotion.

* Accepted to CVPR 2023

Via

Access Paper or Ask Questions

2PCNet: Two-Phase Consistency Training for Day-to-Night Unsupervised Domain Adaptive Object Detection

Mar 24, 2023

Mikhail Kennerley, Jian-Gang Wang, Bharadwaj Veeravalli, Robby T. Tan

Figure 1 for 2PCNet: Two-Phase Consistency Training for Day-to-Night Unsupervised Domain Adaptive Object Detection

Figure 2 for 2PCNet: Two-Phase Consistency Training for Day-to-Night Unsupervised Domain Adaptive Object Detection

Figure 3 for 2PCNet: Two-Phase Consistency Training for Day-to-Night Unsupervised Domain Adaptive Object Detection

Figure 4 for 2PCNet: Two-Phase Consistency Training for Day-to-Night Unsupervised Domain Adaptive Object Detection

Abstract:Object detection at night is a challenging problem due to the absence of night image annotations. Despite several domain adaptation methods, achieving high-precision results remains an issue. False-positive error propagation is still observed in methods using the well-established student-teacher framework, particularly for small-scale and low-light objects. This paper proposes a two-phase consistency unsupervised domain adaptation network, 2PCNet, to address these issues. The network employs high-confidence bounding-box predictions from the teacher in the first phase and appends them to the student's region proposals for the teacher to re-evaluate in the second phase, resulting in a combination of high and low confidence pseudo-labels. The night images and pseudo-labels are scaled-down before being used as input to the student, providing stronger small-scale pseudo-labels. To address errors that arise from low-light regions and other night-related attributes in images, we propose a night-specific augmentation pipeline called NightAug. This pipeline involves applying random augmentations, such as glare, blur, and noise, to daytime images. Experiments on publicly available datasets demonstrate that our method achieves superior results to state-of-the-art methods by 20\%, and to supervised models trained directly on the target data.

* Accepted into CVPR'23

Via

Access Paper or Ask Questions