Wenxiong Kang

SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention

Mar 13, 2024

Feng Xiao, Hongbin Xu, Qiuxia Wu, Wenxiong Kang

Abstract: 3D visual grounding aims to automatically locate the 3D region of a specified object given the corresponding textual description. Existing works fail to distinguish similar objects, especially when multiple referred objects are involved in the description. Experiments show that direct matching of the language and visual modalities has limited capacity to comprehend complex referential relationships in utterances. This is mainly due to the interference caused by redundant visual information in cross-modal alignment. To strengthen relation-oriented mapping between the modalities, we propose SeCG, a semantic-enhanced relational learning model based on a graph network with our designed memory graph attention layer. Our method replaces the original language-independent encoding with cross-modal encoding in visual analysis. More text-related feature expressions are obtained through the guidance of global semantics and implicit relationships. Experimental results on the ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms existing state-of-the-art methods, particularly improving localization performance on multi-relation challenges.

StyleDyRF: Zero-shot 4D Style Transfer for Dynamic Neural Radiance Fields

Mar 13, 2024

Hongbin Xu, Weitao Chen, Feng Xiao, Baigui Sun, Wenxiong Kang

Abstract: 4D style transfer aims at transferring arbitrary visual style to the synthesized novel views of a dynamic 4D scene with varying viewpoints and times. Existing efforts on 3D style transfer can effectively combine the visual features of style images and neural radiance fields (NeRF) but, limited by the static-scene assumption, fail to handle dynamic 4D scenes. Consequently, we address the novel and challenging problem of 4D style transfer for the first time, which further requires the consistency of stylized results on dynamic objects. In this paper, we introduce StyleDyRF, a method that represents the 4D feature space by deforming a canonical feature volume and learns a linear style transformation matrix on the feature volume in a data-driven fashion. To obtain the canonical feature volume, the rays at each time step are deformed with the geometric prior of a pre-trained dynamic NeRF to render the feature map under the supervision of pre-trained visual encoders. With the content and style cues in the canonical feature volume and the style image, we can learn the style transformation matrix from their covariance matrices with lightweight neural networks. The learned style transformation matrix can reflect a direct matching of feature covariance from the content volume to the given style pattern, in analogy with the optimization of the Gram matrix in traditional 2D neural style transfer. The experimental results show that our method not only renders 4D photorealistic style transfer results in a zero-shot manner but also outperforms existing methods in terms of visual quality and consistency.

* In submission. The code and model are released at: https://github.com/ToughStoneX/StyleDyRF
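
The covariance-matching idea in the abstract can be illustrated with a closed-form stand-in. The sketch below is not the StyleDyRF method, which learns the matrix with lightweight networks; it assumes a Cholesky-based construction instead, and every function name and the toy data are hypothetical. It only shows that a single linear map T applied to content features can make their covariance equal a target style covariance.

```python
import math

def covariance(feats):
    """Covariance matrix of a list of feature vectors (rows are samples)."""
    n, d = len(feats), len(feats[0])
    mean = [sum(f[j] for f in feats) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for f in feats:
        for i in range(d):
            for j in range(d):
                cov[i][j] += (f[i] - mean[i]) * (f[j] - mean[j]) / n
    return cov

def cholesky(a):
    """Lower-triangular L with L @ L.T == a (a must be positive definite)."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def lower_inverse(low):
    """Inverse of a lower-triangular matrix by forward substitution."""
    d = len(low)
    inv = [[0.0] * d for _ in range(d)]
    for col in range(d):
        for i in range(d):
            rhs = 1.0 if i == col else 0.0
            s = sum(low[i][k] * inv[k][col] for k in range(i))
            inv[i][col] = (rhs - s) / low[i][i]
    return inv

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def style_transform(content_cov, style_cov):
    """Assumed closed form: T = L_style @ inv(L_content), so T C T^T = S."""
    return matmul(cholesky(style_cov), lower_inverse(cholesky(content_cov)))

def apply_transform(t, feats):
    return [[sum(t[i][j] * f[j] for j in range(len(f))) for i in range(len(t))]
            for f in feats]

# Toy 2-channel content features and a hypothetical target style covariance.
content = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
style_cov = [[2.0, 0.5], [0.5, 1.0]]
T = style_transform(covariance(content), style_cov)
stylized_cov = covariance(apply_transform(T, content))
```

After the transform, `stylized_cov` agrees with `style_cov` up to floating-point error, because the covariance of linearly mapped features is T C Tᵀ.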

Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation

Dec 18, 2023

Hui Fu, Zeqing Wang, Ke Gong, Keze Wang, Tianshui Chen, Haojie Li, Haifeng Zeng, Wenxiong Kang

Abstract: Speech-driven 3D facial animation aims to synthesize vivid facial animations that accurately synchronize with speech and match the unique speaking style. However, existing works primarily focus on achieving precise lip synchronization while neglecting to model the subject-specific speaking style, often resulting in unrealistic facial animations. To the best of our knowledge, this work makes the first attempt to explore the coupled information between the speaking style and the semantic content in facial motions. Specifically, we introduce an innovative speaking style disentanglement method, which enables arbitrary-subject speaking style encoding and leads to a more realistic synthesis of speech-driven facial animations. Subsequently, we propose a novel framework called Mimic to learn disentangled representations of the speaking style and content from facial motions by building two latent spaces for style and content, respectively. Moreover, to facilitate disentangled representation learning, we introduce four well-designed constraints: an auxiliary style classifier, an auxiliary inverse classifier, a content contrastive loss, and a pair of latent cycle losses, which can effectively contribute to the construction of the identity-related style space and semantic-related content space. Extensive qualitative and quantitative experiments conducted on three publicly available datasets demonstrate that our approach outperforms state-of-the-art methods and is capable of capturing diverse speaking styles for speech-driven 3D facial animation. The source code and supplementary video are publicly available at: https://zeqing-wang.github.io/Mimic/

* 7 pages, 6 figures, accepted by AAAI-24

A Novel Transfer Learning Method Utilizing Acoustic and Vibration Signals for Rotating Machinery Fault Diagnosis

Oct 20, 2023

Zhongliang Chen, Zhuofei Huang, Wenxiong Kang

Abstract: Fault diagnosis of rotating machinery plays an important role in the safety and stability of modern industrial systems. However, there is a distribution discrepancy between training data and data from real-world operating scenarios, which degrades the performance of existing systems. This paper proposes a transfer-learning-based method that uses acoustic and vibration signals to address this distribution discrepancy. We design MAVgram, an acoustic and vibration feature fusion, to offer richer and more reliable fault information, and pair it with a DNN-based classifier to obtain a more effective diagnosis representation. The backbone is pre-trained and then fine-tuned to obtain excellent performance on the target task. Experimental results demonstrate the effectiveness of the proposed method, which achieves improved performance compared to STgram-MFN.

CostFormer: Cost Transformer for Cost Aggregation in Multi-view Stereo

May 17, 2023

Weitao Chen, Hongbin Xu, Zhipeng Zhou, Yang Liu, Baigui Sun, Wenxiong Kang, Xuansong Xie

Abstract: The core of Multi-view Stereo (MVS) is the matching process among reference and source pixels. Cost aggregation plays a significant role in this process, and previous methods focus on handling it via CNNs. These methods may inherit the natural limitation of CNNs, which fail to discriminate repetitive or incorrect matches due to limited local receptive fields. To handle this issue, we aim to introduce Transformers into cost aggregation. However, another problem may arise from the quadratically growing computational complexity of Transformers, resulting in memory overflow and inference latency. In this paper, we overcome these limits with an efficient Transformer-based cost aggregation network, namely CostFormer. The Residual Depth-Aware Cost Transformer (RDACT) is proposed to aggregate long-range features on the cost volume via self-attention mechanisms along the depth and spatial dimensions. Furthermore, the Residual Regression Transformer (RRT) is proposed to enhance spatial attention. The proposed method is a universal plug-in to improve learning-based MVS methods.

* Accepted by IJCAI-23
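
Self-attention along the depth dimension of a cost volume can be sketched for a single pixel as follows. This is a minimal illustration, not the RDACT module: it assumes identity query/key/value projections and a plain residual addition, and the function names and toy values are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def depth_self_attention(tokens):
    """Scaled dot-product self-attention over depth hypotheses at one pixel.

    tokens: list of D feature vectors (one per depth hypothesis), each of
    length C. Projections are taken as identity to keep the sketch short.
    """
    scale = math.sqrt(len(tokens[0]))
    out = []
    for q in tokens:
        # Similarity of this hypothesis to every hypothesis along depth.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in tokens]
        weights = softmax(scores)
        attended = [sum(w * v[c] for w, v in zip(weights, tokens))
                    for c in range(len(q))]
        # Residual connection: add the attended features onto the input.
        out.append([a + x for a, x in zip(attended, q)])
    return out

# Toy cost-volume slice: D = 4 depth hypotheses, C = 2 channels, one pixel.
pixel_tokens = [[0.1, 0.9], [0.8, 0.2], [0.7, 0.3], [0.0, 1.0]]
aggregated = depth_self_attention(pixel_tokens)
```

Restricting attention to the D hypotheses at one pixel needs on the order of D² score evaluations per pixel, whereas attention over every token of an H×W×D volume grows with (H·W·D)². Whether CostFormer factorizes attention in exactly this way is not stated in the abstract.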

Semi-supervised Deep Multi-view Stereo

Jul 24, 2022

Hongbin Xu, Zhipeng Zhou, Weitao Cheng, Baigui Sun, Hao Li, Wenxiong Kang

Abstract: Significant progress has been made in learning-based Multi-view Stereo (MVS) in both supervised and unsupervised settings. To combine their respective merits in accuracy and completeness while reducing the demand for expensive labeled data, this paper explores a novel semi-supervised setting of the learning-based MVS problem in which only a tiny part of the MVS data comes with dense depth ground truth. However, due to the huge variation of scenarios and the flexible setting of views, the semi-supervised MVS problem (Semi-MVS) may break the basic assumption in classic semi-supervised learning that unlabeled data and labeled data share the same label space and data distribution. To handle these issues, we propose a novel semi-supervised MVS framework, namely SE-MVS. For the simple case in which the basic assumption holds in MVS data, consistency regularization encourages the model predictions to be consistent between the original sample and a randomly augmented sample via constraints on KL divergence. For the more troublesome case in which the basic assumption is violated in MVS data, we propose a novel style consistency loss to alleviate the negative effect caused by the distribution gap. The visual style of the unlabeled sample is transferred to the labeled sample to shrink the gap, and the model prediction for the generated sample is further supervised with the label of the original labeled sample. The experimental results on the DTU, BlendedMVS, GTA-SFM, and Tanks&Temples datasets show the superior performance of the proposed method. With the same backbone network settings, our proposed SE-MVS outperforms its fully-supervised and unsupervised baselines.

* Draft version. Still in submission
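
The KL-based consistency regularization described in the abstract can be sketched as below. This is an illustrative assumption rather than the SE-MVS loss: it treats each pixel's prediction as a softmax distribution over depth hypotheses and uses the direction KL(original || augmented); the function names and toy logits are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(logits_orig, logits_aug):
    """Mean per-pixel KL between predictions on original and augmented input."""
    losses = [kl_divergence(softmax(lo), softmax(la))
              for lo, la in zip(logits_orig, logits_aug)]
    return sum(losses) / len(losses)

# Two pixels, three depth hypotheses each (toy logits).
orig = [[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]]
aug = [[1.5, 0.7, -0.8], [0.1, 0.9, 0.0]]
loss = consistency_loss(orig, aug)    # positive: the predictions disagree
same = consistency_loss(orig, orig)   # zero: the predictions are identical
```

The loss vanishes only when the two branches predict the same distribution at every pixel, which is the behavior a consistency term is meant to encourage.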

From Noise to Feature: Exploiting Intensity Distribution as a Novel Soft Biometric Trait for Finger Vein Recognition

Dec 15, 2021

Wenxiong Kang, Yuting Lu, Dejian Li, Wei Jia

Abstract: Most finger vein feature extraction algorithms achieve satisfactory performance due to their texture representation abilities, despite simultaneously ignoring the intensity distribution that is formed by the finger tissue, and in some cases, processing it as background noise. In this paper, we exploit this kind of noise as a novel soft biometric trait for achieving better finger vein recognition performance. First, a detailed analysis of the finger vein imaging principle and the characteristics of the image are presented to show that the intensity distribution that is formed by the finger tissue in the background can be extracted as a soft biometric trait for recognition. Then, two finger vein background layer extraction algorithms and three soft biometric trait extraction algorithms are proposed for intensity distribution feature extraction. Finally, a hybrid matching strategy is proposed to solve the issue of dimension difference between the primary and soft biometric traits on the score level. A series of rigorous contrast experiments on three open-access databases demonstrates that our proposed method is feasible and effective for finger vein recognition.

* IEEE Transactions on Information Forensics and Security 14.4 (2018): 858-869
* 11 pages

Unconstrained Face Sketch Synthesis via Perception-Adaptive Network and A New Benchmark

Dec 02, 2021

Lin Nie, Lingbo Liu, Zhengtao Wu, Wenxiong Kang

Abstract: Face sketch generation has attracted much attention in the field of visual computing. However, existing methods are either limited to constrained conditions or rely heavily on various preprocessing steps to deal with in-the-wild cases. In this paper, we argue that accurately perceiving the facial region and facial components is crucial for unconstrained sketch synthesis. To this end, we propose a novel Perception-Adaptive Network (PANet), which can generate high-quality face sketches under unconstrained conditions in an end-to-end scheme. Specifically, our PANet is composed of i) a Fully Convolutional Encoder for hierarchical feature extraction, ii) a Face-Adaptive Perceiving Decoder for extracting the potential facial region and handling face variations, and iii) a Component-Adaptive Perceiving Module for facial-component-aware feature representation learning. To facilitate further research on unconstrained face sketch synthesis, we introduce a new benchmark termed WildSketch, which contains 800 pairs of face photos and sketches with large variations in pose, expression, ethnic origin, background, and illumination. Extensive experiments demonstrate that the proposed method is capable of achieving state-of-the-art performance under both constrained and unconstrained conditions. Our source code and the WildSketch benchmark are released on the project page http://lingboliu.com/unconstrained_face_sketch.html.

* We proposed the first medium-scale benchmark for unconstrained face sketch synthesis

Digging into Uncertainty in Self-supervised Multi-view Stereo

Sep 08, 2021

Hongbin Xu, Zhipeng Zhou, Yali Wang, Wenxiong Kang, Baigui Sun, Hao Li, Yu Qiao

Abstract: Self-supervised Multi-view Stereo (MVS) with a pretext task of image reconstruction has achieved significant progress recently. However, previous methods are built upon intuitions, lacking comprehensive explanations about the effectiveness of the pretext task in self-supervised MVS. To this end, we propose to estimate epistemic uncertainty in self-supervised MVS, accounting for what the model ignores. Specifically, the limitations can be categorized into two types: ambiguous supervision in the foreground and invalid supervision in the background. To address these issues, we propose a novel Uncertainty reduction Multi-view Stereo (U-MVS) framework for self-supervised learning. To alleviate ambiguous supervision in the foreground, we introduce an extra correspondence prior with a flow-depth consistency loss. The dense 2D correspondence of optical flows is used to regularize the 3D stereo correspondence in MVS. To handle the invalid supervision in the background, we use Monte-Carlo Dropout to acquire the uncertainty map and further filter the unreliable supervision signals on invalid regions. Extensive experiments on the DTU and Tanks&Temples benchmarks show that our U-MVS framework achieves the best performance among unsupervised MVS methods, with performance competitive with its supervised counterparts.

* This paper is accepted by ICCV-21 as a poster presentation
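
The Monte-Carlo Dropout step mentioned in the abstract can be illustrated with a toy example. The sketch below is not the U-MVS pipeline: `toy_depth_model` is a hypothetical stand-in for a network with dropout left active at inference, and the variance threshold is an assumed parameter. It runs several stochastic passes, takes the per-pixel variance as the uncertainty map, and masks out high-variance pixels.

```python
import random

def toy_depth_model(pixel_feats, rng, drop_prob=0.5):
    """Stand-in 'network': depth is a sum of features with dropout kept on."""
    keep = 1.0 - drop_prob
    preds = []
    for feats in pixel_feats:
        # Inverted dropout: zero each feature with prob drop_prob, rescale the rest.
        kept = [f / keep if rng.random() < keep else 0.0 for f in feats]
        preds.append(sum(kept))
    return preds

def mc_dropout_uncertainty(pixel_feats, passes=50, seed=0):
    """Per-pixel mean and variance over repeated stochastic forward passes."""
    rng = random.Random(seed)
    samples = [toy_depth_model(pixel_feats, rng) for _ in range(passes)]
    n_pix = len(pixel_feats)
    mean = [sum(s[i] for s in samples) / passes for i in range(n_pix)]
    var = [sum((s[i] - mean[i]) ** 2 for s in samples) / passes
           for i in range(n_pix)]
    return mean, var

def reliable_mask(var, threshold):
    """True where the uncertainty is low enough to keep the supervision signal."""
    return [v <= threshold for v in var]

# Three toy pixels: the first is stable under dropout, the last is not.
feats = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [5.0, -5.0, 5.0]]
mean_depth, uncertainty = mc_dropout_uncertainty(feats)
mask = reliable_mask(uncertainty, threshold=1.0)
```

In a training loop, a loss term would be multiplied by such a mask so that pixels with unstable predictions contribute no supervision.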

Self-supervised Multi-view Stereo via Effective Co-Segmentation and Data-Augmentation

Apr 12, 2021

Hongbin Xu, Zhipeng Zhou, Yu Qiao, Wenxiong Kang, Qiuxia Wu

Abstract: Recent studies have shown that self-supervised methods based on view synthesis make clear progress on multi-view stereo (MVS). However, existing methods rely on the assumption that corresponding points among different views share the same color, which may not always be true in practice. This may lead to unreliable self-supervised signals and harm the final reconstruction performance. To address the issue, we propose a framework integrated with more reliable supervision guided by semantic co-segmentation and data augmentation. Specifically, we mine mutual semantics from multi-view images to guide semantic consistency. We also devise an effective data-augmentation mechanism that ensures transformation robustness by treating the prediction for regular samples as pseudo ground truth to regularize the prediction for augmented samples. Experimental results on the DTU dataset show that our proposed methods achieve state-of-the-art performance among unsupervised methods and are even on par with supervised methods. Furthermore, extensive experiments on the Tanks&Temples dataset demonstrate the effective generalization ability of the proposed method.

* This paper is accepted by AAAI-21 with a Distinguished Paper Award
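
The data-augmentation mechanism in the abstract, where the regular-sample prediction serves as pseudo ground truth for the augmented-sample prediction, can be sketched as follows. The L1 distance, the validity mask, and all names here are assumptions for illustration; the abstract does not specify the exact loss.

```python
def augmentation_consistency_loss(pred_regular, pred_augmented, valid_mask):
    """Mean absolute difference between augmented-branch depth and pseudo ground truth.

    pred_regular is treated as a fixed target (in a real training loop no
    gradient would flow through it); valid_mask marks pixels that remain
    comparable after the augmentation.
    """
    diffs = [abs(a - r)
             for r, a, m in zip(pred_regular, pred_augmented, valid_mask) if m]
    return sum(diffs) / len(diffs) if diffs else 0.0

# Toy per-pixel depths from the regular and augmented branches.
regular = [1.0, 2.0, 3.0, 4.0]
augmented = [1.1, 1.8, 9.0, 4.0]
mask = [True, True, False, True]  # third pixel assumed occluded by the augmentation
loss = augmentation_consistency_loss(regular, augmented, mask)
```

Masked-out pixels are excluded from the average, so a large disagreement at an occluded pixel does not penalize the model.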
