Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yuchao Dai

Multimodal Variational Auto-encoder based Audio-Visual Segmentation

Oct 12, 2023

Yuxin Mao, Jing Zhang, Mochu Xiang, Yiran Zhong, Yuchao Dai

Figure 1 for Multimodal Variational Auto-encoder based Audio-Visual Segmentation

Figure 2 for Multimodal Variational Auto-encoder based Audio-Visual Segmentation

Figure 3 for Multimodal Variational Auto-encoder based Audio-Visual Segmentation

Figure 4 for Multimodal Variational Auto-encoder based Audio-Visual Segmentation

Abstract:We propose an Explicit Conditional Multimodal Variational Auto-Encoder (ECMVAE) for audio-visual segmentation (AVS), aiming to segment sound sources in the video sequence. Existing AVS methods focus on implicit feature fusion strategies, where models are trained to fit the discrete samples in the dataset. With a limited and less diverse dataset, the resulting performance is usually unsatisfactory. In contrast, we address this problem from an effective representation learning perspective, aiming to model the contribution of each modality explicitly. Specifically, we find that audio contains critical category information of the sound producers, and visual data provides candidate sound producer(s). Their shared information corresponds to the target sound producer(s) shown in the visual data. In this case, cross-modal shared representation learning is especially important for AVS. To achieve this, our ECMVAE factorizes the representations of each modality with a modality-shared representation and a modality-specific representation. An orthogonality constraint is applied between the shared and specific representations to maintain the exclusive attribute of the factorized latent code. Further, a mutual information maximization regularizer is introduced to achieve extensive exploration of each modality. Quantitative and qualitative evaluations on the AVSBench demonstrate the effectiveness of our approach, leading to a new state-of-the-art for AVS, with a 3.84 mIOU performance leap on the challenging MS3 subset for multiple sound source segmentation.

* Accepted by ICCV2023,Project page(https://npucvr.github.io/MMVAE-AVS),Code(https://github.com/OpenNLPLab/MMVAE-AVS)

Via

Access Paper or Ask Questions

Forward Flow for Novel View Synthesis of Dynamic Scenes

Sep 29, 2023

Xiang Guo, Jiadai Sun, Yuchao Dai, Guanying Chen, Xiaoqing Ye, Xiao Tan, Errui Ding, Yumeng Zhang, Jingdong Wang

Figure 1 for Forward Flow for Novel View Synthesis of Dynamic Scenes

Figure 2 for Forward Flow for Novel View Synthesis of Dynamic Scenes

Figure 3 for Forward Flow for Novel View Synthesis of Dynamic Scenes

Figure 4 for Forward Flow for Novel View Synthesis of Dynamic Scenes

Abstract:This paper proposes a neural radiance field (NeRF) approach for novel view synthesis of dynamic scenes using forward warping. Existing methods often adopt a static NeRF to represent the canonical space, and render dynamic images at other time steps by mapping the sampled 3D points back to the canonical space with the learned backward flow field. However, this backward flow field is non-smooth and discontinuous, which is difficult to be fitted by commonly used smooth motion models. To address this problem, we propose to estimate the forward flow field and directly warp the canonical radiance field to other time steps. Such forward flow field is smooth and continuous within the object region, which benefits the motion model learning. To achieve this goal, we represent the canonical radiance field with voxel grids to enable efficient forward warping, and propose a differentiable warping process, including an average splatting operation and an inpaint network, to resolve the many-to-one and one-to-many mapping issues. Thorough experiments show that our method outperforms existing methods in both novel view rendering and motion modeling, demonstrating the effectiveness of our forward flow motion modeling. Project page: https://npucvr.github.io/ForwardFlowDNeRF

* ICCV2023
* Accepted by ICCV2023 as oral. Project page: https://npucvr.github.io/ForwardFlowDNeRF

Via

Access Paper or Ask Questions

RPEFlow: Multimodal Fusion of RGB-PointCloud-Event for Joint Optical Flow and Scene Flow Estimation

Sep 26, 2023

Zhexiong Wan, Yuxin Mao, Jing Zhang, Yuchao Dai

Figure 1 for RPEFlow: Multimodal Fusion of RGB-PointCloud-Event for Joint Optical Flow and Scene Flow Estimation

Figure 2 for RPEFlow: Multimodal Fusion of RGB-PointCloud-Event for Joint Optical Flow and Scene Flow Estimation

Figure 3 for RPEFlow: Multimodal Fusion of RGB-PointCloud-Event for Joint Optical Flow and Scene Flow Estimation

Figure 4 for RPEFlow: Multimodal Fusion of RGB-PointCloud-Event for Joint Optical Flow and Scene Flow Estimation

Abstract:Recently, the RGB images and point clouds fusion methods have been proposed to jointly estimate 2D optical flow and 3D scene flow. However, as both conventional RGB cameras and LiDAR sensors adopt a frame-based data acquisition mechanism, their performance is limited by the fixed low sampling rates, especially in highly-dynamic scenes. By contrast, the event camera can asynchronously capture the intensity changes with a very high temporal resolution, providing complementary dynamic information of the observed scenes. In this paper, we incorporate RGB images, Point clouds and Events for joint optical flow and scene flow estimation with our proposed multi-stage multimodal fusion model, RPEFlow. First, we present an attention fusion module with a cross-attention mechanism to implicitly explore the internal cross-modal correlation for 2D and 3D branches, respectively. Second, we introduce a mutual information regularization term to explicitly model the complementary information of three modalities for effective multimodal feature learning. We also contribute a new synthetic dataset to advocate further research. Experiments on both synthetic and real datasets show that our model outperforms the existing state-of-the-art by a wide margin. Code and dataset is available at https://npucvr.github.io/RPEFlow.

* ICCV 2023. Project page: https://npucvr.github.io/RPEFlow Code: https://github.com/danqu130/RPEFlow

Via

Access Paper or Ask Questions

Decomposed Guided Dynamic Filters for Efficient RGB-Guided Depth Completion

Sep 05, 2023

Yufei Wang, Yuxin Mao, Qi Liu, Yuchao Dai

Figure 1 for Decomposed Guided Dynamic Filters for Efficient RGB-Guided Depth Completion

Figure 2 for Decomposed Guided Dynamic Filters for Efficient RGB-Guided Depth Completion

Figure 3 for Decomposed Guided Dynamic Filters for Efficient RGB-Guided Depth Completion

Figure 4 for Decomposed Guided Dynamic Filters for Efficient RGB-Guided Depth Completion

Abstract:RGB-guided depth completion aims at predicting dense depth maps from sparse depth measurements and corresponding RGB images, where how to effectively and efficiently exploit the multi-modal information is a key issue. Guided dynamic filters, which generate spatially-variant depth-wise separable convolutional filters from RGB features to guide depth features, have been proven to be effective in this task. However, the dynamically generated filters require massive model parameters, computational costs and memory footprints when the number of feature channels is large. In this paper, we propose to decompose the guided dynamic filters into a spatially-shared component multiplied by content-adaptive adaptors at each spatial location. Based on the proposed idea, we introduce two decomposition schemes A and B, which decompose the filters by splitting the filter structure and using spatial-wise attention, respectively. The decomposed filters not only maintain the favorable properties of guided dynamic filters as being content-dependent and spatially-variant, but also reduce model parameters and hardware costs, as the learned adaptors are decoupled with the number of feature channels. Extensive experimental results demonstrate that the methods using our schemes outperform state-of-the-art methods on the KITTI dataset, and rank 1st and 2nd on the KITTI benchmark at the time of submission. Meanwhile, they also achieve comparable performance on the NYUv2 dataset. In addition, our proposed methods are general and could be employed as plug-and-play feature fusion blocks in other multi-modal fusion tasks such as RGB-D salient object detection.

Via

Access Paper or Ask Questions

Transferable Attack for Semantic Segmentation

Aug 21, 2023

Mengqi He, Jing Zhang, Zhaoyuan Yang, Mingyi He, Nick Barnes, Yuchao Dai

Figure 1 for Transferable Attack for Semantic Segmentation

Figure 2 for Transferable Attack for Semantic Segmentation

Figure 3 for Transferable Attack for Semantic Segmentation

Figure 4 for Transferable Attack for Semantic Segmentation

Abstract:We analysis performance of semantic segmentation models wrt. adversarial attacks, and observe that the adversarial examples generated from a source model fail to attack the target models. i.e The conventional attack methods, such as PGD and FGSM, do not transfer well to target models, making it necessary to study the transferable attacks, especially transferable attacks for semantic segmentation. We find two main factors to achieve transferable attack. Firstly, the attack should come with effective data augmentation and translation-invariant features to deal with unseen models. Secondly, stabilized optimization strategies are needed to find the optimal attack direction. Based on the above observations, we propose an ensemble attack for semantic segmentation to achieve more effective attacks with higher transferability. The source code and experimental results are publicly available via our project page: https://github.com/anucvers/TASS.

* Source code is available at: https://github.com/anucvers/TASS

Via

Access Paper or Ask Questions

Unsupervised 3D Pose Estimation with Non-Rigid Structure-from-Motion Modeling

Aug 18, 2023

Haorui Ji, Hui Deng, Yuchao Dai, Hongdong Li

Abstract:Most of the previous 3D human pose estimation work relied on the powerful memory capability of the network to obtain suitable 2D-3D mappings from the training data. Few works have studied the modeling of human posture deformation in motion. In this paper, we propose a new modeling method for human pose deformations and design an accompanying diffusion-based motion prior. Inspired by the field of non-rigid structure-from-motion, we divide the task of reconstructing 3D human skeletons in motion into the estimation of a 3D reference skeleton, and a frame-by-frame skeleton deformation. A mixed spatial-temporal NRSfMformer is used to simultaneously estimate the 3D reference skeleton and the skeleton deformation of each frame from 2D observations sequence, and then sum them to obtain the pose of each frame. Subsequently, a loss term based on the diffusion model is used to ensure that the pipeline learns the correct prior motion knowledge. Finally, we have evaluated our proposed method on mainstream datasets and obtained superior results outperforming the state-of-the-art.

Via

Access Paper or Ask Questions

Improving Audio-Visual Segmentation with Bidirectional Generation

Aug 16, 2023

Dawei Hao, Yuxin Mao, Bowen He, Xiaodong Han, Yuchao Dai, Yiran Zhong

Abstract:The aim of audio-visual segmentation (AVS) is to precisely differentiate audible objects within videos down to the pixel level. Traditional approaches often tackle this challenge by combining information from various modalities, where the contribution of each modality is implicitly or explicitly modeled. Nevertheless, the interconnections between different modalities tend to be overlooked in audio-visual modeling. In this paper, inspired by the human ability to mentally simulate the sound of an object and its visual appearance, we introduce a bidirectional generation framework. This framework establishes robust correlations between an object's visual characteristics and its associated sound, thereby enhancing the performance of AVS. To achieve this, we employ a visual-to-audio projection component that reconstructs audio features from object segmentation masks and minimizes reconstruction errors. Moreover, recognizing that many sounds are linked to object movements, we introduce an implicit volumetric motion estimation module to handle temporal dynamics that may be challenging to capture using conventional optical flow methods. To showcase the effectiveness of our approach, we conduct comprehensive experiments and analyses on the widely recognized AVSBench benchmark. As a result, we establish a new state-of-the-art performance level in the AVS benchmark, particularly excelling in the challenging MS3 subset which involves segmenting multiple sound sources. To facilitate reproducibility, we plan to release both the source code and the pre-trained model.

* Dawei Hao and Yuxin Mao contribute equality to this paper. Yiran Zhong is the corresponding author. The code will be released at https://github.com/OpenNLPLab/AVS-bidirectional

Via

Access Paper or Ask Questions

Digging into Depth Priors for Outdoor Neural Radiance Fields

Aug 08, 2023

Chen Wang, Jiadai Sun, Lina Liu, Chenming Wu, Zhelun Shen, Dayan Wu, Yuchao Dai, Liangjun Zhang

Abstract:Neural Radiance Fields (NeRF) have demonstrated impressive performance in vision and graphics tasks, such as novel view synthesis and immersive reality. However, the shape-radiance ambiguity of radiance fields remains a challenge, especially in the sparse viewpoints setting. Recent work resorts to integrating depth priors into outdoor NeRF training to alleviate the issue. However, the criteria for selecting depth priors and the relative merits of different priors have not been thoroughly investigated. Moreover, the relative merits of selecting different approaches to use the depth priors is also an unexplored problem. In this paper, we provide a comprehensive study and evaluation of employing depth priors to outdoor neural radiance fields, covering common depth sensing technologies and most application ways. Specifically, we conduct extensive experiments with two representative NeRF methods equipped with four commonly-used depth priors and different depth usages on two widely used outdoor datasets. Our experimental results reveal several interesting findings that can potentially benefit practitioners and researchers in training their NeRF models with depth priors. Project Page: https://cwchenwang.github.io/outdoor-nerf-depth

* Accepted to ACM MM 2023. Project Page: https://cwchenwang.github.io/outdoor-nerf-depth

Via

Access Paper or Ask Questions

Contrastive Conditional Latent Diffusion for Audio-visual Segmentation

Jul 31, 2023

Yuxin Mao, Jing Zhang, Mochu Xiang, Yunqiu Lv, Yiran Zhong, Yuchao Dai

Figure 1 for Contrastive Conditional Latent Diffusion for Audio-visual Segmentation

Figure 2 for Contrastive Conditional Latent Diffusion for Audio-visual Segmentation

Figure 3 for Contrastive Conditional Latent Diffusion for Audio-visual Segmentation

Figure 4 for Contrastive Conditional Latent Diffusion for Audio-visual Segmentation

Abstract:We propose a latent diffusion model with contrastive learning for audio-visual segmentation (AVS) to extensively explore the contribution of audio. We interpret AVS as a conditional generation task, where audio is defined as the conditional variable for sound producer(s) segmentation. With our new interpretation, it is especially necessary to model the correlation between audio and the final segmentation map to ensure its contribution. We introduce a latent diffusion model to our framework to achieve semantic-correlated representation learning. Specifically, our diffusion model learns the conditional generation process of the ground-truth segmentation map, leading to ground-truth aware inference when we perform the denoising process at the test stage. As a conditional diffusion model, we argue it is essential to ensure that the conditional variable contributes to model output. We then introduce contrastive learning to our framework to learn audio-visual correspondence, which is proven consistent with maximizing the mutual information between model prediction and the audio data. In this way, our latent diffusion model via contrastive learning explicitly maximizes the contribution of audio for AVS. Experimental results on the benchmark dataset verify the effectiveness of our solution. Code and results are online via our project page: https://github.com/OpenNLPLab/DiffusionAVS.

Via

Access Paper or Ask Questions

Digging Into Uncertainty-based Pseudo-label for Robust Stereo Matching

Jul 31, 2023

Zhelun Shen, Xibin Song, Yuchao Dai, Dingfu Zhou, Zhibo Rao, Liangjun Zhang

Figure 1 for Digging Into Uncertainty-based Pseudo-label for Robust Stereo Matching

Figure 2 for Digging Into Uncertainty-based Pseudo-label for Robust Stereo Matching

Figure 3 for Digging Into Uncertainty-based Pseudo-label for Robust Stereo Matching

Figure 4 for Digging Into Uncertainty-based Pseudo-label for Robust Stereo Matching

Abstract:Due to the domain differences and unbalanced disparity distribution across multiple datasets, current stereo matching approaches are commonly limited to a specific dataset and generalize poorly to others. Such domain shift issue is usually addressed by substantial adaptation on costly target-domain ground-truth data, which cannot be easily obtained in practical settings. In this paper, we propose to dig into uncertainty estimation for robust stereo matching. Specifically, to balance the disparity distribution, we employ a pixel-level uncertainty estimation to adaptively adjust the next stage disparity searching space, in this way driving the network progressively prune out the space of unlikely correspondences. Then, to solve the limited ground truth data, an uncertainty-based pseudo-label is proposed to adapt the pre-trained model to the new domain, where pixel-level and area-level uncertainty estimation are proposed to filter out the high-uncertainty pixels of predicted disparity maps and generate sparse while reliable pseudo-labels to align the domain gap. Experimentally, our method shows strong cross-domain, adapt, and joint generalization and obtains \textbf{1st} place on the stereo task of Robust Vision Challenge 2020. Additionally, our uncertainty-based pseudo-labels can be extended to train monocular depth estimation networks in an unsupervised way and even achieves comparable performance with the supervised methods. The code will be available at https://github.com/gallenszl/UCFNet.

* Accepted by TPAMI

Via

Access Paper or Ask Questions