Abstract:Causal Disentangled Representation Learning(CDRL) aims to learn and disentangle low dimensional representations and their underlying causal structure from observations. However, existing disentanglement methods rely on a standard mean-field approximation with a diagonal posterior covariance, which decorrelates all latent dimensions. Additionally, these methods often assume isotropic Gaussian priors for exogenous noise, failing to capture the complex, non-Gaussian statistical properties prevalent in real-world causal factors. Therefore, we propose FlexCausal, a novel CDRL framework based on a block-diagonal covariance VAE. FlexCausal utilizes a Factorized Flow-based Prior to realistically model the complex densities of exogenous noise, effectively decoupling the learning of causal mechanisms from distributional statistics. By integrating supervised alignment objectives with counterfactual consistency constraints, our framework ensures a precise structural correspondence between the learned latent subspaces and the ground-truth causal relations. Finally, we introduce a manifold-aware relative intervention strategy to ensure high-fidelity generation. Experimental results on both synthetic and real-world datasets demonstrate that FlexCausal significantly outperforms other methods.
Abstract:The application of video captioning models aims at translating the content of videos by using accurate natural language. Due to the complex nature inbetween object interaction in the video, the comprehensive understanding of spatio-temporal relations of objects remains a challenging task. Existing methods often fail in generating sufficient feature representations of video content. In this paper, we propose a video captioning model based on dual graphs and gated fusion: we adapt two types of graphs to generate feature representations of video content and utilize gated fusion to further understand these different levels of information. Using a dual-graphs model to generate appearance features and motion features respectively can utilize the content correlation in frames to generate various features from multiple perspectives. Among them, dual-graphs reasoning can enhance the content correlation in frame sequences to generate advanced semantic features; The gated fusion, on the other hand, aggregates the information in multiple feature representations for comprehensive video content understanding. The experiments conducted on worldly used datasets MSVD and MSR-VTT demonstrate state-of-the-art performance of our proposed approach.