Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jun Xiao

Geometric Distortion Guided Transformer for Omnidirectional Image Super-Resolution

Jun 16, 2024

Cuixin Yang, Rongkang Dong, Jun Xiao, Cong Zhang, Kin-Man Lam, Fei Zhou, Guoping Qiu

Figure 1 for Geometric Distortion Guided Transformer for Omnidirectional Image Super-Resolution

Figure 2 for Geometric Distortion Guided Transformer for Omnidirectional Image Super-Resolution

Figure 3 for Geometric Distortion Guided Transformer for Omnidirectional Image Super-Resolution

Figure 4 for Geometric Distortion Guided Transformer for Omnidirectional Image Super-Resolution

Abstract:As virtual and augmented reality applications gain popularity, omnidirectional image (ODI) super-resolution has become increasingly important. Unlike 2D plain images that are formed on a plane, ODIs are projected onto spherical surfaces. Applying established image super-resolution methods to ODIs, therefore, requires performing equirectangular projection (ERP) to map the ODIs onto a plane. ODI super-resolution needs to take into account geometric distortion resulting from ERP. However, without considering such geometric distortion of ERP images, previous deep-learning-based methods only utilize a limited range of pixels and may easily miss self-similar textures for reconstruction. In this paper, we introduce a novel Geometric Distortion Guided Transformer for Omnidirectional image Super-Resolution (GDGT-OSR). Specifically, a distortion modulated rectangle-window self-attention mechanism, integrated with deformable self-attention, is proposed to better perceive the distortion and thus involve more self-similar textures. Distortion modulation is achieved through a newly devised distortion guidance generator that produces guidance by exploiting the variability of distortion across latitudes. Furthermore, we propose a dynamic feature aggregation scheme to adaptively fuse the features from different self-attention modules. We present extensive experimental results on public datasets and show that the new GDGT-OSR outperforms methods in existing literature.

* 13 pages, 12 figures, journal

Via

Access Paper or Ask Questions

ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models

Jun 16, 2024

Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao

Figure 1 for ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models

Figure 2 for ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models

Figure 3 for ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models

Figure 4 for ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models

Abstract:With the advance of diffusion models, today's video generation has achieved impressive quality. But generating temporal consistent long videos is still challenging. A majority of video diffusion models (VDMs) generate long videos in an autoregressive manner, i.e., generating subsequent clips conditioned on last frames of previous clip. However, existing approaches all involve bidirectional computations, which restricts the receptive context of each autoregression step, and results in the model lacking long-term dependencies. Inspired from the huge success of large language models (LLMs) and following GPT (generative pre-trained transformer), we bring causal (i.e., unidirectional) generation into VDMs, and use past frames as prompt to generate future frames. For Causal Generation, we introduce causal temporal attention into VDM, which forces each generated frame to depend on its previous frames. For Frame as Prompt, we inject the conditional frames by concatenating them with noisy frames (frames to be generated) along the temporal axis. Consequently, we present Video Diffusion GPT (ViD-GPT). Based on the two key designs, in each autoregression step, it is able to acquire long-term context from prompting frames concatenated by all previously generated frames. Additionally, we bring the kv-cache mechanism to VDMs, which eliminates the redundant computation from overlapped frames, significantly boosting the inference speed. Extensive experiments demonstrate that our ViD-GPT achieves state-of-the-art performance both quantitatively and qualitatively on long video generation. Code will be available at https://github.com/Dawn-LX/Causal-VideoGen.

* Code will be available at https://github.com/Dawn-LX/Causal-VideoGen

Via

Access Paper or Ask Questions

$\text{Di}^2\text{Pose}$: Discrete Diffusion Model for Occluded 3D Human Pose Estimation

May 27, 2024

Weiquan Wang, Jun Xiao, Chunping Wang, Wei Liu, Zhao Wang, Long Chen

$Figure 1 for $\text{Di}^2\text{Pose}$: Discrete Diffusion Model for Occluded 3D Human Pose Estimation$

$Figure 2 for $\text{Di}^2\text{Pose}$: Discrete Diffusion Model for Occluded 3D Human Pose Estimation$

$Figure 3 for $\text{Di}^2\text{Pose}$: Discrete Diffusion Model for Occluded 3D Human Pose Estimation$

$Figure 4 for $\text{Di}^2\text{Pose}$: Discrete Diffusion Model for Occluded 3D Human Pose Estimation$

Abstract:Continuous diffusion models have demonstrated their effectiveness in addressing the inherent uncertainty and indeterminacy in monocular 3D human pose estimation (HPE). Despite their strengths, the need for large search spaces and the corresponding demand for substantial training data make these models prone to generating biomechanically unrealistic poses. This challenge is particularly noticeable in occlusion scenarios, where the complexity of inferring 3D structures from 2D images intensifies. In response to these limitations, we introduce the Discrete Diffusion Pose ($\text{Di}^2\text{Pose}$), a novel framework designed for occluded 3D HPE that capitalizes on the benefits of a discrete diffusion model. Specifically, $\text{Di}^2\text{Pose}$ employs a two-stage process: it first converts 3D poses into a discrete representation through a \emph{pose quantization step}, which is subsequently modeled in latent space through a \emph{discrete diffusion process}. This methodological innovation restrictively confines the search space towards physically viable configurations and enhances the model's capability to comprehend how occlusions affect human pose within the latent space. Extensive evaluations conducted on various benchmarks (e.g., Human3.6M, 3DPW, and 3DPW-Occ) have demonstrated its effectiveness.

Via

Access Paper or Ask Questions

PS-CAD: Local Geometry Guidance via Prompting and Selection for CAD Reconstruction

May 24, 2024

Bingchen Yang, Haiyong Jiang, Hao Pan, Peter Wonka, Jun Xiao, Guosheng Lin

Abstract:Reverse engineering CAD models from raw geometry is a classic but challenging research problem. In particular, reconstructing the CAD modeling sequence from point clouds provides great interpretability and convenience for editing. To improve upon this problem, we introduce geometric guidance into the reconstruction network. Our proposed model, PS-CAD, reconstructs the CAD modeling sequence one step at a time. At each step, we provide two forms of geometric guidance. First, we provide the geometry of surfaces where the current reconstruction differs from the complete model as a point cloud. This helps the framework to focus on regions that still need work. Second, we use geometric analysis to extract a set of planar prompts, that correspond to candidate surfaces where a CAD extrusion step could be started. Our framework has three major components. Geometric guidance computation extracts the two types of geometric guidance. Single-step reconstruction computes a single candidate CAD modeling step for each provided prompt. Single-step selection selects among the candidate CAD modeling steps. The process continues until the reconstruction is completed. Our quantitative results show a significant improvement across all metrics. For example, on the dataset DeepCAD, PS-CAD improves upon the best published SOTA method by reducing the geometry errors (CD and HD) by 10%, and the structural error (ECD metric) by about 15%.

Via

Access Paper or Ask Questions

FreeTuner: Any Subject in Any Style with Training-free Diffusion

May 23, 2024

Youcan Xu, Zhen Wang, Jun Xiao, Wei Liu, Long Chen

Abstract:With the advance of diffusion models, various personalized image generation methods have been proposed. However, almost all existing work only focuses on either subject-driven or style-driven personalization. Meanwhile, state-of-the-art methods face several challenges in realizing compositional personalization, i.e., composing different subject and style concepts, such as concept disentanglement, unified reconstruction paradigm, and insufficient training data. To address these issues, we introduce FreeTuner, a flexible and training-free method for compositional personalization that can generate any user-provided subject in any user-provided style (see Figure 1). Our approach employs a disentanglement strategy that separates the generation process into two stages to effectively mitigate concept entanglement. FreeTuner leverages the intermediate features within the diffusion model for subject concept representation and introduces style guidance to align the synthesized images with the style concept, ensuring the preservation of both the subject's structure and the style's aesthetic features. Extensive experiments have demonstrated the generation ability of FreeTuner across various personalization settings.

Via

Access Paper or Ask Questions

Neural Interaction Energy for Multi-Agent Trajectory Prediction

Apr 25, 2024

Kaixin Shen, Ruijie Quan, Linchao Zhu, Jun Xiao, Yi Yang

Figure 1 for Neural Interaction Energy for Multi-Agent Trajectory Prediction

Figure 2 for Neural Interaction Energy for Multi-Agent Trajectory Prediction

Figure 3 for Neural Interaction Energy for Multi-Agent Trajectory Prediction

Figure 4 for Neural Interaction Energy for Multi-Agent Trajectory Prediction

Abstract:Maintaining temporal stability is crucial in multi-agent trajectory prediction. Insufficient regularization to uphold this stability often results in fluctuations in kinematic states, leading to inconsistent predictions and the amplification of errors. In this study, we introduce a framework called Multi-Agent Trajectory prediction via neural interaction Energy (MATE). This framework assesses the interactive motion of agents by employing neural interaction energy, which captures the dynamics of interactions and illustrates their influence on the future trajectories of agents. To bolster temporal stability, we introduce two constraints: inter-agent interaction constraint and intra-agent motion constraint. These constraints work together to ensure temporal stability at both the system and agent levels, effectively mitigating prediction fluctuations inherent in multi-agent systems. Comparative evaluations against previous methods on four diverse datasets highlight the superior prediction accuracy and generalization capabilities of our model.

Via

Access Paper or Ask Questions

AudioScenic: Audio-Driven Video Scene Editing

Apr 25, 2024

Kaixin Shen, Ruijie Quan, Linchao Zhu, Jun Xiao, Yi Yang

Figure 1 for AudioScenic: Audio-Driven Video Scene Editing

Figure 2 for AudioScenic: Audio-Driven Video Scene Editing

Figure 3 for AudioScenic: Audio-Driven Video Scene Editing

Figure 4 for AudioScenic: Audio-Driven Video Scene Editing

Abstract:Audio-driven visual scene editing endeavors to manipulate the visual background while leaving the foreground content unchanged, according to the given audio signals. Unlike current efforts focusing primarily on image editing, audio-driven video scene editing has not been extensively addressed. In this paper, we introduce AudioScenic, an audio-driven framework designed for video scene editing. AudioScenic integrates audio semantics into the visual scene through a temporal-aware audio semantic injection process. As our focus is on background editing, we further introduce a SceneMasker module, which maintains the integrity of the foreground content during the editing process. AudioScenic exploits the inherent properties of audio, namely, audio magnitude and frequency, to guide the editing process, aiming to control the temporal dynamics and enhance the temporal consistency. First, we present an audio Magnitude Modulator module that adjusts the temporal dynamics of the scene in response to changes in audio magnitude, enhancing the visual dynamics. Second, the audio Frequency Fuser module is designed to ensure temporal consistency by aligning the frequency of the audio with the dynamics of the video scenes, thus improving the overall temporal coherence of the edited videos. These integrated features enable AudioScenic to not only enhance visual diversity but also maintain temporal consistency throughout the video. We present a new metric named temporal score for more comprehensive validation of temporal consistency. We demonstrate substantial advancements of AudioScenic over competing methods on DAVIS and Audioset datasets.

Via

Access Paper or Ask Questions

Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer

Apr 24, 2024

Jiaming Lei, Lin Li, Chunping Wang, Jun Xiao, Long Chen

Figure 1 for Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer

Figure 2 for Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer

Figure 3 for Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer

Figure 4 for Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer

Abstract:Benefiting from strong generalization ability, pre-trained vision language models (VLMs), e.g., CLIP, have been widely utilized in zero-shot scene understanding. Unlike simple recognition tasks, grounded situation recognition (GSR) requires the model not only to classify salient activity (verb) in the image, but also to detect all semantic roles that participate in the action. This complex task usually involves three steps: verb recognition, semantic role grounding, and noun recognition. Directly employing class-based prompts with VLMs and grounding models for this task suffers from several limitations, e.g., it struggles to distinguish ambiguous verb concepts, accurately localize roles with fixed verb-centric template1 input, and achieve context-aware noun predictions. In this paper, we argue that these limitations stem from the mode's poor understanding of verb/noun classes. To this end, we introduce a new approach for zero-shot GSR via Language EXplainer (LEX), which significantly boosts the model's comprehensive capabilities through three explainers: 1) verb explainer, which generates general verb-centric descriptions to enhance the discriminability of different verb classes; 2) grounding explainer, which rephrases verb-centric templates for clearer understanding, thereby enhancing precise semantic role localization; and 3) noun explainer, which creates scene-specific noun descriptions to ensure context-aware noun recognition. By equipping each step of the GSR process with an auxiliary explainer, LEX facilitates complex scene understanding in real-world scenarios. Our extensive validations on the SWiG dataset demonstrate LEX's effectiveness and interoperability in zero-shot GSR.

Via

Access Paper or Ask Questions

Existence Is Chaos: Enhancing 3D Human Motion Prediction with Uncertainty Consideration

Mar 21, 2024

Zhihao Wang, Yulin Zhou, Ningyu Zhang, Xiaosong Yang, Jun Xiao, Zhao Wang

Figure 1 for Existence Is Chaos: Enhancing 3D Human Motion Prediction with Uncertainty Consideration

Figure 2 for Existence Is Chaos: Enhancing 3D Human Motion Prediction with Uncertainty Consideration

Figure 3 for Existence Is Chaos: Enhancing 3D Human Motion Prediction with Uncertainty Consideration

Figure 4 for Existence Is Chaos: Enhancing 3D Human Motion Prediction with Uncertainty Consideration

Abstract:Human motion prediction is consisting in forecasting future body poses from historically observed sequences. It is a longstanding challenge due to motion's complex dynamics and uncertainty. Existing methods focus on building up complicated neural networks to model the motion dynamics. The predicted results are required to be strictly similar to the training samples with L2 loss in current training pipeline. However, little attention has been paid to the uncertainty property which is crucial to the prediction task. We argue that the recorded motion in training data could be an observation of possible future, rather than a predetermined result. In addition, existing works calculate the predicted error on each future frame equally during training, while recent work indicated that different frames could play different roles. In this work, a novel computationally efficient encoder-decoder model with uncertainty consideration is proposed, which could learn proper characteristics for future frames by a dynamic function. Experimental results on benchmark datasets demonstrate that our uncertainty consideration approach has obvious advantages both in quantity and quality. Moreover, the proposed method could produce motion sequences with much better quality that avoids the intractable shaking artefacts. We believe our work could provide a novel perspective to consider the uncertainty quality for the general motion prediction task and encourage the studies in this field. The code will be available in https://github.com/Motionpre/Adaptive-Salient-Loss-SAGGB.

* Accepted by AAAI2024

Via

Access Paper or Ask Questions

Distributionally Generative Augmentation for Fair Facial Attribute Classification

Mar 11, 2024

Fengda Zhang, Qianpei He, Kun Kuang, Jiashuo Liu, Long Chen, Chao Wu, Jun Xiao, Hanwang Zhang

Figure 1 for Distributionally Generative Augmentation for Fair Facial Attribute Classification

Figure 2 for Distributionally Generative Augmentation for Fair Facial Attribute Classification

Figure 3 for Distributionally Generative Augmentation for Fair Facial Attribute Classification

Figure 4 for Distributionally Generative Augmentation for Fair Facial Attribute Classification

Abstract:Facial Attribute Classification (FAC) holds substantial promise in widespread applications. However, FAC models trained by traditional methodologies can be unfair by exhibiting accuracy inconsistencies across varied data subpopulations. This unfairness is largely attributed to bias in data, where some spurious attributes (e.g., Male) statistically correlate with the target attribute (e.g., Smiling). Most of existing fairness-aware methods rely on the labels of spurious attributes, which may be unavailable in practice. This work proposes a novel, generation-based two-stage framework to train a fair FAC model on biased data without additional annotation. Initially, we identify the potential spurious attributes based on generative models. Notably, it enhances interpretability by explicitly showing the spurious attributes in image space. Following this, for each image, we first edit the spurious attributes with a random degree sampled from a uniform distribution, while keeping target attribute unchanged. Then we train a fair FAC model by fostering model invariance to these augmentation. Extensive experiments on three common datasets demonstrate the effectiveness of our method in promoting fairness in FAC without compromising accuracy. Codes are in https://github.com/heqianpei/DiGA.

* CVPR 2024

Via

Access Paper or Ask Questions