Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xiangyang Ji

Tsinghua University, Beijing, China

Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

Jul 15, 2024

Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Runyi Yu, Chang Liu, Xiangyang Ji, Li Yuan, Jie Chen

Figure 1 for Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

Figure 2 for Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

Figure 3 for Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

Figure 4 for Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

Abstract:Text-to-motion generation requires not only grounding local actions in language but also seamlessly blending these individual actions to synthesize diverse and realistic global motions. However, existing motion generation methods primarily focus on the direct synthesis of global motions while neglecting the importance of generating and controlling local actions. In this paper, we propose the local action-guided motion diffusion model, which facilitates global motion generation by utilizing local actions as fine-grained control signals. Specifically, we provide an automated method for reference local action sampling and leverage graph attention networks to assess the guiding weight of each local action in the overall motion synthesis. During the diffusion process for synthesizing global motion, we calculate the local-action gradient to provide conditional guidance. This local-to-global paradigm reduces the complexity associated with direct global motion generation and promotes motion diversity via sampling diverse actions as conditions. Extensive experiments on two human motion datasets, i.e., HumanML3D and KIT, demonstrate the effectiveness of our method. Furthermore, our method provides flexibility in seamlessly combining various local actions and continuous guiding weight adjustment, accommodating diverse user preferences, which may hold potential significance for the community. The project page is available at https://jpthu17.github.io/GuidedMotion-project/.

* Accepted by ECCV 2024

Via

Access Paper or Ask Questions

Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace

Jun 30, 2024

Shian Du, Xiaotian Cheng, Qi Qian, Henglu Wei, Yi Xu, Xiangyang Ji

Figure 1 for Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace

Figure 2 for Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace

Figure 3 for Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace

Figure 4 for Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace

Abstract:Personalized text-to-image generation has attracted unprecedented attention in the recent few years due to its unique capability of generating highly-personalized images via using the input concept dataset and novel textual prompt. However, previous methods solely focus on the performance of the reconstruction task, degrading its ability to combine with different textual prompt. Besides, optimizing in the high-dimensional embedding space usually leads to unnecessary time-consuming training process and slow convergence. To address these issues, we propose an efficient method to explore the target embedding in a textual subspace, drawing inspiration from the self-expressiveness property. Additionally, we propose an efficient selection strategy for determining the basis vectors of the textual subspace. The experimental evaluations demonstrate that the learned embedding can not only faithfully reconstruct input image, but also significantly improves its alignment with novel input textual prompt. Furthermore, we observe that optimizing in the textual subspace leads to an significant improvement of the robustness to the initial word, relaxing the constraint that requires users to input the most relevant initial word. Our method opens the door to more efficient representation learning for personalized text-to-image generation.

Via

Access Paper or Ask Questions

Technique Report of CVPR 2024 PBDL Challenges

Jun 15, 2024

Ying Fu, Yu Li, Shaodi You, Boxin Shi, Jose Alvarez, Coert van Gemeren, Linwei Chen, Yunhao Zou, Zichun Wang, Yichen Li(+91 more)

Figure 1 for Technique Report of CVPR 2024 PBDL Challenges

Figure 2 for Technique Report of CVPR 2024 PBDL Challenges

Figure 3 for Technique Report of CVPR 2024 PBDL Challenges

Figure 4 for Technique Report of CVPR 2024 PBDL Challenges

Abstract:The intersection of physics-based vision and deep learning presents an exciting frontier for advancing computer vision technologies. By leveraging the principles of physics to inform and enhance deep learning models, we can develop more robust and accurate vision systems. Physics-based vision aims to invert the processes to recover scene properties such as shape, reflectance, light distribution, and medium properties from images. In recent years, deep learning has shown promising improvements for various vision tasks, and when combined with physics-based vision, these approaches can enhance the robustness and accuracy of vision systems. This technical report summarizes the outcomes of the Physics-Based Vision Meets Deep Learning (PBDL) 2024 challenge, held in CVPR 2024 workshop. The challenge consisted of eight tracks, focusing on Low-Light Enhancement and Detection as well as High Dynamic Range (HDR) Imaging. This report details the objectives, methodologies, and results of each track, highlighting the top-performing solutions and their innovative approaches.

* CVPR 2024 Workshop - PBDL Challenge Report

Via

Access Paper or Ask Questions

Spatial Annealing Smoothing for Efficient Few-shot Neural Rendering

Jun 12, 2024

Yuru Xiao, Xianming Liu, Deming Zhai, Kui Jiang, Junjun Jiang, Xiangyang Ji

Abstract:Neural Radiance Fields (NeRF) with hybrid representations have shown impressive capabilities in reconstructing scenes for view synthesis, delivering high efficiency. Nonetheless, their performance significantly drops with sparse view inputs, due to the issue of overfitting. While various regularization strategies have been devised to address these challenges, they often depend on inefficient assumptions or are not compatible with hybrid models. There is a clear need for a method that maintains efficiency and improves resilience to sparse views within a hybrid framework. In this paper, we introduce an accurate and efficient few-shot neural rendering method named Spatial Annealing smoothing regularized NeRF (SANeRF), which is specifically designed for a pre-filtering-driven hybrid representation architecture. We implement an exponential reduction of the sample space size from an initially large value. This methodology is crucial for stabilizing the early stages of the training phase and significantly contributes to the enhancement of the subsequent process of detail refinement. Our extensive experiments reveal that, by adding merely one line of code, SANeRF delivers superior rendering quality and much faster reconstruction speed compared to current few-shot NeRF methods. Notably, SANeRF outperforms FreeNeRF by 0.3 dB in PSNR on the Blender dataset, while achieving 700x faster reconstruction speed.

Via

Access Paper or Ask Questions

CompetEvo: Towards Morphological Evolution from Competition

May 28, 2024

Kangyao Huang, Di Guo, Xinyu Zhang, Xiangyang Ji, Huaping Liu

Figure 1 for CompetEvo: Towards Morphological Evolution from Competition

Figure 2 for CompetEvo: Towards Morphological Evolution from Competition

Figure 3 for CompetEvo: Towards Morphological Evolution from Competition

Figure 4 for CompetEvo: Towards Morphological Evolution from Competition

Abstract:Training an agent to adapt to specific tasks through co-optimization of morphology and control has widely attracted attention. However, whether there exists an optimal configuration and tactics for agents in a multiagent competition scenario is still an issue that is challenging to definitively conclude. In this context, we propose competitive evolution (CompetEvo), which co-evolves agents' designs and tactics in confrontation. We build arenas consisting of three animals and their evolved derivatives, placing agents with different morphologies in direct competition with each other. The results reveal that our method enables agents to evolve a more suitable design and strategy for fighting compared to fixed-morph agents, allowing them to obtain advantages in combat scenarios. Moreover, we demonstrate the amazing and impressive behaviors that emerge when confrontations are conducted under asymmetrical morphs.

Via

Access Paper or Ask Questions

The Pitfalls and Promise of Conformal Inference Under Adversarial Attacks

May 14, 2024

Ziquan Liu, Yufei Cui, Yan Yan, Yi Xu, Xiangyang Ji, Xue Liu, Antoni B. Chan

Figure 1 for The Pitfalls and Promise of Conformal Inference Under Adversarial Attacks

Figure 2 for The Pitfalls and Promise of Conformal Inference Under Adversarial Attacks

Figure 3 for The Pitfalls and Promise of Conformal Inference Under Adversarial Attacks

Figure 4 for The Pitfalls and Promise of Conformal Inference Under Adversarial Attacks

Abstract:In safety-critical applications such as medical imaging and autonomous driving, where decisions have profound implications for patient health and road safety, it is imperative to maintain both high adversarial robustness to protect against potential adversarial attacks and reliable uncertainty quantification in decision-making. With extensive research focused on enhancing adversarial robustness through various forms of adversarial training (AT), a notable knowledge gap remains concerning the uncertainty inherent in adversarially trained models. To address this gap, this study investigates the uncertainty of deep learning models by examining the performance of conformal prediction (CP) in the context of standard adversarial attacks within the adversarial defense community. It is first unveiled that existing CP methods do not produce informative prediction sets under the commonly used $l_{\infty}$-norm bounded attack if the model is not adversarially trained, which underpins the importance of adversarial training for CP. Our paper next demonstrates that the prediction set size (PSS) of CP using adversarially trained models with AT variants is often worse than using standard AT, inspiring us to research into CP-efficient AT for improved PSS. We propose to optimize a Beta-weighting loss with an entropy minimization regularizer during AT to improve CP-efficiency, where the Beta-weighting loss is shown to be an upper bound of PSS at the population level by our theoretical analysis. Moreover, our empirical study on four image classification datasets across three popular AT baselines validates the effectiveness of the proposed Uncertainty-Reducing AT (AT-UR).

* ICML2024

Via

Access Paper or Ask Questions

Mesh Denoising Transformer

May 10, 2024

Wenbo Zhao, Xianming Liu, Deming Zhai, Junjun Jiang, Xiangyang Ji

Abstract:Mesh denoising, aimed at removing noise from input meshes while preserving their feature structures, is a practical yet challenging task. Despite the remarkable progress in learning-based mesh denoising methodologies in recent years, their network designs often encounter two principal drawbacks: a dependence on single-modal geometric representations, which fall short in capturing the multifaceted attributes of meshes, and a lack of effective global feature aggregation, hindering their ability to fully understand the mesh's comprehensive structure. To tackle these issues, we propose SurfaceFormer, a pioneering Transformer-based mesh denoising framework. Our first contribution is the development of a new representation known as Local Surface Descriptor, which is crafted by establishing polar systems on each mesh face, followed by sampling points from adjacent surfaces using geodesics. The normals of these points are organized into 2D patches, mimicking images to capture local geometric intricacies, whereas the poles and vertex coordinates are consolidated into a point cloud to embody spatial information. This advancement surmounts the hurdles posed by the irregular and non-Euclidean characteristics of mesh data, facilitating a smooth integration with Transformer architecture. Next, we propose a dual-stream structure consisting of a Geometric Encoder branch and a Spatial Encoder branch, which jointly encode local geometry details and spatial information to fully explore multimodal information for mesh denoising. A subsequent Denoising Transformer module receives the multimodal information and achieves efficient global feature aggregation through self-attention operators. Our experimental evaluations demonstrate that this novel approach outperforms existing state-of-the-art methods in both objective and subjective assessments, marking a significant leap forward in mesh denoising.

Via

Access Paper or Ask Questions

RaSim: A Range-aware High-fidelity RGB-D Data Simulation Pipeline for Real-world Applications

Apr 05, 2024

Xingyu Liu, Chenyangguang Zhang, Gu Wang, Ruida Zhang, Xiangyang Ji

Figure 1 for RaSim: A Range-aware High-fidelity RGB-D Data Simulation Pipeline for Real-world Applications

Figure 2 for RaSim: A Range-aware High-fidelity RGB-D Data Simulation Pipeline for Real-world Applications

Figure 3 for RaSim: A Range-aware High-fidelity RGB-D Data Simulation Pipeline for Real-world Applications

Figure 4 for RaSim: A Range-aware High-fidelity RGB-D Data Simulation Pipeline for Real-world Applications

Abstract:In robotic vision, a de-facto paradigm is to learn in simulated environments and then transfer to real-world applications, which poses an essential challenge in bridging the sim-to-real domain gap. While mainstream works tackle this problem in the RGB domain, we focus on depth data synthesis and develop a range-aware RGB-D data simulation pipeline (RaSim). In particular, high-fidelity depth data is generated by imitating the imaging principle of real-world sensors. A range-aware rendering strategy is further introduced to enrich data diversity. Extensive experiments show that models trained with RaSim can be directly applied to real-world scenarios without any finetuning and excel at downstream RGB-D perception tasks.

* accepted by ICRA'24

Via

Access Paper or Ask Questions

SGCNeRF: Few-Shot Neural Rendering via Sparse Geometric Consistency Guidance

Apr 01, 2024

Yuru Xiao, Xianming Liu, Deming Zhai, Kui Jiang, Junjun Jiang, Xiangyang Ji

Abstract:Neural Radiance Field (NeRF) technology has made significant strides in creating novel viewpoints. However, its effectiveness is hampered when working with sparsely available views, often leading to performance dips due to overfitting. FreeNeRF attempts to overcome this limitation by integrating implicit geometry regularization, which incrementally improves both geometry and textures. Nonetheless, an initial low positional encoding bandwidth results in the exclusion of high-frequency elements. The quest for a holistic approach that simultaneously addresses overfitting and the preservation of high-frequency details remains ongoing. This study introduces a novel feature matching based sparse geometry regularization module. This module excels in pinpointing high-frequency keypoints, thereby safeguarding the integrity of fine details. Through progressive refinement of geometry and textures across NeRF iterations, we unveil an effective few-shot neural rendering architecture, designated as SGCNeRF, for enhanced novel view synthesis. Our experiments demonstrate that SGCNeRF not only achieves superior geometry-consistent outcomes but also surpasses FreeNeRF, with improvements of 0.7 dB and 0.6 dB in PSNR on the LLFF and DTU datasets, respectively.

Via

Access Paper or Ask Questions

ParCo: Part-Coordinating Text-to-Motion Synthesis

Mar 27, 2024

Qiran Zou, Shangyuan Yuan, Shian Du, Yu Wang, Chang Liu, Yi Xu, Jie Chen, Xiangyang Ji

Figure 1 for ParCo: Part-Coordinating Text-to-Motion Synthesis

Figure 2 for ParCo: Part-Coordinating Text-to-Motion Synthesis

Figure 3 for ParCo: Part-Coordinating Text-to-Motion Synthesis

Figure 4 for ParCo: Part-Coordinating Text-to-Motion Synthesis

Abstract:We study a challenging task: text-to-motion synthesis, aiming to generate motions that align with textual descriptions and exhibit coordinated movements. Currently, the part-based methods introduce part partition into the motion synthesis process to achieve finer-grained generation. However, these methods encounter challenges such as the lack of coordination between different part motions and difficulties for networks to understand part concepts. Moreover, introducing finer-grained part concepts poses computational complexity challenges. In this paper, we propose Part-Coordinating Text-to-Motion Synthesis (ParCo), endowed with enhanced capabilities for understanding part motions and communication among different part motion generators, ensuring a coordinated and fined-grained motion synthesis. Specifically, we discretize whole-body motion into multiple part motions to establish the prior concept of different parts. Afterward, we employ multiple lightweight generators designed to synthesize different part motions and coordinate them through our part coordination module. Our approach demonstrates superior performance on common benchmarks with economic computations, including HumanML3D and KIT-ML, providing substantial evidence of its effectiveness. Code is available at https://github.com/qrzou/ParCo .

Via

Access Paper or Ask Questions