Chao Xu

School of Software, Tianjin University

Universal Trajectory Optimization Framework for Differential-Driven Robot Class

Sep 12, 2024

Mengke Zhang, Zhichao Han, Chao Xu, Fei Gao, Yanjun Cao

Abstract:Differential-driven robots are widely used in various scenarios thanks to their straightforward principle, from household service robots to disaster response field robots. There are several different types of driving mechanisms in real-world applications, including two-wheeled, four-wheeled skid-steering, and tracked robots. The differences in the driving mechanism usually require specific kinematic modeling when precise control is desired. Furthermore, the nonholonomic dynamics and possible lateral slip lead to different degrees of difficulty in obtaining feasible and high-quality trajectories. Therefore, a comprehensive trajectory optimization framework to compute trajectories efficiently for various kinds of differential-driven robots is highly desirable. In this paper, we propose a universal trajectory optimization framework that can be applied to the differential-driven robot class, enabling the generation of high-quality trajectories within a restricted computational timeframe. We introduce a novel trajectory representation based on polynomial parameterization of motion states or their integrals, such as angular and linear velocities, which inherently matches the robots' motion to the control principle of the differential-driven robot class. The trajectory optimization problem is formulated to minimize complexity while prioritizing safety and operational efficiency. We then build a full-stack autonomous planning and control system to show the feasibility and robustness. We conduct extensive simulations and real-world testing in crowded environments with three kinds of differential-driven robots to validate the effectiveness of our approach. We will release our method as an open-source package.

* 15 pages, 15 figures
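
As a rough illustration of what a trajectory parameterized by polynomial linear and angular velocities looks like, the sketch below rolls out the standard differential-drive (unicycle) kinematics for given velocity polynomials. It is not the authors' representation or optimizer: the function name `rollout`, the coefficient ordering, and the trapezoidal integration are all assumptions made for this example.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def rollout(v_coeffs, w_coeffs, duration, steps=1000, pose0=(0.0, 0.0, 0.0)):
    """Integrate differential-drive kinematics for polynomial v(t), omega(t).

    Hypothetical helper: coefficients are ordered from the constant term
    upward. Returns arrays t, x, y, theta of length steps + 1.
    """
    x0, y0, theta0 = pose0
    t = np.linspace(0.0, duration, steps + 1)
    v = P.polyval(t, v_coeffs)

    # Heading: exact antiderivative of the angular-velocity polynomial.
    theta = theta0 + P.polyval(t, P.polyint(w_coeffs))

    # Position: x' = v cos(theta), y' = v sin(theta) has no closed form in
    # general, so accumulate it with the trapezoidal rule.
    dt = duration / steps
    dx = v * np.cos(theta)
    dy = v * np.sin(theta)
    x = x0 + np.concatenate(([0.0], np.cumsum(0.5 * (dx[1:] + dx[:-1]) * dt)))
    y = y0 + np.concatenate(([0.0], np.cumsum(0.5 * (dy[1:] + dy[:-1]) * dt)))
    return t, x, y, theta
```

For instance, `rollout([1.0], [1.0], np.pi)` traces half of a unit circle starting at the origin and ending near (0, 2). Because the heading is an exact polynomial, only the position needs numerical integration.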

MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model

Aug 19, 2024

Minghua Liu, Chong Zeng, Xinyue Wei, Ruoxi Shi, Linghao Chen, Chao Xu, Mengqi Zhang, Zhaoning Wang, Xiaoshuai Zhang, Isabella Liu (+2 more)

Abstract:Open-world 3D reconstruction models have recently garnered significant attention. However, without sufficient 3D inductive bias, existing methods typically entail expensive training costs and struggle to extract high-quality 3D meshes. In this work, we introduce MeshFormer, a sparse-view reconstruction model that explicitly leverages 3D native structure, input guidance, and training supervision. Specifically, instead of using a triplane representation, we store features in 3D sparse voxels and combine transformers with 3D convolutions to leverage an explicit 3D structure and projective bias. In addition to sparse-view RGB input, we require the network to take input and generate corresponding normal maps. The input normal maps can be predicted by 2D diffusion models, significantly aiding in the guidance and refinement of the geometry's learning. Moreover, by combining Signed Distance Function (SDF) supervision with surface rendering, we directly learn to generate high-quality meshes without the need for complex multi-stage training processes. By incorporating these explicit 3D biases, MeshFormer can be trained efficiently and deliver high-quality textured meshes with fine-grained geometric details. It can also be integrated with 2D diffusion models to enable fast single-image-to-3D and text-to-3D tasks. Project page: https://meshformer3d.github.io

* 20 pages, 9 figures

SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views

Aug 19, 2024

Chao Xu, Ang Li, Linghao Chen, Yulin Liu, Ruoxi Shi, Hao Su, Minghua Liu

Abstract:Open-world 3D generation has recently attracted considerable attention. While many single-image-to-3D methods have yielded visually appealing outcomes, they often lack sufficient controllability and tend to produce hallucinated regions that may not align with users' expectations. In this paper, we explore an important scenario in which the input consists of one or a few unposed 2D images of a single object, with little or no overlap. We propose a novel method, SpaRP, to reconstruct a 3D textured mesh and estimate the relative camera poses for these sparse-view images. SpaRP distills knowledge from 2D diffusion models and finetunes them to implicitly deduce the 3D spatial relationships between the sparse views. The diffusion model is trained to jointly predict surrogate representations for camera poses and multi-view images of the object under known poses, integrating all information from the input sparse views. These predictions are then leveraged to accomplish 3D reconstruction and pose estimation, and the reconstructed 3D model can be used to further refine the camera poses of input views. Through extensive experiments on three datasets, we demonstrate that our method not only significantly outperforms baseline methods in terms of 3D reconstruction quality and pose prediction accuracy but also exhibits strong efficiency. It requires only about 20 seconds to produce a textured mesh and camera poses for the input views. Project page: https://chaoxu.xyz/sparp.

* ECCV 2024

Combo: Co-speech holistic 3D human motion generation and efficient customizable adaptation in harmony

Aug 18, 2024

Chao Xu, Mingze Sun, Zhi-Qi Cheng, Fei Wang, Yang Liu, Baigui Sun, Ruqi Huang, Alexander Hauptmann

Abstract:In this paper, we propose a novel framework, Combo, for harmonious co-speech holistic 3D human motion generation and efficient customizable adaptation. In particular, we identify one fundamental challenge as the multiple-input-multiple-output (MIMO) nature of the generative model of interest. More concretely, on the input end, the model typically consumes both speech signals and character guidance (e.g., identity and emotion), which not only poses a challenge to learning capacity but also hinders further adaptation to varying guidance; on the output end, holistic human motions mainly consist of facial expressions and body movements, which are inherently correlated but non-trivial to coordinate in the current data-driven generation process. In response to the above challenge, we propose tailored designs for both ends. For the former, we propose to pre-train on data regarding a fixed identity with neutral emotion, and defer the incorporation of customizable conditions (identity and emotion) to the fine-tuning stage, which is boosted by our novel X-Adapter for parameter-efficient fine-tuning. For the latter, we propose a simple yet effective transformer design, DU-Trans, which first divides into two branches to learn individual features of facial expressions and body movements, and then unites those to learn a joint bi-directional distribution and directly predicts combined coefficients. Evaluated on BEAT2 and SHOW datasets, Combo is not only highly effective in generating high-quality motions but also efficient in transferring identity and emotion. Project website: https://xc-csc101.github.io/combo/

Optical RISs Improve the Secret Key Rate of Free-Space QKD in HAP-to-UAV Scenarios

Aug 12, 2024

Phuc V. Trinh, Shinya Sugiura, Chao Xu, Lajos Hanzo

Abstract:Large optical reconfigurable intelligent surfaces (ORISs) are proposed for employment on building rooftops to facilitate free-space quantum key distribution (QKD) between high-altitude platforms (HAPs) and low-altitude platforms (LAPs). Due to practical constraints, the communication terminals can only be positioned beneath the LAPs, preventing direct upward links to HAPs. By deploying ORISs on rooftops to reflect the beam arriving from HAPs towards LAPs from below, reliable HAP-to-LAP links can be established. To accurately characterize the optical beam propagation, we develop an analytical channel model based on extended Huygens-Fresnel principles for representing both the atmospheric turbulence effects and the hovering fluctuations of LAPs. This model facilitates adaptive ORIS beam-width control through linear, quadratic, and focusing phase shifts, which are capable of effectively mitigating the detrimental effects of beam broadening and pointing errors (PE). Furthermore, we derive a closed-form expression for the information-theoretic bound of the QKD secret key rate (SKR) of the HAP-to-LAP links. Our findings demonstrate that quadratic phase shifts enhance the SKR at high HAP-ORIS zenith angles or mild PE conditions by narrowing the beam to optimal sizes. By contrast, linear phase shifts are advantageous at low HAP-ORIS zenith angles under moderate-to-high PE by diverging the beam to mitigate LAP fluctuations.

* 16 pages, 7 figures, 3 tables

LF-3PM: a LiDAR-based Framework for Perception-aware Planning with Perturbation-induced Metric

Aug 03, 2024

Kaixin Chai, Long Xu, Qianhao Wang, Chao Xu, Peng Yin, Fei Gao

Abstract:Just as humans can become disoriented in featureless deserts or thick fogs, not all environments are conducive to the Localization Accuracy and Stability (LAS) of autonomous robots. This paper introduces an efficient framework designed to enhance LiDAR-based LAS through strategic trajectory generation, known as Perception-aware Planning. Unlike vision-based frameworks, LiDAR-based frameworks require different considerations due to unique sensor attributes. Our approach focuses on two main aspects. Firstly, we assess the impact of LiDAR observations on LAS, introducing a perturbation-induced metric to provide a comprehensive and reliable evaluation of LiDAR observations. Secondly, we aim to improve motion planning efficiency. By creating a Static Observation Loss Map (SOLM) as an intermediary, we logically separate the time-intensive evaluation and motion planning phases, significantly speeding up the planning process. In the experimental section, we demonstrate the effectiveness of the proposed metrics across various scenes and the characteristics of trajectories guided by different metrics. Ultimately, our framework is tested in a real-world scenario, enabling the robot to actively choose topologies and orientations preferable for localization. The source code is accessible at https://github.com/ZJU-FAST-Lab/LF-3PM.

WING: Wheel-Inertial Neural Odometry with Ground Manifold Constraints

Jul 14, 2024

Chenxing Jiang, Kunyi Zhang, Sheng Yang, Shaojie Shen, Chao Xu, Fei Gao

Abstract:In this paper, we propose an interoceptive-only odometry system for ground robots with neural network processing and soft constraints based on the assumption of a globally continuous ground manifold. Exteroceptive sensors such as cameras, GPS and LiDAR may encounter difficulties in scenarios with poor illumination, indoor environments, dusty areas and straight tunnels. Therefore, improving the pose estimation accuracy using only interoceptive sensors is important to enhance the reliability of the navigation system even in the degraded scenarios mentioned above. However, interoceptive sensors like IMU and wheel encoders suffer from large drift due to noisy measurements. To overcome these challenges, the proposed system trains deep neural networks to correct the measurements from IMU and wheel encoders, while considering their uncertainty. Moreover, because ground robots can only travel on the ground, we model the ground surface as a globally continuous manifold using a dual cubic B-spline manifold to further improve the estimation accuracy through this soft constraint. A novel space-based sliding-window filtering framework is proposed to fully exploit the $C^2$ continuity of ground manifold soft constraints and fuse all the information from raw measurements and neural networks in a yaw-independent attitude convention. Extensive experiments demonstrate that our proposed approach can outperform state-of-the-art learning-based interoceptive-only odometry methods.

* Accepted by IEEE Transactions on Intelligent Vehicles

MeshAvatar: Learning High-quality Triangular Human Avatars from Multi-view Videos

Jul 11, 2024

Yushuo Chen, Zerong Zheng, Zhe Li, Chao Xu, Yebin Liu

Abstract:We present a novel pipeline for learning high-quality triangular human avatars from multi-view videos. Recent methods for avatar learning are typically based on neural radiance fields (NeRF), which are not compatible with the traditional graphics pipeline and pose great challenges for operations like editing or synthesizing under different environments. To overcome these limitations, our method represents the avatar with an explicit triangular mesh extracted from an implicit SDF field, complemented by an implicit material field conditioned on given poses. Leveraging this triangular avatar representation, we incorporate physics-based rendering to accurately decompose geometry and texture. To enhance both the geometric and appearance details, we further employ a 2D UNet as the network backbone and introduce pseudo normal ground-truth as additional supervision. Experiments show that our method can learn triangular avatars with high-quality geometry reconstruction and plausible material decomposition, inherently supporting editing, manipulation or relighting operations.

* Project Page: https://shad0wta9.github.io/meshavatar-page/

Instruct-IPT: All-in-One Image Processing Transformer via Weight Modulation

Jun 30, 2024

Yuchuan Tian, Jianhong Han, Hanting Chen, Yuanyuan Xi, Guoyang Zhang, Jie Hu, Chao Xu, Yunhe Wang

Abstract:Due to the unaffordable size and intensive computation costs of low-level vision models, All-in-One models that are designed to address a handful of low-level vision tasks simultaneously have been popular. However, existing All-in-One models are limited in terms of the range of tasks and performance. To overcome these limitations, we propose Instruct-IPT -- an All-in-One Image Processing Transformer that could effectively address manifold image restoration tasks with large inter-task gaps, such as denoising, deblurring, deraining, dehazing, and desnowing. Rather than popular feature adaptation methods, we propose weight modulation that adapts weights to specific tasks. Firstly, we figure out task-sensitive weights via a toy experiment and introduce task-specific biases on top of them. Secondly, we conduct rank analysis for a good compression strategy and perform low-rank decomposition on the biases. Thirdly, we propose synchronous training that updates the task-general backbone model and the task-specific biases simultaneously. In this way, the model is instructed to learn general and task-specific knowledge. Via our simple yet effective method that instructs the IPT to be task experts, Instruct-IPT could better cooperate between tasks with distinct characteristics at humble costs. Further, we propose to maneuver Instruct-IPT with text instructions for better user interfaces. We have conducted experiments on Instruct-IPT to demonstrate the effectiveness of our method on manifold tasks, and we have effectively extended our method to diffusion denoisers as well. The code is available at https://github.com/huawei-noah/Pretrained-IPT.

* 15 pages, 4 figures
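
To make the idea of weight modulation with low-rank task-specific biases more concrete, here is a minimal NumPy sketch of a single linear layer whose shared weight is offset by a per-task low-rank term. It only illustrates the general technique named in the abstract; the class name `ModulatedLinear`, the factor shapes, the zero initialization, and the task names are assumptions, not details taken from the Instruct-IPT code.

```python
import numpy as np


class ModulatedLinear:
    """Shared weight plus a per-task low-rank offset (illustrative only)."""

    def __init__(self, d_in, d_out, tasks, rank, seed=0):
        rng = np.random.default_rng(seed)
        # Task-general weight shared by every task.
        self.weight = rng.standard_normal((d_out, d_in)) * 0.02
        # Task-specific bias on the weight, stored as two low-rank factors.
        self.down = {t: rng.standard_normal((rank, d_in)) * 0.02 for t in tasks}
        # Assumed zero init so each task starts from the shared weight.
        self.up = {t: np.zeros((d_out, rank)) for t in tasks}

    def delta(self, task):
        # Low-rank offset for one task; its rank is at most `rank`.
        return self.up[task] @ self.down[task]

    def effective_weight(self, task):
        return self.weight + self.delta(task)

    def __call__(self, x, task):
        # x has shape (batch, d_in); output has shape (batch, d_out).
        return x @ self.effective_weight(task).T
```

A call such as `ModulatedLinear(4, 3, ["denoise", "derain"], rank=1)(x, "derain")` applies the shared weight plus the offset for that task. In a training loop, the shared weight and the per-task factors would be updated together, which corresponds loosely to the synchronous training described in the abstract.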

DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models

Jun 17, 2024

Renqiu Xia, Song Mao, Xiangchao Yan, Hongbin Zhou, Bo Zhang, Haoyang Peng, Jiahao Pi, Daocheng Fu, Wenjie Wu, Hancheng Ye (+12 more)

Abstract:Scientific documents record research findings and valuable human knowledge, comprising a vast corpus of high-quality data. Leveraging multi-modality data extracted from these documents and assessing large models' abilities to handle scientific document-oriented tasks is therefore meaningful. Despite promising advancements, large models still perform poorly on multi-page scientific document extraction and understanding tasks, and their capacity to process within-document data formats such as charts and equations remains under-explored. To address these issues, we present DocGenome, a structured document benchmark constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community, using our custom auto-labeling pipeline. DocGenome features four key characteristics: 1) Completeness: It is the first dataset to structure data from all modalities including 13 layout attributes along with their LaTeX source codes. 2) Logicality: It provides 6 logical relationships between different entities within each scientific document. 3) Diversity: It covers various document-oriented tasks, including document classification, visual grounding, document layout detection, document transformation, open-ended single-page QA and multi-page QA. 4) Correctness: It undergoes rigorous quality control checks conducted by a specialized team. We conduct extensive experiments to demonstrate the advantages of DocGenome and objectively evaluate the performance of large models on our benchmark.

* Homepage of DocGenome: https://unimodal4reasoning.github.io/DocGenome_page (22 pages, 11 figures)
