Jianqi Chen

MeshLoom: Feed-Forward Non-Rigid Registration of Mesh Sequences

Jun 15, 2026

Jianqi Chen, Jiraphon Yenphraphai, Xiangjun Tang, Sergey Tulyakov, Chaoyang Wang, Peter Wonka, Rameen Abdal

Abstract: We present MeshLoom, a feed-forward registration network that directly reconstructs vertex deformations across mesh sequences. Our approach advances non-rigid registration beyond existing models, which are typically constrained by costly per-instance optimization, narrow object categories, pairwise-only inputs, or merely intermediate outputs. The network is simple and efficient, registering multiple meshes within seconds. At its core lies a topology-aware encoder-decoder design. Specifically, we first introduce a topology-aware point representation that encodes the anchor (reference) mesh's topology into its per-vertex features. This representation strengthens the network's understanding of the anchor-mesh geometry and disambiguates points that are Euclidean-close yet geodesically distant. We then propose a multi-modal encoder that fuses this anchor-mesh representation with complementary cues from each frame, such as shape latents and image features. These multi-source signals are compressed into a compact global motion embedding that captures dense inter-frame correspondence. A lightweight decoder then queries this global embedding with the anchor-mesh point representation, retrieving per-vertex deformations at target timestamps. Through extensive experiments across diverse motions and object categories, we show that MeshLoom achieves state-of-the-art results on non-rigid registration. In addition, we find that our global embedding-then-query paradigm naturally enables the network to generate deformations at intermediate timestamps, which extends MeshLoom to motion interpolation and mesh morphing.

* Project page: https://meshloom.github.io/
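
The MeshLoom abstract distinguishes points that are Euclidean-close yet geodesically distant. The sketch below illustrates only that distinction, not MeshLoom's representation: it approximates geodesic distance by shortest paths along edges (Dijkstra), and the U-shaped toy shape, `geodesic_distances`, and all variable names are hypothetical.

```python
import heapq
import math

def geodesic_distances(vertices, edges, source):
    """Dijkstra over edges: shortest along-edge path lengths from `source`."""
    adjacency = {i: [] for i in range(len(vertices))}
    for a, b in edges:
        w = math.dist(vertices[a], vertices[b])  # edge length as weight
        adjacency[a].append((b, w))
        adjacency[b].append((a, w))
    dist = {i: math.inf for i in adjacency}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in adjacency[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

# Hypothetical toy shape: a U-shaped polyline, like two adjacent fingers.
vertices = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (0.1, 2.0), (0.1, 1.0), (0.1, 0.0)]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

euclidean = math.dist(vertices[0], vertices[5])        # straight-line gap between the tips
geodesic = geodesic_distances(vertices, edges, 0)[5]   # path that stays on the shape
print(f"Euclidean: {euclidean:.2f}, geodesic: {geodesic:.2f}")
```

The two tips are nearly touching in space but far apart along the shape, which is the kind of ambiguity a purely coordinate-based point feature cannot resolve.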

Helix4D: Complex 4D Mesh Generation

May 25, 2026

Jiraphon Yenphraphai, Jianqi Chen, Jian Wang, Gordon Qian, Sergey Tulyakov, Rameen Abdal, Raymond A. Yeh, Peter Wonka, Chaoyang Wang

Abstract: Current video-to-4D methods struggle with complex topology changes, transparent materials, thin structures, and inner surfaces. We present Helix4D, a dynamic mesh generation framework that inherits the expressive representation of Trellis2 and adapts it from image-to-3D to video-conditioned 4D generation. Our design arises from two key questions: (a) how to enable Trellis2's frame-local attention to share information across frames while preserving its pretrained quality on rare cases such as transparent objects and inner surfaces, and (b) how to inject temporal information into a purely 3D positional encoding without breaking pretrained capabilities. We address (a) with sliding-window cross-frame attention anchored on the first frame. The first frame is generated by the base Trellis2 model and injected into our model, letting it inherit Trellis2's quality in rare cases through cross-frame attention. We address (b) with a 4D temporal encoding that repurposes redundant low-frequency spatial RoPE bands for time, extending the encoding from 3D with no additional parameters. Extensive experiments show the effectiveness of Helix4D for high-quality dynamic mesh generation on ActionBench and our own challenging complex dynamics set.

* Project page: https://snap-research.github.io/helix4d/
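
The Helix4D abstract describes repurposing low-frequency spatial RoPE bands for time. A minimal sketch of that idea follows, under stated assumptions: the band count, the frequencies, the choice of exactly one reassigned band per axis, and every name here (`rope_rotate`, `band_axis_4d`, `score`) are illustrative and not taken from the paper.

```python
import cmath

def rope_rotate(pairs, position, band_axis, band_freq):
    """Rotate complex feature pair i by band_freq[i] * position[band_axis[i]] radians."""
    return [
        z * cmath.exp(1j * band_freq[i] * position[band_axis[i]])
        for i, z in enumerate(pairs)
    ]

def attention_score(q, k):
    """Real part of the Hermitian inner product of two rotated feature vectors."""
    return sum(a * b.conjugate() for a, b in zip(q, k)).real

# Hypothetical layout: four bands per spatial axis (x=0, y=1, z=2).
freqs_per_axis = [1.0, 0.5, 0.25, 0.125]
band_freq = freqs_per_axis * 3
band_axis_3d = [axis for axis in range(3) for _ in freqs_per_axis]

# Assumed 4D variant: the lowest-frequency band of each axis now reads time (axis 3).
lowest = min(freqs_per_axis)
band_axis_4d = [3 if f == lowest else a for a, f in zip(band_axis_3d, band_freq)]

# Arbitrary deterministic query/key features, one complex number per band.
q_raw = [complex(1 + 0.1 * i, 0.5 - 0.05 * i) for i in range(12)]
k_raw = [complex(0.3 * i - 1, 0.2 * i) for i in range(12)]

def score(pos_q, pos_k, band_axis):
    """Attention score between a query at pos_q and a key at pos_k, both (x, y, z, t)."""
    q = rope_rotate(q_raw, pos_q, band_axis, band_freq)
    k = rope_rotate(k_raw, pos_k, band_axis, band_freq)
    return attention_score(q, k)

# The 3D layout never reads t; the 4D layout does, with the same number of bands.
print(score((1, 2, 3, 0), (0, 1, 1, 0), band_axis_3d), score((1, 2, 3, 5), (0, 1, 1, 0), band_axis_3d))
print(score((1, 2, 3, 0), (0, 1, 1, 0), band_axis_4d), score((1, 2, 3, 5), (0, 1, 1, 0), band_axis_4d))
```

Because only the band-to-axis assignment changes, the feature dimension stays the same, and the score still depends only on relative offsets, now including the time offset.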

PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning

Dec 11, 2025

Jianqi Chen, Biao Zhang, Xiangjun Tang, Peter Wonka

Abstract: 6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate our state-of-the-art performance, yielding an average AR improvement of 5.1% over prior methods and achieving up to 17.6% gains on individual datasets, indicating strong generalization to unseen objects.

* Project page: https://windvchen.github.io/PoseGAM/

V2M4: 4D Mesh Animation Reconstruction from a Single Monocular Video

Mar 11, 2025

Jianqi Chen, Biao Zhang, Xiangjun Tang, Peter Wonka

Abstract: We present V2M4, a novel 4D reconstruction method that directly generates a usable 4D mesh animation asset from a single monocular video. Unlike existing approaches that rely on priors from multi-view image and video generation models, our method is based on native 3D mesh generation models. Naively applying 3D mesh generation models to generate a mesh for each frame in a 4D task can lead to issues such as incorrect mesh poses, misalignment of mesh appearance, and inconsistencies in mesh geometry and texture maps. To address these problems, we propose a structured workflow that includes camera search and mesh reposing, condition embedding optimization for mesh appearance refinement, pairwise mesh registration for topology consistency, and global texture map optimization for texture consistency. Our method outputs high-quality 4D animated assets that are compatible with mainstream graphics and game software. Experimental results across a variety of animation types and motion amplitudes demonstrate the generalization and effectiveness of our method.

* Project page: https://windvchen.github.io/V2M4/

StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration

Nov 07, 2024

Panwen Hu, Jin Jiang, Jianqi Chen, Mingfei Han, Shengcai Liao, Xiaojun Chang, Xiaodan Liang

Abstract: The advent of AI-Generated Content (AIGC) has spurred research into automated video generation to streamline conventional processes. However, automating storytelling video production, particularly for customized narratives, remains challenging due to the complexity of maintaining subject consistency across shots. While existing approaches like Mora and AesopAgent integrate multiple agents for Story-to-Video (S2V) generation, they fall short in preserving protagonist consistency and supporting Customized Storytelling Video Generation (CSVG). To address these limitations, we propose StoryAgent, a multi-agent framework designed for CSVG. StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process. Notably, our framework includes agents for story design, storyboard generation, video creation, agent coordination, and result evaluation. Leveraging the strengths of different models, StoryAgent enhances control over the generation process, significantly improving character consistency. Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency, while a novel storyboard generation pipeline is proposed to maintain subject consistency across shots. Extensive experiments demonstrate the effectiveness of our approach in synthesizing highly consistent storytelling videos, outperforming state-of-the-art methods. Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.

Sitcom-Crafter: A Plot-Driven Human Motion Generation System in 3D Scenes

Oct 14, 2024

Jianqi Chen, Panwen Hu, Xiaojun Chang, Zhenwei Shi, Michael Christian Kampffmeyer, Xiaodan Liang

Abstract: Recent advancements in human motion synthesis have focused on specific types of motions, such as human-scene interaction, locomotion, or human-human interaction; however, there is a lack of a unified system capable of generating a diverse combination of motion types. In response, we introduce Sitcom-Crafter, a comprehensive and extendable system for human motion generation in 3D space, which can be guided by extensive plot contexts to enhance workflow efficiency for anime and game designers. The system comprises eight modules, three of which are dedicated to motion generation, while the remaining five are augmentation modules that ensure consistent fusion of motion sequences and system functionality. Central to the generation modules is our novel 3D scene-aware human-human interaction module, which addresses collision issues by synthesizing implicit 3D Signed Distance Function (SDF) points around motion spaces, thereby minimizing human-scene collisions without additional data collection costs. Complementing this, our locomotion and human-scene interaction modules leverage existing methods to enrich the system's motion generation capabilities. Augmentation modules encompass plot comprehension for command generation, motion synchronization for seamless integration of different motion types, hand pose retrieval to enhance motion realism, motion collision revision to prevent human collisions, and 3D retargeting to ensure visual fidelity. Experimental evaluations validate the system's ability to generate high-quality, diverse, and physically realistic motions, underscoring its potential for advancing creative workflows.

* Code Page: https://github.com/WindVChen/Sitcom-Crafter

Prototypical Information Bottlenecking and Disentangling for Multimodal Cancer Survival Prediction

Jan 03, 2024

Yilan Zhang, Yingxue Xu, Jianqi Chen, Fengying Xie, Hao Chen

Abstract: Multimodal learning significantly benefits cancer survival prediction, especially the integration of pathological images and genomic data. Despite these advantages, massive redundancy in multimodal data prevents it from extracting discriminative and compact information: (1) An extensive amount of intra-modal task-unrelated information blurs discriminability, especially for gigapixel whole slide images (WSIs) with many patches in pathology and thousands of pathways in genomic data, leading to an "intra-modal redundancy" issue. (2) Duplicated information among modalities dominates the representation of multimodal data, which makes modality-specific information prone to being ignored, resulting in an "inter-modal redundancy" issue. To address these, we propose a new framework, Prototypical Information Bottlenecking and Disentangling (PIBD), consisting of a Prototypical Information Bottleneck (PIB) module for intra-modal redundancy and a Prototypical Information Disentanglement (PID) module for inter-modal redundancy. Specifically, a variant of information bottleneck, PIB, is proposed to model prototypes approximating a bunch of instances for different risk levels, which can be used for selection of discriminative instances within a modality. The PID module decouples entangled multimodal data into compact distinct components: modality-common and modality-specific knowledge, under the guidance of the joint prototypical distribution. Extensive experiments on five cancer benchmark datasets demonstrate that PIBD outperforms other methods.

Zero-Shot Image Harmonization with Generative Model Prior

Jul 17, 2023

Jianqi Chen, Zhengxia Zou, Yilan Zhang, Keyan Chen, Zhenwei Shi

Abstract: Recent image harmonization methods have demonstrated promising results. However, due to their heavy reliance on a large number of composite images, these works are expensive in the training phase and often fail to generalize to unseen images. In this paper, we draw lessons from human behavior and propose a zero-shot image harmonization method. Specifically, in the harmonization process, a human mainly relies on a long-term prior of harmonious images and adjusts a composite image toward that prior. To imitate that, we resort to pretrained generative models for the prior of natural images. To guide the harmonization direction, we propose an Attention-Constraint Text, which is optimized to describe the image environments well. Some further designs are introduced for preserving the foreground content structure. The resulting framework, highly consistent with human behavior, can achieve harmonious results without burdensome training. Extensive experiments have demonstrated the effectiveness of our approach, and we have also explored some interesting applications.

* Code Page: https://github.com/WindVChen/Diff-Harmonization

ECL: Class-Enhancement Contrastive Learning for Long-tailed Skin Lesion Classification

Jul 09, 2023

Yilan Zhang, Jianqi Chen, Ke Wang, Fengying Xie

Abstract: Skin image datasets often suffer from imbalanced data distribution, exacerbating the difficulty of computer-aided skin disease diagnosis. Some recent works exploit supervised contrastive learning (SCL) for this long-tailed challenge. Despite achieving significant performance, these SCL-based methods focus more on head classes while ignoring the information in tail classes. In this paper, we propose Class-Enhancement Contrastive Learning (ECL), which enriches the information of minority classes and treats different classes equally. For information enhancement, we design a hybrid-proxy model to generate class-dependent proxies and propose a cycle update strategy for parameter optimization. A balanced-hybrid-proxy loss is designed to exploit relations between samples and proxies with different classes treated equally. Taking both "imbalanced data" and "imbalanced diagnosis difficulty" into account, we further present a balanced-weighted cross-entropy loss following a curriculum learning schedule. Experimental results on the classification of imbalanced skin lesion data have demonstrated the superiority and effectiveness of our method.
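
The ECL abstract does not spell out its balanced-weighted cross-entropy loss, so the sketch below shows only the generic idea of reweighting cross-entropy by class frequency on long-tailed data. The inverse-frequency weighting, the class counts, and both function names are assumptions for illustration; the curriculum schedule mentioned in the abstract is not modeled.

```python
import math

def class_weights(class_counts):
    """Inverse-frequency weights, normalized so they average to 1."""
    inverse = [1.0 / c for c in class_counts]
    scale = len(inverse) / sum(inverse)
    return [w * scale for w in inverse]

def weighted_cross_entropy(logits, label, weights):
    """Softmax cross-entropy for one sample, scaled by the weight of its true class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return weights[label] * (log_sum_exp - logits[label])

# Hypothetical long-tailed dataset: one head class and two progressively rarer classes.
counts = [900, 90, 10]
weights = class_weights(counts)

# With identical (uninformative) logits, only the class weight changes the loss.
loss_head = weighted_cross_entropy([0.0, 0.0, 0.0], 0, weights)
loss_tail = weighted_cross_entropy([0.0, 0.0, 0.0], 2, weights)
print(weights, loss_head, loss_tail)
```

A mistake on the rarest class costs far more than the same mistake on the head class, which counteracts the tendency of plain cross-entropy to favor head classes.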

Diffusion Models for Imperceptible and Transferable Adversarial Attack

May 14, 2023

Jianqi Chen, Hao Chen, Keyan Chen, Yilan Zhang, Zhengxia Zou, Zhenwei Shi

Abstract: Many existing adversarial attacks generate $L_p$-norm perturbations on image RGB space. Despite some achievements in transferability and attack success rate, the crafted adversarial examples are easily perceived by human eyes. Towards visual imperceptibility, some recent works explore unrestricted attacks without $L_p$-norm constraints, yet they lack transferability when attacking black-box models. In this work, we propose a novel imperceptible and transferable attack by leveraging both the generative and discriminative power of diffusion models. Specifically, instead of direct manipulation in pixel space, we craft perturbations in the latent space of diffusion models. Combined with well-designed content-preserving structures, we can generate human-insensitive perturbations embedded with semantic clues. For better transferability, we further "deceive" the diffusion model, which can be viewed as an additional recognition surrogate, by distracting its attention away from the target regions. To our knowledge, our proposed method, DiffAttack, is the first that introduces diffusion models into the adversarial attack field. Extensive experiments on various model structures (including CNNs, Transformers, MLPs) and defense methods have demonstrated our superiority over other attack methods.

* Code Page: https://github.com/WindVChen/DiffAttack
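
For context on the $L_p$-norm perturbations that the DiffAttack abstract contrasts itself with, here is a minimal sketch of the $p = \infty$ case: keeping an adversarial image within a fixed per-pixel budget of the original. This is the conventional pixel-space constraint, not DiffAttack's latent-space method; the function names, pixel values, and the budget of 8/255 are illustrative assumptions.

```python
def project_linf(original, perturbed, epsilon):
    """Clamp `perturbed` so each pixel stays within epsilon of `original` and in [0, 1]."""
    projected = []
    for x, x_adv in zip(original, perturbed):
        delta = max(-epsilon, min(epsilon, x_adv - x))   # enforce the L-infinity budget
        projected.append(max(0.0, min(1.0, x + delta)))  # keep a valid pixel intensity
    return projected

def linf_norm(a, b):
    """Largest absolute per-pixel difference between two images."""
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical flattened image with intensities in [0, 1], and an unconstrained update.
original = [0.2, 0.5, 0.98]
perturbed = [0.5, 0.49, 1.2]
epsilon = 8 / 255

projected = project_linf(original, perturbed, epsilon)
print(projected, linf_norm(original, projected))
```

The budget bounds how far any single pixel can move, but it says nothing about where the change lands semantically, which is the gap that unrestricted and latent-space attacks aim at.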
