Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Mingjin Chen

LottieGPT: Tokenizing Vector Animation for Autoregressive Generation

Apr 13, 2026

Junhao Chen, Kejun Gao, Yuehan Cui, Mingze Sun, Mingjin Chen, Shaohui Wang, Xiaoxiao Long, Fei Ma, Qi Tian, Ruqi Huang(+1 more)

Abstract:Despite rapid progress in video generation, existing models are incapable of producing vector animation, a dominant and highly expressive form of multimedia on the Internet. Vector animations offer resolution-independence, compactness, semantic structure, and editable parametric motion representations, yet current generative models operate exclusively in raster space and thus cannot synthesize them. Meanwhile, recent advances in large multimodal models demonstrate strong capabilities in generating structured data such as slides, 3D meshes, LEGO sequences, and indoor layouts, suggesting that native vector animation generation may be achievable. In this work, we present the first framework for tokenizing and autoregressively generating vector animations. We adopt Lottie, a widely deployed JSON-based animation standard, and design a tailored Lottie Tokenizer that encodes layered geometric primitives, transforms, and keyframe-based motion into a compact and semantically aligned token sequence. To support large-scale training, we also construct LottieAnimation-660K, the largest and most diverse vector animation dataset to date, consisting of 660k real-world Lottie animation and 15M static Lottie image files curated from broad Internet sources. Building upon these components, we finetune Qwen-VL to create LottieGPT, a native multimodal model capable of generating coherent, editable vector animations directly from natural language or visual prompts. Experiments show that our tokenizer dramatically reduces sequence length while preserving structural fidelity, enabling effective autoregressive learning of dynamic vector content. LottieGPT exhibits strong generalization across diverse animation styles and outperforms previous state-of-the-art models on SVG generation (a special case of single-frame vector animation).

* Accepted by CVPR 2026. Project Page: https://lottiegpt.github.io/

Via

Access Paper or Ask Questions

HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis

Mar 31, 2026

Mingjin Chen, Junhao Chen, Zhaoxin Fan, Yujian Lee, Zichen Dang, Lili Wang, Yawen Cui, Lap-Pui Chau, Yi Wang

Abstract:Recent methods have made notable progress in the visual quality of hand-object interaction video synthesis. However, most approaches rely on 2D control signals that lack spatial expressiveness and limit the utilization of synthetic 3D conditional data. To address these limitations, we propose HVG-3D, a unified framework for 3D-aware hand-object interaction (HOI) video synthesis conditioned on explicit 3D representations. Specifically, we develop a diffusion-based architecture augmented with a 3D ControlNet, which encodes geometric and motion cues from 3D inputs to enable explicit 3D reasoning during video synthesis. To achieve high-quality synthesis, HVG-3D is designed with two core components: (i) a 3D-aware HOI video generation diffusion architecture that encodes geometric and motion cues from 3D inputs for explicit 3D reasoning; and (ii) a hybrid pipeline for constructing input and condition signals, enabling flexible and precise control during both training and inference. During inference, given a single real image and a 3D control signal from either simulation or real data, HVG-3D generates high-fidelity, temporally consistent videos with precise spatial and temporal control. Experiments on the TASTE-Rob dataset demonstrate that HVG-3D achieves state-of-the-art spatial fidelity, temporal coherence, and controllability, while enabling effective utilization of both real and simulated data.

* Project page: https://hvg3d.github.io/

Via

Access Paper or Ask Questions

DanceTogether! Identity-Preserving Multi-Person Interactive Video Generation

May 23, 2025

Junhao Chen, Mingjin Chen, Jianjin Xu, Xiang Li, Junting Dong, Mingze Sun, Puhua Jiang, Hongxiang Li, Yuhang Yang, Hao Zhao(+2 more)

Abstract:Controllable video generation (CVG) has advanced rapidly, yet current systems falter when more than one actor must move, interact, and exchange positions under noisy control signals. We address this gap with DanceTogether, the first end-to-end diffusion framework that turns a single reference image plus independent pose-mask streams into long, photorealistic videos while strictly preserving every identity. A novel MaskPoseAdapter binds "who" and "how" at every denoising step by fusing robust tracking masks with semantically rich-but noisy-pose heat-maps, eliminating the identity drift and appearance bleeding that plague frame-wise pipelines. To train and evaluate at scale, we introduce (i) PairFS-4K, 26 hours of dual-skater footage with 7,000+ distinct IDs, (ii) HumanRob-300, a one-hour humanoid-robot interaction set for rapid cross-domain transfer, and (iii) TogetherVideoBench, a three-track benchmark centered on the DanceTogEval-100 test suite covering dance, boxing, wrestling, yoga, and figure skating. On TogetherVideoBench, DanceTogether outperforms the prior arts by a significant margin. Moreover, we show that a one-hour fine-tune yields convincing human-robot videos, underscoring broad generalization to embodied-AI and HRI tasks. Extensive ablations confirm that persistent identity-action binding is critical to these gains. Together, our model, datasets, and benchmark lift CVG from single-subject choreography to compositionally controllable, multi-actor interaction, opening new avenues for digital production, simulation, and embodied intelligence. Our video demos and code are available at https://DanceTog.github.io/.

* Our video demos and code are available at https://DanceTog.github.io/

Via

Access Paper or Ask Questions

Ultraman: Single Image 3D Human Reconstruction with Ultra Speed and Detail

Mar 18, 2024

Mingjin Chen, Junhao Chen, Xiaojun Ye, Huan-ang Gao, Xiaoxue Chen, Zhaoxin Fan, Hao Zhao

Figure 1 for Ultraman: Single Image 3D Human Reconstruction with Ultra Speed and Detail

Figure 2 for Ultraman: Single Image 3D Human Reconstruction with Ultra Speed and Detail

Figure 3 for Ultraman: Single Image 3D Human Reconstruction with Ultra Speed and Detail

Figure 4 for Ultraman: Single Image 3D Human Reconstruction with Ultra Speed and Detail

Abstract:3D human body reconstruction has been a challenge in the field of computer vision. Previous methods are often time-consuming and difficult to capture the detailed appearance of the human body. In this paper, we propose a new method called \emph{Ultraman} for fast reconstruction of textured 3D human models from a single image. Compared to existing techniques, \emph{Ultraman} greatly improves the reconstruction speed and accuracy while preserving high-quality texture details. We present a set of new frameworks for human reconstruction consisting of three parts, geometric reconstruction, texture generation and texture mapping. Firstly, a mesh reconstruction framework is used, which accurately extracts 3D human shapes from a single image. At the same time, we propose a method to generate a multi-view consistent image of the human body based on a single image. This is finally combined with a novel texture mapping method to optimize texture details and ensure color consistency during reconstruction. Through extensive experiments and evaluations, we demonstrate the superior performance of \emph{Ultraman} on various standard datasets. In addition, \emph{Ultraman} outperforms state-of-the-art methods in terms of human rendering quality and speed. Upon acceptance of the article, we will make the code and data publicly available.

* Project Page: https://air-discover.github.io/Ultraman/

Via

Access Paper or Ask Questions

Abdominal multi-organ segmentation in CT using Swinunter

Sep 28, 2023

Mingjin Chen, Yongkang He, Yongyi Lu

Figure 1 for Abdominal multi-organ segmentation in CT using Swinunter

Figure 2 for Abdominal multi-organ segmentation in CT using Swinunter

Figure 3 for Abdominal multi-organ segmentation in CT using Swinunter

Figure 4 for Abdominal multi-organ segmentation in CT using Swinunter

Abstract:Abdominal multi-organ segmentation in computed tomography (CT) is crucial for many clinical applications including disease detection and treatment planning. Deep learning methods have shown unprecedented performance in this perspective. However, it is still quite challenging to accurately segment different organs utilizing a single network due to the vague boundaries of organs, the complex background, and the substantially different organ size scales. In this work we used make transformer-based model for training. It was found through previous years' competitions that basically all of the top 5 methods used CNN-based methods, which is likely due to the lack of data volume that prevents transformer-based methods from taking full advantage. The thousands of samples in this competition may enable the transformer-based model to have more excellent results. The results on the public validation set also show that the transformer-based model can achieve an acceptable result and inference time.

* 8pages. arXiv admin note: text overlap with arXiv:2201.01266 by other authors

Via

Access Paper or Ask Questions

Learning to In-paint: Domain Adaptive Shape Completion for 3D Organ Segmentation

Aug 17, 2023

Mingjin Chen, Yongkang He, Yongyi Lu, Zhijing Yang

Figure 1 for Learning to In-paint: Domain Adaptive Shape Completion for 3D Organ Segmentation

Figure 2 for Learning to In-paint: Domain Adaptive Shape Completion for 3D Organ Segmentation

Figure 3 for Learning to In-paint: Domain Adaptive Shape Completion for 3D Organ Segmentation

Figure 4 for Learning to In-paint: Domain Adaptive Shape Completion for 3D Organ Segmentation

Abstract:We aim at incorporating explicit shape information into current 3D organ segmentation models. Different from previous works, we formulate shape learning as an in-painting task, which is named Masked Label Mask Modeling (MLM). Through MLM, learnable mask tokens are fed into transformer blocks to complete the label mask of organ. To transfer MLM shape knowledge to target, we further propose a novel shape-aware self-distillation with both in-painting reconstruction loss and pseudo loss. Extensive experiments on five public organ segmentation datasets show consistent improvements over prior arts with at least 1.2 points gain in the Dice score, demonstrating the effectiveness of our method in challenging unsupervised domain adaptation scenarios including: (1) In-domain organ segmentation; (2) Unseen domain segmentation and (3) Unseen organ segmentation. We hope this work will advance shape analysis and geometric learning in medical imaging.

Via

Access Paper or Ask Questions

A Transformer-based Prediction Method for Depth of Anesthesia During Target-controlled Infusion of Propofol and Remifentanil

Aug 02, 2023

Yongkang He, Siyuan Peng, Mingjin Chen, Zhijing Yang, Yuanhui Chen

Figure 1 for A Transformer-based Prediction Method for Depth of Anesthesia During Target-controlled Infusion of Propofol and Remifentanil

Figure 2 for A Transformer-based Prediction Method for Depth of Anesthesia During Target-controlled Infusion of Propofol and Remifentanil

Figure 3 for A Transformer-based Prediction Method for Depth of Anesthesia During Target-controlled Infusion of Propofol and Remifentanil

Figure 4 for A Transformer-based Prediction Method for Depth of Anesthesia During Target-controlled Infusion of Propofol and Remifentanil

Abstract:Accurately predicting anesthetic effects is essential for target-controlled infusion systems. The traditional (PK-PD) models for Bispectral index (BIS) prediction require manual selection of model parameters, which can be challenging in clinical settings. Recently proposed deep learning methods can only capture general trends and may not predict abrupt changes in BIS. To address these issues, we propose a transformer-based method for predicting the depth of anesthesia (DOA) using drug infusions of propofol and remifentanil. Our method employs long short-term memory (LSTM) and gate residual network (GRN) networks to improve the efficiency of feature fusion and applies an attention mechanism to discover the interactions between the drugs. We also use label distribution smoothing and reweighting losses to address data imbalance. Experimental results show that our proposed method outperforms traditional PK-PD models and previous deep learning methods, effectively predicting anesthetic depth under sudden and deep anesthesia conditions.

Via

Access Paper or Ask Questions

Data-Centric Diet: Effective Multi-center Dataset Pruning for Medical Image Segmentation

Aug 02, 2023

Yongkang He, Mingjin Chen, Zhijing Yang, Yongyi Lu

Figure 1 for Data-Centric Diet: Effective Multi-center Dataset Pruning for Medical Image Segmentation

Figure 2 for Data-Centric Diet: Effective Multi-center Dataset Pruning for Medical Image Segmentation

Figure 3 for Data-Centric Diet: Effective Multi-center Dataset Pruning for Medical Image Segmentation

Figure 4 for Data-Centric Diet: Effective Multi-center Dataset Pruning for Medical Image Segmentation

Abstract:This paper seeks to address the dense labeling problems where a significant fraction of the dataset can be pruned without sacrificing much accuracy. We observe that, on standard medical image segmentation benchmarks, the loss gradient norm-based metrics of individual training examples applied in image classification fail to identify the important samples. To address this issue, we propose a data pruning method by taking into consideration the training dynamics on target regions using Dynamic Average Dice (DAD) score. To the best of our knowledge, we are among the first to address the data importance in dense labeling tasks in the field of medical image analysis, making the following contributions: (1) investigating the underlying causes with rigorous empirical analysis, and (2) determining effective data pruning approach in dense labeling problems. Our solution can be used as a strong yet simple baseline to select important examples for medical image segmentation with combined data sources.

* Accepted by ICML workshops 2023

Via

Access Paper or Ask Questions