Xin Dong

Celine

DreamFit: Garment-Centric Human Generation via a Lightweight Anything-Dressing Encoder

Dec 23, 2024

Ente Lin, Xujie Zhang, Fuwei Zhao, Yuxuan Luo, Xin Dong, Long Zeng, Xiaodan Liang

Abstract:Diffusion models for garment-centric human generation from text or image prompts have attracted growing attention for their great application potential. However, existing methods often face a dilemma: lightweight approaches, such as adapters, are prone to generating inconsistent textures, while finetune-based methods involve high training costs and struggle to maintain the generalization capabilities of pretrained diffusion models, limiting their performance across diverse scenarios. To address these challenges, we propose DreamFit, which incorporates a lightweight Anything-Dressing Encoder specifically tailored for garment-centric human generation. DreamFit has three key advantages: (1) Lightweight training: with the proposed adaptive attention and LoRA modules, DreamFit reduces the model complexity to 83.4M trainable parameters. (2) Anything-Dressing: our model generalizes surprisingly well to a wide range of (non-)garments, creative styles, and prompt instructions, consistently delivering high-quality results across diverse scenarios. (3) Plug-and-play: DreamFit is engineered for smooth integration with any community control plugins for diffusion models, ensuring easy compatibility and minimizing adoption barriers. To further enhance generation quality, DreamFit leverages pretrained large multi-modal models (LMMs) to enrich the prompt with fine-grained garment descriptions, thereby reducing the prompt gap between training and inference. We conduct comprehensive experiments on both $768 \times 512$ high-resolution benchmarks and in-the-wild images. DreamFit surpasses all existing methods, highlighting its state-of-the-art capabilities in garment-centric human generation.

AgentPS: Agentic Process Supervision for Multi-modal Content Quality Assurance through Multi-round QA

Dec 15, 2024

Gorden Liu, Yu Sun, Ruixiao Sun, Xin Dong, Hongyu Xiong

Abstract:The advanced processing and reasoning capabilities of multimodal large language models (MLLMs) have driven substantial progress in vision-language (VL) understanding tasks. However, while effective for tasks governed by straightforward logic, MLLMs often encounter challenges when reasoning over complex, interdependent logic structures. To address this limitation, we introduce AgentPS, a novel framework that integrates Agentic Process Supervision into MLLMs via multi-round question answering during fine-tuning. AgentPS demonstrates significant performance improvements over baseline MLLMs on proprietary TikTok datasets, due to its integration of process supervision and structured sequential reasoning. Furthermore, we show that replacing human-annotated labels with LLM-generated labels retains much of the performance gain, highlighting the framework's practical scalability in industrial applications. These results position AgentPS as a highly effective and efficient architecture for multimodal classification tasks. Its adaptability and scalability, especially when enhanced by automated annotation generation, make it a powerful tool for handling large-scale, real-world challenges.

* 8 pages, 2 figures

COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework

Dec 11, 2024

Xin Dong, Sen Jia, Hongyu Xiong

Abstract:With the emergence of Multimodal Large Language Model (MLLM) technology, it has become possible to exploit its video understanding capability on different classification tasks. In practice, deploying MLLMs online requires a large amount of GPU resources. In this paper, we propose COEF-VQ, a novel cascaded MLLM framework for better video quality understanding on TikTok. To this end, we first propose an MLLM fusing all visual, textual and audio signals, and then develop a cascade framework with a lightweight model as the pre-filtering stage and the MLLM as the fine-consideration stage, significantly reducing the need for GPU resources while retaining the performance demonstrated by the MLLM alone. To demonstrate the effectiveness of COEF-VQ, we deployed this new framework onto the video management platform (VMP) at TikTok, and performed a series of detailed experiments on two in-house tasks related to video quality understanding. We show that COEF-VQ leads to substantial performance gains with limited resource consumption in these two tasks.
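
The cascade idea in this abstract (a lightweight pre-filter followed by an expensive model) can be illustrated with a minimal sketch. This is not the COEF-VQ implementation: the function names, the two thresholds, and the rule that low-scoring items are rejected without calling the heavy model are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def cascade_classify(
    items: Sequence[T],
    light_score: Callable[[T], float],   # hypothetical cheap model
    heavy_score: Callable[[T], float],   # hypothetical expensive model (e.g. an MLLM)
    prefilter_threshold: float = 0.3,    # assumed value, not from the paper
    decision_threshold: float = 0.5,     # assumed value, not from the paper
) -> Tuple[List[bool], int]:
    """Return one label per item and the number of heavy-model calls."""
    labels: List[bool] = []
    heavy_calls = 0
    for item in items:
        # Stage 1: the cheap model rejects items it scores as clear negatives.
        if light_score(item) < prefilter_threshold:
            labels.append(False)
            continue
        # Stage 2: only the remaining items reach the expensive model.
        heavy_calls += 1
        labels.append(heavy_score(item) >= decision_threshold)
    return labels, heavy_calls
```

The second return value makes the saving visible: the fewer items pass the pre-filter, the fewer expensive calls are made.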

Hymba: A Hybrid-head Architecture for Small Language Models

Nov 20, 2024

Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara(+3 more)

Abstract:We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. This model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: Our Hymba-1.5B-Base model surpasses all sub-2B public models in performance and even outperforms Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x cache size reduction, and 3.49x throughput.

* 20 pages, models are available on huggingface

Design, manufacturing, and inverse dynamic modeling of soft parallel robots actuated by dielectric elastomer actuators

Sep 30, 2024

Jung-Che Chang, Xi Wang, Dragos Axinte, Xin Dong

Abstract:Soft parallel robots, with their manipulation safety and low commercial cost, show a promising future for delicate operations and safe human-robot interactions. However, promoting the use of electroactive polymers (EAPs) is still challenging because product quality remains insufficient and the collaboration between multiple actuators is difficult to model dynamically. This article presents the design, fabrication, modelling and control of a parallel kinematics Delta robot actuated by dielectric elastomer actuators (DEAs). The trade-off between the actuation force and stroke is addressed by an angular stroke amplification mechanism, and the weight of the robot frame is reduced by utilizing 3D puzzling strip structures. A generic way of constructing a high-stability conductive paint on a silicon-based film has been achieved by laser scanning the DE-film and then sandwiching a conductive particle-based electrode with a paint made by mixing the particles with photosensitive resin. Compared to the widely used carbon grease, the fabricated electrode shows a higher consistency in its dynamic behaviour before and after the on-stand test. Finally, to predict the output force and inverse motion of the robot end effector, we constructed the inverse dynamic model by introducing an expanded Bergstrom-Boyce model to the constitutive behavior of the dielectric film. The experimental results show a prediction of robot output force with an RMSE of 12.4% when the end effector remains stationary, and a well-followed trajectory with an RMSE of less than 2.5%.

* 17 pages, 12 figures

Bi-stable thin soft robot for in-plane locomotion in narrow space

Sep 30, 2024

Xi Wang, Jung-che Chang, Feiran Wang, Dragos Axinte, Xin Dong

Abstract:Dielectric elastomer actuators (DEAs), also recognized as artificial muscles, have been widely developed for soft locomotion robots. With their compliant skeleton and miniaturized dimensions, they are well suited for narrow space inspection. In this work, we propose a novel low-profile (1.1mm) and lightweight (1.8g) bi-stable in-plane DEA (Bi-DEA) constructed by supporting a dielectric elastomer on a flat bi-stable mechanism. It has an amplified displacement and output force compared with the in-plane DEA (I-DEA) without the bi-stable mechanism. Then, the Bi-DEA is applied to a thin soft robot, using three electrostatic adhesive pads (EA-Pads) as anchoring elements. This robot is capable of crawling and climbing to access millimetre-scale narrow gaps. Theoretical models of the bi-stable mechanism and the DEA are presented. The enhanced performance of the Bi-DEA induced by the mechanism is experimentally validated. The EA-Pad provides the adhesion between the actuator and the locomotion substrate, allowing crawling and climbing on various surfaces, i.e., paper and acrylic. The thin soft robot has been demonstrated to be capable of crawling through a 4mm narrow gap with a speed up to 3.3mm/s (0.07 body length per second and 2.78 body thickness per second).

* 8 pages, 12 figures

A Simple Background Augmentation Method for Object Detection with Diffusion Model

Aug 01, 2024

Yuhang Li, Xin Dong, Chen Chen, Weiming Zhuang, Lingjuan Lyu

Abstract:In computer vision, it is well-known that a lack of data diversity will impair model performance. In this study, we address the challenge of enhancing dataset diversity in order to benefit various downstream tasks such as object detection and instance segmentation. We propose a simple yet effective data augmentation approach by leveraging advancements in generative models, specifically text-to-image synthesis technologies like Stable Diffusion. Our method focuses on generating variations of labeled real images, utilizing generative object and background augmentation via inpainting to augment existing training data without the need for additional annotations. We find that background augmentation, in particular, significantly improves the models' robustness and generalization capabilities. We also investigate how to adjust the prompt and mask to ensure the generated content complies with the existing annotations. The efficacy of our augmentation techniques is validated through comprehensive evaluations on the COCO dataset and several other key object detection benchmarks, demonstrating notable enhancements in model performance across diverse scenarios. This approach offers a promising solution to the challenges of dataset enhancement, contributing to the development of more accurate and robust computer vision models.

A deeper look at depth pruning of LLMs

Jul 23, 2024

Shoaib Ahmed Siddiqui, Xin Dong, Greg Heinrich, Thomas Breuel, Jan Kautz, David Krueger, Pavlo Molchanov

Abstract:Large Language Models (LLMs) are not only resource-intensive to train but even more costly to deploy in production. Therefore, recent work has attempted to prune blocks of LLMs based on cheap proxies for estimating block importance, effectively removing 10% of blocks in well-trained LLaMa-2 and Mistral 7b models without any significant degradation of downstream metrics. In this paper, we explore different block importance metrics by considering adaptive metrics such as Shapley value in addition to static ones explored in prior work. We show that adaptive metrics exhibit a trade-off in performance between tasks, i.e., improvement on one task may degrade performance on the other due to differences in the computed block influences. Furthermore, we extend this analysis from a complete block to individual self-attention and feed-forward layers, highlighting the propensity of the self-attention layers to be more amenable to pruning, even allowing removal of up to 33% of the self-attention layers without incurring any performance degradation on MMLU for Mistral 7b (a significant reduction in the costly maintenance of the KV-cache). Finally, we look at simple performance recovery techniques to emulate the pruned layers by training lightweight additive bias or low-rank linear adapters. Performance recovery using emulated updates avoids performance degradation for the initial blocks (up to 5% absolute improvement on MMLU), which is either competitive or superior to the learning-based technique.
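
As a rough illustration of emulating a pruned residual block with an additive bias, the sketch below replaces the block's update with its average over a small calibration set. This is one plausible reading of "emulated updates", not the paper's method: the bias here is estimated as a plain mean rather than trained, and `mean_update_bias`, `pruned_block`, and the toy block function are hypothetical names.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_update_bias(
    calibration_inputs: Sequence[Vector],
    block_update: Callable[[Vector], Vector],  # f(x) in the residual form x + f(x)
) -> Vector:
    """Average the block's residual update over the calibration inputs."""
    dim = len(calibration_inputs[0])
    total = [0.0] * dim
    for x in calibration_inputs:
        update = block_update(x)
        for i in range(dim):
            total[i] += update[i]
    return [t / len(calibration_inputs) for t in total]


def pruned_block(x: Vector, bias: Vector) -> Vector:
    """Stand-in for the removed block: x + b instead of x + f(x)."""
    return [xi + bi for xi, bi in zip(x, bias)]


# Toy usage: a block whose update is half of its input (assumed for illustration).
bias = mean_update_bias([[1.0, 2.0], [3.0, 4.0]], lambda x: [0.5 * v for v in x])
emulated = pruned_block([0.0, 0.0], bias)
```

A constant bias cannot capture input-dependent behaviour, which is presumably why the abstract also mentions low-rank linear adapters as a richer alternative.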

TCM-FTP: Fine-Tuning Large Language Models for Herbal Prescription Prediction

Jul 15, 2024

Xingzhi Zhou, Xin Dong, Chunhao Li, Yuning Bai, Yulong Xu, Ka Chun Cheung, Simon See, Xinpeng Song, Runshun Zhang, Xuezhong Zhou(+1 more)

Abstract:Traditional Chinese medicine (TCM) relies on specific combinations of herbs in prescriptions to treat symptoms and signs, a practice that spans thousands of years. Predicting TCM prescriptions presents a fascinating technical challenge with practical implications. However, this task faces limitations due to the scarcity of high-quality clinical datasets and the intricate relationship between symptoms and herbs. To address these issues, we introduce DigestDS, a new dataset containing practical medical records from experienced experts in digestive system diseases. We also propose a method, TCM-FTP (TCM Fine-Tuning Pre-trained), to leverage pre-trained large language models (LLMs) through supervised fine-tuning on DigestDS. Additionally, we enhance computational efficiency using a low-rank adaptation technique. TCM-FTP also incorporates data augmentation by permuting herbs within prescriptions, capitalizing on their order-agnostic properties. Impressively, TCM-FTP achieves an F1-score of 0.8031, surpassing previous methods significantly. Furthermore, it demonstrates remarkable accuracy in dosage prediction, achieving a normalized mean square error of 0.0604. In contrast, LLMs without fine-tuning perform poorly. Although LLMs have shown capabilities on a wide range of tasks, this work illustrates the importance of fine-tuning for TCM prescription prediction, and we have proposed an effective way to do that.
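
The order-agnostic augmentation mentioned in this abstract can be sketched as follows. The record layout (a list of herb and dosage pairs), the function name, and the example herbs are assumptions for illustration; the abstract only states that herbs are permuted within prescriptions.

```python
import random
from typing import List, Tuple

Prescription = List[Tuple[str, float]]  # (herb name, dosage); hypothetical layout


def permute_prescription(
    prescription: Prescription, n_augment: int, seed: int = 0
) -> List[Prescription]:
    """Return n_augment shuffled copies of a prescription.

    Each copy contains the same herb/dosage pairs in a different order,
    so the training target changes only in its surface ordering.
    """
    rng = random.Random(seed)  # seeded for reproducible augmentation
    augmented = []
    for _ in range(n_augment):
        shuffled = list(prescription)  # copy so the original is left untouched
        rng.shuffle(shuffled)
        augmented.append(shuffled)
    return augmented


# Toy usage with made-up herbs and dosages.
original = [("ginger", 6.0), ("licorice", 3.0), ("jujube", 9.0)]
variants = permute_prescription(original, n_augment=4)
```

Because a shuffle can reproduce the original order by chance, a real pipeline might deduplicate the variants; that step is omitted here.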

Unraveling the Mechanics of Learning-Based Demonstration Selection for In-Context Learning

Jun 14, 2024

Hui Liu, Wenya Wang, Hao Sun, Chris Xing Tian, Chenqi Kong, Xin Dong, Haoliang Li

Abstract:Large Language Models (LLMs) have demonstrated impressive in-context learning (ICL) capabilities from few-shot demonstration exemplars. While recent learning-based demonstration selection methods have proven beneficial to ICL by choosing more useful exemplars, their underlying mechanisms are opaque, hindering efforts to address limitations such as high training costs and poor generalization across tasks. These methods generally assume the selection process captures similarities between the exemplar and the target instance; however, it remains unknown what kinds of similarities are captured and vital to performing ICL. To investigate this question, we analyze the working mechanisms of the learning-based demonstration selection methods and empirically identify two important factors related to similarity measurement: 1) The ability to integrate different levels of task-agnostic text similarities between the input of exemplars and test cases enhances generalization power across different tasks. 2) Incorporating task-specific labels when measuring the similarities significantly improves the performance on each specific task. We validate these two findings through extensive quantitative and qualitative analyses across ten datasets and various LLMs. Based on our findings, we introduce two effective yet simplified exemplar selection methods catering to task-agnostic and task-specific demands, eliminating the costly LLM inference overhead.
