Tao Chen

IEEE Fellow

FLDM-VTON: Faithful Latent Diffusion Model for Virtual Try-on

Apr 22, 2024

Chenhui Wang, Tao Chen, Zhihao Chen, Zhizhong Huang, Taoran Jiang, Qi Wang, Hongming Shan

Abstract:Despite their impressive generative performance, latent diffusion model-based virtual try-on (VTON) methods lack faithfulness to crucial details of the clothes, such as style, pattern, and text. To alleviate these issues caused by the diffusion stochastic nature and latent supervision, we propose a novel Faithful Latent Diffusion Model for VTON, termed FLDM-VTON. FLDM-VTON improves the conventional latent diffusion process in three major aspects. First, we propose incorporating warped clothes as both the starting point and local condition, supplying the model with faithful clothes priors. Second, we introduce a novel clothes flattening network to constrain generated try-on images, providing clothes-consistent faithful supervision. Third, we devise a clothes-posterior sampling for faithful inference, further enhancing the model performance over conventional clothes-agnostic Gaussian sampling. Extensive experimental results on the benchmark VITON-HD and Dress Code datasets demonstrate that our FLDM-VTON outperforms state-of-the-art baselines and is able to generate photo-realistic try-on images with faithful clothing details.

* Accepted by IJCAI 2024

Multimodal Emotion Recognition by Fusing Video Semantic in MOOC Learning Scenarios

Apr 11, 2024

Yuan Zhang, Xiaomei Tao, Hanxu Ai, Tao Chen, Yanling Gan

Abstract:In the Massive Open Online Courses (MOOC) learning scenario, learners mainly acquire knowledge by watching instructional videos, and the semantic information in those videos directly affects their emotional states. However, few studies have paid attention to the potential influence of the semantic information of instructional videos on learners' emotional states. To explore the impact of video semantic information on learners' emotions, this paper proposes a multimodal emotion recognition method that fuses video semantic information and physiological signals. We generate video descriptions through a pre-trained large language model (LLM) to obtain high-level semantic information about instructional videos. Using the cross-attention mechanism for modal interaction, the semantic information is fused with the eye movement and photoplethysmography (PPG) signals to obtain features containing the critical information of the three modalities. Learners' emotional states are then recognized by the emotion classifier. The experimental results show that our method significantly improves emotion recognition performance, providing a new perspective and an efficient method for emotion recognition research in MOOC learning scenarios. The proposed method not only contributes to a deeper understanding of the impact of instructional videos on learners' emotional states but also provides a useful reference for future research on emotion recognition in MOOC learning scenarios.

Adapting Multi-objectivized Software Configuration Tuning

Apr 06, 2024

Tao Chen, Miqing Li

Abstract:When tuning software configuration for better performance (e.g., latency or throughput), an important issue that many optimizers face is the presence of local optimum traps, compounded by a highly rugged configuration landscape and expensive measurements. To mitigate these issues, a recent effort has shifted to focus on the level of optimization model (called meta multi-objectivization or MMO) instead of designing better optimizers as in traditional methods. This is done by using an auxiliary performance objective, together with the target performance objective, to help the search jump out of local optima. While effective, MMO needs a fixed weight to balance the two objectives, a parameter that has been found to be crucial as there is a large deviation in performance between the best and the other settings. However, given the variety of configurable software systems, the "sweet spot" of the weight can vary dramatically in different cases and it is not possible to find the right setting without time-consuming trial and error. In this paper, we seek to overcome this significant shortcoming of MMO by proposing a weight adaptation method, dubbed AdMMO. Our key idea is to adaptively adjust the weight at the right time during tuning, such that a good proportion of the nondominated configurations can be maintained. Moreover, we design a partial duplicate retention mechanism to handle the issue of too many duplicate configurations without losing the rich information provided by the "good" duplicates. Experiments on several real-world systems, objectives, and budgets show that, for 71% of the cases, AdMMO is considerably superior to MMO and a wide range of state-of-the-art optimizers while achieving generally better efficiency with the best speedup between 2.2x and 20x.

* This paper has been accepted at ACM FSE'24
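
The AdMMO abstract above refers to maintaining "a good proportion of the nondominated configurations". As a minimal sketch of what that proportion means (not the AdMMO algorithm itself), the snippet below filters a population down to its Pareto-nondominated members and reports their share. The function names and the assumption that both objectives are minimized are illustrative and do not come from the paper.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` Pareto-dominates `b`, assuming every objective is minimized.

    `a` dominates `b` when it is no worse on all objectives and strictly
    better on at least one.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def nondominated(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the points that no other point in the list dominates."""
    return [
        p
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


def nondominated_ratio(points: List[Tuple[float, ...]]) -> float:
    """Share of the population that is nondominated."""
    return len(nondominated(points)) / len(points)


if __name__ == "__main__":
    # Hypothetical (target objective, auxiliary objective) pairs.
    population = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0)]
    print(nondominated(population))
    print(nondominated_ratio(population))
```

In this toy population, (3.0, 4.0) is dominated by (2.0, 3.0), so three of the four points remain. Exact duplicates never dominate each other under this definition, so they are all retained by the filter.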

Once for Both: Single Stage of Importance and Sparsity Search for Vision Transformer Compression

Mar 23, 2024

Hancheng Ye, Chong Yu, Peng Ye, Renqiu Xia, Yansong Tang, Jiwen Lu, Tao Chen, Bo Zhang

Abstract:Recent Vision Transformer Compression (VTC) works mainly follow a two-stage scheme, where the importance score of each model unit is first evaluated or preset in each submodule, followed by the sparsity score evaluation according to the target sparsity constraint. Such a separate evaluation process induces a gap between the importance and sparsity score distributions, thus causing high search costs for VTC. In this work, for the first time, we investigate how to integrate the evaluations of importance and sparsity scores into a single stage, searching the optimal subnets in an efficient manner. Specifically, we present Once for Both (OFB), a cost-efficient approach for VTC that simultaneously evaluates both importance and sparsity scores. First, a bi-mask scheme is developed by entangling the importance score and the differentiable sparsity score to jointly determine the pruning potential (prunability) of each unit. Such a bi-mask search strategy is further used together with a proposed adaptive one-hot loss to realize the progressive-and-efficient search for the most important subnet. Finally, Progressive Masked Image Modeling (PMIM) is proposed to regularize the feature space to be more representative during the search process, which may be degraded by the dimension reduction. Extensive experiments demonstrate that OFB can achieve superior compression performance over state-of-the-art searching-based and pruning-based methods under various Vision Transformer architectures, while significantly improving search efficiency, e.g., costing one GPU search day for the compression of DeiT-S on ImageNet-1K.

* Accepted by CVPR 2024. Our code will be available at www.github.com/HankYe/Once-for-Both

Learning Physical Dynamics for Object-centric Visual Prediction

Mar 15, 2024

Huilin Xu, Tao Chen, Feng Xu

Abstract:The ability to model the underlying dynamics of visual scenes and reason about the future is central to human intelligence. Many attempts have been made to empower intelligent systems with such physical understanding and prediction abilities. However, most existing methods focus on pixel-to-pixel prediction, which suffers from heavy computational costs while lacking a deep understanding of the physical dynamics behind videos. Recently, object-centric prediction methods have emerged and attracted increasing interest. Inspired by them, this paper proposes an unsupervised object-centric prediction model that makes future predictions by learning visual dynamics between objects. Our model consists of two modules: a perceptual module and a dynamic module. The perceptual module is utilized to decompose images into several objects and synthesize images with a set of object-centric representations. The dynamic module fuses contextual information, takes environment-object and object-object interaction into account, and predicts the future trajectory of objects. Extensive experiments are conducted to validate the effectiveness of the proposed method. Both quantitative and qualitative experimental results demonstrate that our model generates higher visual quality and more physically reliable predictions compared to the state-of-the-art methods.

* 13 pages, 10 figures

Enhanced Sparsification via Stimulative Training

Mar 11, 2024

Shengji Tang, Weihao Lin, Hancheng Ye, Peng Ye, Chong Yu, Baopu Li, Tao Chen

Abstract:Sparsification-based pruning has been an important category in model compression. Existing methods commonly set sparsity-inducing penalty terms to suppress the importance of dropped weights, which is regarded as the suppressed sparsification paradigm. However, this paradigm inactivates the dropped parts of networks, causing capacity damage before pruning and thereby leading to performance degradation. To alleviate this issue, we first study and reveal the relative sparsity effect in emerging stimulative training and then propose a structured pruning framework, named STP, based on an enhanced sparsification paradigm which maintains the magnitude of dropped weights and enhances the expressivity of kept weights by self-distillation. Besides, to find an optimal architecture for the pruned network, we propose a multi-dimension architecture space and a knowledge distillation-guided exploration strategy. To reduce the huge capacity gap of distillation, we propose a subnet mutating expansion technique. Extensive experiments on various benchmarks indicate the effectiveness of STP. Specifically, without fine-tuning, our method consistently achieves superior performance at different budgets, especially under extremely aggressive pruning scenarios, e.g., retaining 95.11% of the Top-1 accuracy (72.43% of 76.15%) while reducing FLOPs by 85% for ResNet-50 on ImageNet. Code will be released soon.

* 26 pages
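
The STP abstract above relates a pruned Top-1 accuracy of 72.43% to a baseline of 76.15% and reports this as 95.11%. As a quick check of that arithmetic, the sketch below computes the retained fraction directly. The helper name `retained_fraction` is purely illustrative and is not taken from the paper or its code.

```python
def retained_fraction(pruned_acc: float, baseline_acc: float) -> float:
    """Fraction of the baseline accuracy that the pruned model keeps."""
    return pruned_acc / baseline_acc


# Figures quoted in the abstract: 72.43% (pruned) versus 76.15% (baseline).
ratio = retained_fraction(72.43, 76.15)
print(f"{ratio:.2%}")  # prints 95.11%
```

The ratio describes accuracy relative to the unpruned ResNet-50, so the absolute drop is 76.15 - 72.43 = 3.72 percentage points.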

Low-dose CT Denoising with Language-engaged Dual-space Alignment

Mar 10, 2024

Zhihao Chen, Tao Chen, Chenhui Wang, Chuang Niu, Ge Wang, Hongming Shan

Abstract:While various deep learning methods have been proposed for low-dose computed tomography (CT) denoising, they often suffer from over-smoothing, blurring, and lack of explainability. To alleviate these issues, we propose a plug-and-play Language-Engaged Dual-space Alignment loss (LEDA) to optimize low-dose CT denoising models. Our idea is to leverage large language models (LLMs) to align denoised CT and normal-dose CT images in both the continuous perceptual space and the discrete semantic space, which is the first LLM-based scheme for low-dose CT denoising. LEDA involves two steps: the first is to pretrain an LLM-guided CT autoencoder, which can encode a CT image into continuous high-level features and quantize them into a token space to produce semantic tokens derived from the LLM's vocabulary; and the second is to minimize the discrepancy between the denoised CT images and normal-dose CT images in terms of both encoded high-level features and quantized token embeddings derived by the LLM-guided CT autoencoder. Extensive experimental results on two public LDCT denoising datasets demonstrate that our LEDA can enhance existing denoising models in terms of quantitative metrics and qualitative evaluation, and also provide explainability through language-level image understanding. Source code is available at https://github.com/hao1635/LEDA.

* 11 pages, 6 figures

MedFLIP: Medical Vision-and-Language Self-supervised Fast Pre-Training with Masked Autoencoder

Mar 07, 2024

Lei Li, Tianfang Zhang, Xinglin Zhang, Jiaqi Liu, Bingqi Ma, Yan Luo, Tao Chen

Abstract:Within the domain of medical analysis, extensive research has explored the potential of mutual learning between Masked Autoencoders (MAEs) and multimodal data. However, the impact of MAEs on intermodality remains a key challenge. We introduce MedFLIP, a Fast Language-Image Pre-training method for Medical analysis. We explore MAEs for zero-shot learning with crossed domains, which enhances the model's ability to learn from limited data, a common scenario in medical diagnostics. We verify that masking an image does not affect intermodal learning. Furthermore, we propose the SVD loss to enhance the representation learning for characteristics of medical images, aiming to improve classification accuracy by leveraging the structural intricacies of such data. Lastly, we validate that using language improves the zero-shot performance for medical image analysis. MedFLIP's scaling of the masking process marks an advancement in the field, offering a pathway to rapid and precise medical image analysis without the traditional computational bottlenecks. Through experiments and validation, MedFLIP demonstrates efficient performance improvements, setting an explored standard for future research and application in medical diagnostics.

Cascaded Self-supervised Learning for Subject-independent EEG-based Emotion Recognition

Mar 06, 2024

Hanqi Wang, Tao Chen, Liang Song

Abstract:EEG-based emotion recognition holds significant promise for applications in human-computer interaction, medicine, and neuroscience. While deep learning has shown potential in this field, current approaches usually rely on large-scale high-quality labeled datasets, limiting the performance of deep learning. Self-supervised learning offers a solution by automatically generating labels, but its inter-subject generalizability remains under-explored. For this reason, our interest lies in offering a self-supervised learning paradigm with better inter-subject generalizability. Inspired by recent efforts in combining low-level and high-level tasks in deep learning, we propose a cascaded self-supervised architecture for EEG emotion recognition. Then, we introduce a low-level task, time-to-frequency reconstruction (TFR). This task leverages the inherent time-frequency relationship in EEG signals. Our architecture integrates it with the high-level contrastive learning modules, performing self-supervised learning for EEG-based emotion recognition. Experiments on the DEAP and DREAMER datasets demonstrate the superior performance of our method over similar works. The results also highlight the indispensability of the TFR task and the robustness of our method to label scarcity, validating the effectiveness of the proposed method.

Reconciling Reality through Simulation: A Real-to-Sim-to-Real Approach for Robust Manipulation

Mar 06, 2024

Marcel Torne, Anthony Simeonov, Zechu Li, April Chan, Tao Chen, Abhishek Gupta, Pulkit Agrawal

Abstract:Imitation learning methods need significant human supervision to learn policies robust to changes in object poses, physical disturbances, and visual distractors. Reinforcement learning, on the other hand, can explore the environment autonomously to learn robust behaviors but may require impractical amounts of unsafe real-world data collection. To learn performant, robust policies without the burden of unsafe real-world data collection or extensive human supervision, we propose RialTo, a system for robustifying real-world imitation learning policies via reinforcement learning in "digital twin" simulation environments constructed on the fly from small amounts of real-world data. To enable this real-to-sim-to-real pipeline, RialTo proposes an easy-to-use interface for quickly scanning and constructing digital twins of real-world environments. We also introduce a novel "inverse distillation" procedure for bringing real-world demonstrations into simulated environments for efficient fine-tuning, with minimal human intervention and engineering required. We evaluate RialTo across a variety of robotic manipulation problems in the real world, such as robustly stacking dishes on a rack, placing books on a shelf, and six other tasks. RialTo achieves an increase (over 67%) in policy robustness without requiring extensive human data collection. Project website and videos at https://real-to-sim-to-real.github.io/RialTo/

* Project page: https://real-to-sim-to-real.github.io/RialTo/
