Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Dacheng Tao

and Other Contributors

CogMorph: Cognitive Morphing Attacks for Text-to-Image Models

Jan 21, 2025

Zonglei Jing, Zonghao Ying, Le Wang, Siyuan Liang, Aishan Liu, Xianglong Liu, Dacheng Tao

Figure 1 for CogMorph: Cognitive Morphing Attacks for Text-to-Image Models

Figure 2 for CogMorph: Cognitive Morphing Attacks for Text-to-Image Models

Figure 3 for CogMorph: Cognitive Morphing Attacks for Text-to-Image Models

Figure 4 for CogMorph: Cognitive Morphing Attacks for Text-to-Image Models

Abstract:The development of text-to-image (T2I) generative models, that enable the creation of high-quality synthetic images from textual prompts, has opened new frontiers in creative design and content generation. However, this paper reveals a significant and previously unrecognized ethical risk inherent in this technology and introduces a novel method, termed the Cognitive Morphing Attack (CogMorph), which manipulates T2I models to generate images that retain the original core subjects but embeds toxic or harmful contextual elements. This nuanced manipulation exploits the cognitive principle that human perception of concepts is shaped by the entire visual scene and its context, producing images that amplify emotional harm far beyond attacks that merely preserve the original semantics. To address this, we first construct an imagery toxicity taxonomy spanning 10 major and 48 sub-categories, aligned with human cognitive-perceptual dimensions, and further build a toxicity risk matrix resulting in 1,176 high-quality T2I toxic prompts. Based on this, our CogMorph first introduces Cognitive Toxicity Augmentation, which develops a cognitive toxicity knowledge base with rich external toxic representations for humans (e.g., fine-grained visual features) that can be utilized to further guide the optimization of adversarial prompts. In addition, we present Contextual Hierarchical Morphing, which hierarchically extracts critical parts of the original prompt (e.g., scenes, subjects, and body parts), and then iteratively retrieves and fuses toxic features to inject harmful contexts. Extensive experiments on multiple open-sourced T2I models and black-box commercial APIs (e.g., DALLE-3) demonstrate the efficacy of CogMorph which significantly outperforms other baselines by large margins (+20.62\% on average).

Via

Access Paper or Ask Questions

Merging Models on the Fly Without Retraining: A Sequential Approach to Scalable Continual Model Merging

Jan 16, 2025

Anke Tang, Enneng Yang, Li Shen, Yong Luo, Han Hu, Bo Du, Dacheng Tao

Figure 1 for Merging Models on the Fly Without Retraining: A Sequential Approach to Scalable Continual Model Merging

Figure 2 for Merging Models on the Fly Without Retraining: A Sequential Approach to Scalable Continual Model Merging

Figure 3 for Merging Models on the Fly Without Retraining: A Sequential Approach to Scalable Continual Model Merging

Figure 4 for Merging Models on the Fly Without Retraining: A Sequential Approach to Scalable Continual Model Merging

Abstract:Deep model merging represents an emerging research direction that combines multiple fine-tuned models to harness their specialized capabilities across different tasks and domains. Current model merging techniques focus on merging all available models simultaneously, with weight interpolation-based methods being the predominant approaches. However, these conventional approaches are not well-suited for scenarios where models become available sequentially, and they often suffer from high memory requirements and potential interference between tasks. In this study, we propose a training-free projection-based continual merging method that processes models sequentially through orthogonal projections of weight matrices and adaptive scaling mechanisms. Our method operates by projecting new parameter updates onto subspaces orthogonal to existing merged parameter updates while using an adaptive scaling mechanism to maintain stable parameter distances, enabling efficient sequential integration of task-specific knowledge. Our approach maintains constant memory complexity to the number of models, minimizes interference between tasks through orthogonal projections, and retains the performance of previously merged models through adaptive task vector scaling. Extensive experiments on CLIP-ViT models demonstrate that our method achieves a 5-8% average accuracy improvement while maintaining robust performance in different task orderings.

Via

Access Paper or Ask Questions

Towards Robust and Realistic Human Pose Estimation via WiFi Signals

Jan 16, 2025

Yang Chen, Jingcai Guo, Song Guo, Jingren Zhou, Dacheng Tao

Figure 1 for Towards Robust and Realistic Human Pose Estimation via WiFi Signals

Figure 2 for Towards Robust and Realistic Human Pose Estimation via WiFi Signals

Figure 3 for Towards Robust and Realistic Human Pose Estimation via WiFi Signals

Figure 4 for Towards Robust and Realistic Human Pose Estimation via WiFi Signals

Abstract:Robust WiFi-based human pose estimation is a challenging task that bridges discrete and subtle WiFi signals to human skeletons. This paper revisits this problem and reveals two critical yet overlooked issues: 1) cross-domain gap, i.e., due to significant variations between source-target domain pose distributions; and 2) structural fidelity gap, i.e., predicted skeletal poses manifest distorted topology, usually with misplaced joints and disproportionate bone lengths. This paper fills these gaps by reformulating the task into a novel two-phase framework dubbed DT-Pose: Domain-consistent representation learning and Topology-constrained Pose decoding. Concretely, we first propose a temporal-consistent contrastive learning strategy with uniformity regularization, coupled with self-supervised masking-reconstruction operations, to enable robust learning of domain-consistent and motion-discriminative WiFi-specific representations. Beyond this, we introduce a simple yet effective pose decoder with task prompts, which integrates Graph Convolution Network (GCN) and Transformer layers to constrain the topology structure of the generated skeleton by exploring the adjacent-overarching relationships among human joints. Extensive experiments conducted on various benchmark datasets highlight the superior performance of our method in tackling these fundamental challenges in both 2D/3D human pose estimation tasks.

* 15 pages, 9 figures

Via

Access Paper or Ask Questions

Modeling All Response Surfaces in One for Conditional Search Spaces

Jan 08, 2025

Jiaxing Li, Wei Liu, Chao Xue, Yibing Zhan, Xiaoxing Wang, Weifeng Liu, Dacheng Tao

Figure 1 for Modeling All Response Surfaces in One for Conditional Search Spaces

Figure 2 for Modeling All Response Surfaces in One for Conditional Search Spaces

Figure 3 for Modeling All Response Surfaces in One for Conditional Search Spaces

Figure 4 for Modeling All Response Surfaces in One for Conditional Search Spaces

Abstract:Bayesian Optimization (BO) is a sample-efficient black-box optimizer commonly used in search spaces where hyperparameters are independent. However, in many practical AutoML scenarios, there will be dependencies among hyperparameters, forming a conditional search space, which can be partitioned into structurally distinct subspaces. The structure and dimensionality of hyperparameter configurations vary across these subspaces, challenging the application of BO. Some previous BO works have proposed solutions to develop multiple Gaussian Process models in these subspaces. However, these approaches tend to be inefficient as they require a substantial number of observations to guarantee each GP's performance and cannot capture relationships between hyperparameters across different subspaces. To address these issues, this paper proposes a novel approach to model the response surfaces of all subspaces in one, which can model the relationships between hyperparameters elegantly via a self-attention mechanism. Concretely, we design a structure-aware hyperparameter embedding to preserve the structural information. Then, we introduce an attention-based deep feature extractor, capable of projecting configurations with different structures from various subspaces into a unified feature space, where the response surfaces can be formulated using a single standard Gaussian Process. The empirical results on a simulation function, various real-world tasks, and HPO-B benchmark demonstrate that our proposed approach improves the efficacy and efficiency of BO within conditional search spaces.

Via

Access Paper or Ask Questions

Free-Form Motion Control: A Synthetic Video Generation Dataset with Controllable Camera and Object Motions

Jan 03, 2025

Xincheng Shuai, Henghui Ding, Zhenyuan Qin, Hao Luo, Xingjun Ma, Dacheng Tao

Abstract:Controlling the movements of dynamic objects and the camera within generated videos is a meaningful yet challenging task. Due to the lack of datasets with comprehensive motion annotations, existing algorithms can not simultaneously control the motions of both camera and objects, resulting in limited controllability over generated contents. To address this issue and facilitate the research in this field, we introduce a Synthetic Dataset for Free-Form Motion Control (SynFMC). The proposed SynFMC dataset includes diverse objects and environments and covers various motion patterns according to specific rules, simulating common and complex real-world scenarios. The complete 6D pose information facilitates models learning to disentangle the motion effects from objects and the camera in a video. To validate the effectiveness and generalization of SynFMC, we further propose a method, Free-Form Motion Control (FMC). FMC enables independent or simultaneous control of object and camera movements, producing high-fidelity videos. Moreover, it is compatible with various personalized text-to-image (T2I) models for different content styles. Extensive experiments demonstrate that the proposed FMC outperforms previous methods across multiple scenarios.

* Project Page: https://henghuiding.github.io/SynFMC/

Via

Access Paper or Ask Questions

Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

Dec 24, 2024

Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen(+1 more)

Figure 1 for Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

Figure 2 for Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

Figure 3 for Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

Figure 4 for Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

Abstract:In this work, we aim to develop an MLLM that understands and solves questions by learning to create each intermediate step of the reasoning involved till the final answer. To this end, we propose Collective Monte Carlo Tree Search (CoMCTS), a new learning-to-reason method for MLLMs, which introduces the concept of collective learning into ``tree search'' for effective and efficient reasoning-path searching and learning. The core idea of CoMCTS is to leverage collective knowledge from multiple models to collaboratively conjecture, search and identify effective reasoning paths toward correct answers via four iterative operations including Expansion, Simulation and Error Positioning, Backpropagation, and Selection. Using CoMCTS, we construct Mulberry-260k, a multimodal dataset with a tree of rich, explicit and well-defined reasoning nodes for each question. With Mulberry-260k, we perform collective SFT to train our model, Mulberry, a series of MLLMs with o1-like step-by-step Reasoning and Reflection capabilities. Extensive experiments demonstrate the superiority of our proposed methods on various benchmarks. Code will be available at https://github.com/HJYao00/Mulberry

* Technical report

Via

Access Paper or Ask Questions

Red Pill and Blue Pill: Controllable Website Fingerprinting Defense via Dynamic Backdoor Learning

Dec 16, 2024

Siyuan Liang, Jiajun Gong, Tianmeng Fang, Aishan Liu, Tao Wang, Xianglong Liu, Xiaochun Cao, Dacheng Tao, Chang Ee-Chien

Figure 1 for Red Pill and Blue Pill: Controllable Website Fingerprinting Defense via Dynamic Backdoor Learning

Figure 2 for Red Pill and Blue Pill: Controllable Website Fingerprinting Defense via Dynamic Backdoor Learning

Figure 3 for Red Pill and Blue Pill: Controllable Website Fingerprinting Defense via Dynamic Backdoor Learning

Figure 4 for Red Pill and Blue Pill: Controllable Website Fingerprinting Defense via Dynamic Backdoor Learning

Abstract:Website fingerprint (WF) attacks, which covertly monitor user communications to identify the web pages they visit, pose a serious threat to user privacy. Existing WF defenses attempt to reduce the attacker's accuracy by disrupting unique traffic patterns; however, they often suffer from the trade-off between overhead and effectiveness, resulting in less usefulness in practice. To overcome this limitation, we introduce Controllable Website Fingerprint Defense (CWFD), a novel defense perspective based on backdoor learning. CWFD exploits backdoor vulnerabilities in neural networks to directly control the attacker's model by designing trigger patterns based on network traffic. Specifically, CWFD injects only incoming packets on the server side into the target web page's traffic, keeping overhead low while effectively poisoning the attacker's model during training. During inference, the defender can influence the attacker's model through a 'red pill, blue pill' choice: traces with the trigger (red pill) lead to misclassification as the target web page, while normal traces (blue pill) are classified correctly, achieving directed control over the defense outcome. We use the Fast Levenshtein-like distance as the optimization objective to compute trigger patterns that can be effectively associated with our target page. Experiments show that CWFD significantly reduces RF's accuracy from 99% to 6% with 74% data overhead. In comparison, FRONT reduces accuracy to only 97% at similar overhead, while Palette achieves 32% accuracy with 48% more overhead. We further validate the practicality of our method in a real Tor network environment.

* 18 pages, 7 figures

Via

Access Paper or Ask Questions

AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration

Dec 16, 2024

Wenhao Sun, Rong-Cheng Tu, Jingyi Liao, Zhao Jin, Dacheng Tao

Figure 1 for AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration

Figure 2 for AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration

Figure 3 for AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration

Figure 4 for AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration

Abstract:Video Diffusion Transformers (DiTs) have demonstrated significant potential for generating high-fidelity videos but are computationally intensive. Existing acceleration methods include distillation, which requires costly retraining, and feature caching, which is highly sensitive to network architecture. Recent token reduction methods are training-free and architecture-agnostic, offering greater flexibility and wider applicability. However, they enforce the same sequence length across different components, constraining their acceleration potential. We observe that intra-sequence redundancy in video DiTs varies across features, blocks, and denoising timesteps. Building on this observation, we propose Asymmetric Reduction and Restoration (AsymRnR), a training-free approach to accelerate video DiTs. It offers a flexible and adaptive strategy that reduces the number of tokens based on their redundancy to enhance both acceleration and generation quality. We further propose matching cache to facilitate faster processing. Integrated into state-of-the-art video DiTs, AsymRnR achieves a superior speedup without compromising the quality.

* 11 pages, 7 figures

Via

Access Paper or Ask Questions

EMOv2: Pushing 5M Vision Model Frontier

Dec 09, 2024

Jiangning Zhang, Teng Hu, Haoyang He, Zhucun Xue, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li, Dacheng Tao

Figure 1 for EMOv2: Pushing 5M Vision Model Frontier

Figure 2 for EMOv2: Pushing 5M Vision Model Frontier

Figure 3 for EMOv2: Pushing 5M Vision Model Frontier

Figure 4 for EMOv2: Pushing 5M Vision Model Frontier

Abstract:This work focuses on developing parameter-efficient and lightweight models for dense predictions while trading off parameters, FLOPs, and performance. Our goal is to set up the new frontier of the 5M magnitude lightweight model on various downstream tasks. Inverted Residual Block (IRB) serves as the infrastructure for lightweight CNNs, but no counterparts have been recognized by attention-based design. Our work rethinks the lightweight infrastructure of efficient IRB and practical components in Transformer from a unified perspective, extending CNN-based IRB to attention-based models and abstracting a one-residual Meta Mobile Block (MMBlock) for lightweight model design. Following neat but effective design criterion, we deduce a modern Improved Inverted Residual Mobile Block (i2RMB) and improve a hierarchical Efficient MOdel (EMOv2) with no elaborate complex structures. Considering the imperceptible latency for mobile users when downloading models under 4G/5G bandwidth and ensuring model performance, we investigate the performance upper limit of lightweight models with a magnitude of 5M. Extensive experiments on various vision recognition, dense prediction, and image generation tasks demonstrate the superiority of our EMOv2 over state-of-the-art methods, e.g., EMOv2-1M/2M/5M achieve 72.3, 75.8, and 79.4 Top-1 that surpass equal-order CNN-/Attention-based models significantly. At the same time, EMOv2-5M equipped RetinaNet achieves 41.5 mAP for object detection tasks that surpasses the previous EMO-5M by +2.6. When employing the more robust training recipe, our EMOv2-5M eventually achieves 82.9 Top-1 accuracy, which elevates the performance of 5M magnitude models to a new level. Code is available at https://github.com/zhangzjn/EMOv2.

Via

Access Paper or Ask Questions

Unlocking Tuning-Free Few-Shot Adaptability in Visual Foundation Models by Recycling Pre-Tuned LoRAs

Dec 03, 2024

Zixuan Hu, Yongxian Wei, Li Shen, Chun Yuan, Dacheng Tao

Figure 1 for Unlocking Tuning-Free Few-Shot Adaptability in Visual Foundation Models by Recycling Pre-Tuned LoRAs

Figure 2 for Unlocking Tuning-Free Few-Shot Adaptability in Visual Foundation Models by Recycling Pre-Tuned LoRAs

Figure 3 for Unlocking Tuning-Free Few-Shot Adaptability in Visual Foundation Models by Recycling Pre-Tuned LoRAs

Figure 4 for Unlocking Tuning-Free Few-Shot Adaptability in Visual Foundation Models by Recycling Pre-Tuned LoRAs

Abstract:Large Language Models (LLMs) such as ChatGPT demonstrate strong few-shot adaptability without requiring fine-tuning, positioning them ideal for data-limited and real-time applications. However, this adaptability has not yet been replicated in current Visual Foundation Models (VFMs), which require explicit fine-tuning with sufficient tuning data. Besides, the pretraining-finetuning paradigm has led to the surge of numerous task-specific modular components, such as Low-Rank Adaptation (LoRA). For the first time, we explore the potential of reusing diverse pre-tuned LoRAs without accessing their original training data, to achieve tuning-free few-shot adaptation in VFMs. Our framework, LoRA Recycle, distills a meta-LoRA from diverse pre-tuned LoRAs with a meta-learning objective, using surrogate data generated inversely from pre-tuned LoRAs themselves. The VFM, once equipped with the meta-LoRA, is empowered to solve new few-shot tasks in a single forward pass, akin to the in-context learning of LLMs. Additionally, we incorporate a double-efficient mechanism tailored to our framework, significantly accelerating the meta-training process while maintaining or even improving performance. Extensive experiments across various few-shot classification benchmarks across both in- and cross-domain scenarios demonstrate the superiority of our framework.

Via

Access Paper or Ask Questions