Jie Zhang

Chongqing Jinshan Science & Technology

CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder

Dec 12, 2024

Jianwei Cui, Yu Gu, Shihao Chen, Jie Zhang, Liping Chen, Lirong Dai

Abstract:Singing Voice Synthesis (SVS) aims to generate singing voices of high fidelity and expressiveness. Conventional SVS systems usually utilize an acoustic model to transform a music score into acoustic features, followed by a vocoder to reconstruct the singing voice. It was recently shown that end-to-end modeling is effective in the fields of SVS and Text-to-Speech (TTS). In this work, we thus present a fully end-to-end SVS method together with chunkwise streaming inference to address the latency issue in practical use. Note that this is the first attempt to fully implement end-to-end streaming audio synthesis using latent representations in a VAE. We have made specific improvements to enhance the performance of streaming SVS using latent representations. Experimental results demonstrate that the proposed method achieves synthesized audio with high expressiveness and pitch accuracy in both streaming SVS and TTS tasks.

* Accepted by AAAI 2025

FaceTracer: Unveiling Source Identities from Swapped Face Images and Videos for Fraud Prevention

Dec 11, 2024

Zhongyi Zhang, Jie Zhang, Wenbo Zhou, Xinghui Zhou, Qing Guo, Weiming Zhang, Tianwei Zhang, Nenghai Yu

Abstract:Face-swapping techniques have advanced rapidly with the evolution of deep learning, leading to widespread use and growing concerns about potential misuse, especially in cases of fraud. While many efforts have focused on detecting swapped face images or videos, these methods are insufficient for tracing the malicious users behind fraudulent activities. Intrusive watermark-based approaches also fail to trace unmarked identities, limiting their practical utility. To address these challenges, we introduce FaceTracer, the first non-intrusive framework specifically designed to trace the identity of the source person from swapped face images or videos. Specifically, FaceTracer leverages a disentanglement module that effectively suppresses identity information related to the target person while isolating the identity features of the source person. This allows us to extract robust identity information that can directly link the swapped face back to the original individual, aiding in uncovering the actors behind fraudulent activities. Extensive experiments demonstrate FaceTracer's effectiveness across various face-swapping techniques, successfully identifying the source person in swapped content and enabling the tracing of malicious actors involved in fraudulent activities. Additionally, FaceTracer shows strong transferability to unseen face-swapping methods including commercial applications and robustness against transmission distortions and adaptive attacks.

* 17 pages, 18 figures, under review

MAGIC: Mastering Physical Adversarial Generation in Context through Collaborative LLM Agents

Dec 11, 2024

Yun Xing, Nhat Chung, Jie Zhang, Yue Cao, Ivor Tsang, Yang Liu, Lei Ma, Qing Guo

Abstract:Physical adversarial attacks in driving scenarios can expose critical vulnerabilities in visual perception models. However, developing such attacks remains challenging due to diverse real-world backgrounds and the requirement for maintaining visual naturalness. To address this challenge, we reformulate physical adversarial attacks as a one-shot patch-generation problem. Our approach generates adversarial patches through a deep generative model that considers the specific scene context, enabling direct physical deployment in matching environments. The primary challenge lies in simultaneously achieving two objectives: generating adversarial patches that effectively mislead object detection systems while determining contextually appropriate placement within the scene. We propose MAGIC (Mastering Physical Adversarial Generation In Context), a novel framework powered by multi-modal LLM agents to address these challenges. MAGIC automatically understands scene context and orchestrates adversarial patch generation through the synergistic interaction of language and vision capabilities. MAGIC orchestrates three specialized LLM agents: the adv-patch generation agent (GAgent) masters the creation of deceptive patches through strategic prompt engineering for text-to-image models; the adv-patch deployment agent (DAgent) ensures contextual coherence by determining optimal placement strategies based on scene understanding; and the self-examination agent (EAgent) completes this trilogy by providing critical oversight and iterative refinement of both processes. We validate our method at both the digital and physical levels, i.e., on nuImage and on manually captured real scenes, where both statistical and visual results show that MAGIC is powerful and effective at attacking widely used object detection systems.

Physics-informed Deep Learning for Muscle Force Prediction with Unlabeled sEMG Signals

Dec 05, 2024

Shuhao Ma, Jie Zhang, Chaoyang Shi, Pei Di, Ian D. Robertson, Zhi-Qiang Zhang

Abstract:Computational biomechanical analysis plays a pivotal role in understanding and improving human movements and physical functions. Although physics-based modeling methods can interpret the dynamic interaction between the neural drive to muscle dynamics and joint kinematics, they suffer from high computational latency. In recent years, data-driven methods have emerged as a promising alternative due to their fast execution speed, but label information is still required during training, which is not easy to acquire in practice. To tackle these issues, this paper presents a novel physics-informed deep learning method to predict muscle forces without any label information during model training. In addition, the proposed method can also identify personalized muscle-tendon parameters. To achieve this, the Hill muscle model-based forward dynamics is embedded into the deep neural network as an additional loss to further regulate the behavior of the network. Experimental validations on the wrist joint from six healthy subjects are performed, and a fully connected neural network (FNN) is selected to implement the proposed method. The predicted muscle forces show comparable or even lower root mean square error (RMSE) and a higher coefficient of determination compared with baseline methods, which have to use labeled surface electromyography (sEMG) signals. The method can also identify muscle-tendon parameters accurately, demonstrating the effectiveness of the proposed physics-informed deep learning method.

* IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 32, pp. 1246-1256, 2024
* 11 pages, 8 figures, journal

MERGE: Multi-faceted Hierarchical Graph-based GNN for Gene Expression Prediction from Whole Slide Histopathology Images

Dec 03, 2024

Aniruddha Ganguly, Debolina Chatterjee, Wentao Huang, Jie Zhang, Alisa Yurovsky, Travis Steele Johnson, Chao Chen

Abstract:Recent advances in Spatial Transcriptomics (ST) pair histology images with spatially resolved gene expression profiles, enabling predictions of gene expression across different tissue locations based on image patches. This opens up new possibilities for enhancing whole slide image (WSI) prediction tasks with localized gene expression. However, existing methods fail to fully leverage the interactions between different tissue locations, which are crucial for accurate joint prediction. To address this, we introduce MERGE (Multi-faceted hiErarchical gRaph for Gene Expressions), which combines a multi-faceted hierarchical graph construction strategy with graph neural networks (GNN) to improve gene expression predictions from WSIs. By clustering tissue image patches based on both spatial and morphological features, and incorporating intra- and inter-cluster edges, our approach fosters interactions between distant tissue locations during GNN learning. As an additional contribution, we evaluate different data smoothing techniques that are necessary to mitigate artifacts in ST data, often caused by technical imperfections. We advocate for adopting gene-aware smoothing methods that are more biologically justified. Experimental results on gene expression prediction show that our GNN method outperforms state-of-the-art techniques across multiple metrics.

* Main Paper: 8 pages, Supplementary Material: 9 pages, Figures: 16

CA-MoE: Channel-Adapted MoE for Incremental Weather Forecasting

Dec 03, 2024

Hao Chen, Han Tao, Guo Song, Jie Zhang, Yunlong Yu, Yonghan Dong, Chuang Yang, Lei Bai

Abstract:Atmospheric science is intricately connected with other fields, e.g., geography and aerospace. Most existing approaches involve training a joint atmospheric and geographic model from scratch, which incurs significant computational costs and overlooks the potential for incremental learning of weather variables across different domains. In this paper, we introduce incremental learning to weather forecasting and propose a novel structure that allows for the flexible expansion of variables within the model. Specifically, our method presents a Channel-Adapted MoE (CA-MoE) that employs a divide-and-conquer strategy. This strategy assigns variable training tasks to different experts by index embedding and reduces computational complexity through a channel-wise Top-K strategy. Experiments conducted on the widely utilized ERA5 dataset reveal that our method, utilizing only approximately 15% of trainable parameters during the incremental stage, attains performance that is on par with state-of-the-art competitors. Notably, in the context of variable incremental experiments, our method demonstrates negligible issues with catastrophic forgetting.
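
The channel-wise Top-K routing mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `top_k_gate` and `channelwise_moe`, the softmax renormalization over the selected experts, and the use of precomputed gate scores (standing in for whatever the index embedding produces) are all assumptions made for illustration.

```python
import math


def top_k_gate(scores, k):
    """Keep the k largest gate scores, softmax over them, zero out the rest.

    Assumption: renormalizing with a softmax over the kept scores; the
    abstract does not say how CA-MoE normalizes its gates.
    """
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k highest-scoring experts (ties keep the lower index).
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]


def channelwise_moe(channel_values, gate_scores, experts, k):
    """Route each channel (weather variable) independently through its top-k experts.

    Hypothetical interface: one scalar value and one score vector per channel.
    Only experts with a non-zero gate weight are evaluated, which is where
    the compute saving of Top-K routing comes from.
    """
    outputs = []
    for value, scores in zip(channel_values, gate_scores):
        weights = top_k_gate(scores, k)
        mixed = sum(w * experts[i](value) for i, w in enumerate(weights) if w > 0.0)
        outputs.append(mixed)
    return outputs
```

With `k` smaller than the number of experts, each channel touches only a subset of experts per forward pass, so adding experts for new variables does not raise the per-channel cost proportionally.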

MRP-LLM: Multitask Reflective Large Language Models for Privacy-Preserving Next POI Recommendation

Dec 03, 2024

Ziqing Wu, Zhu Sun, Dongxia Wang, Lu Zhang, Jie Zhang, Yew Soon Ong

Abstract:Large language models (LLMs) have shown promising potential for next Point-of-Interest (POI) recommendation. However, existing methods only perform direct zero-shot prompting, leading to ineffective extraction of user preferences, insufficient injection of collaborative signals, and a lack of user privacy protection. As such, we propose a novel Multitask Reflective Large Language Model for Privacy-preserving Next POI Recommendation (MRP-LLM), aiming to exploit LLMs for better next POI recommendation while preserving user privacy. Specifically, the Multitask Reflective Preference Extraction Module first utilizes LLMs to distill each user's fine-grained (i.e., categorical, temporal, and spatial) preferences into a knowledge base (KB). The Neighbor Preference Retrieval Module retrieves and summarizes the preferences of similar users from the KB to obtain collaborative signals. Subsequently, aggregating the user's preferences with those of similar users, the Multitask Next POI Recommendation Module generates the next POI recommendations via multitask prompting. Meanwhile, during data collection, a Privacy Transmission Module is specifically devised to preserve sensitive POI data. Extensive experiments on three real-world datasets demonstrate the efficacy of our proposed MRP-LLM in providing more accurate next POI recommendations with user privacy preserved.

* 14 pages, 7 figures

Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey

Dec 03, 2024

Yunkai Dang, Kaichen Huang, Jiahao Huo, Yibo Yan, Sirui Huang, Dongrui Liu, Mengxi Gao, Jie Zhang, Chen Qian, Kun Wang (+4 more)

Abstract:The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with large language models (LLMs) and computer vision (CV) systems driving advancements in natural language understanding and visual processing, respectively. The convergence of these technologies has catalyzed the rise of multimodal AI, enabling richer, cross-modal understanding that spans text, vision, audio, and video modalities. Multimodal large language models (MLLMs), in particular, have emerged as a powerful framework, demonstrating impressive capabilities in tasks like image-text generation, visual question answering, and cross-modal retrieval. Despite these advancements, the complexity and scale of MLLMs introduce significant challenges in interpretability and explainability, essential for establishing transparency, trustworthiness, and reliability in high-stakes applications. This paper provides a comprehensive survey on the interpretability and explainability of MLLMs, proposing a novel framework that categorizes existing research across three perspectives: (I) Data, (II) Model, (III) Training & Inference. We systematically analyze interpretability from token-level to embedding-level representations, assess approaches related to both architecture analysis and design, and explore training and inference strategies that enhance transparency. By comparing various methodologies, we identify their strengths and limitations and propose future research directions to address unresolved challenges in multimodal explainability. This survey offers a foundational resource for advancing interpretability and transparency in MLLMs, guiding researchers and practitioners toward developing more accountable and robust multimodal AI systems.

From ChebNet to ChebGibbsNet

Dec 02, 2024

Jie Zhang, Min-Te Sun

Abstract:Recent advancements in Spectral Graph Convolutional Networks (SpecGCNs) have led to state-of-the-art performance in various graph representation learning tasks. To exploit the potential of SpecGCNs, we analyze corresponding graph filters via polynomial interpolation, the cornerstone of graph signal processing. Different polynomial bases, such as the Bernstein, Chebyshev, and monomial bases, have various convergence rates that affect the error in polynomial interpolation. Although adopting the Chebyshev basis for interpolation can minimize the maximum error, the performance of ChebNet is still weaker than that of GPR-GNN and BernNet. We point out that this is caused by the Gibbs phenomenon, which occurs when the graph frequency response function approximates the target function. It reduces the approximation ability of a truncated polynomial interpolation. In order to mitigate the Gibbs phenomenon, we propose to add a Gibbs damping factor to each term of the Chebyshev polynomials in ChebNet. As a result, our lightweight approach leads to a significant performance boost. Afterwards, we reorganize ChebNet by decoupling feature propagation and transformation. We name this variant ChebGibbsNet. Our experiments indicate that ChebGibbsNet is superior to other advanced SpecGCNs, such as GPR-GNN and BernNet, on both homogeneous graphs and heterogeneous graphs.

* 12 pages, 2 figures, and 7 tables
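
The idea of attaching a damping factor to each Chebyshev term, as described in the abstract above, can be sketched for a scalar frequency response. The abstract does not specify which damping factor ChebGibbsNet uses, so the Lanczos-style sigma factor below (including its `order + 1` denominator) is an assumption chosen as one classical remedy for Gibbs oscillations; the names `lanczos_sigma` and `damped_chebyshev` are hypothetical.

```python
import math


def lanczos_sigma(k, order):
    """Lanczos-style sigma factor for term k of a series truncated at `order`.

    Assumption: this damping choice is illustrative and is not taken from
    the paper. The factor is 1 for k = 0 and shrinks as k grows.
    """
    if k == 0:
        return 1.0
    t = math.pi * k / (order + 1)
    return math.sin(t) / t


def damped_chebyshev(coeffs, x, damping=lanczos_sigma):
    """Evaluate sum_k coeffs[k] * damping(k, K) * T_k(x) for x in [-1, 1].

    Uses the three-term recurrence T_0 = 1, T_1 = x,
    T_{k+1}(x) = 2 x T_k(x) - T_{k-1}(x).
    """
    order = len(coeffs) - 1
    t_prev, t_curr = 1.0, x  # T_0(x) and T_1(x)
    total = coeffs[0] * damping(0, order)
    for k in range(1, order + 1):
        total += coeffs[k] * damping(k, order) * t_curr
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return total
```

In a graph filter, `x` would be replaced by a rescaled graph Laplacian applied to node features; the scalar form is used here only to keep the damping step visible.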

SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments

Nov 28, 2024

Yue Cao, Yun Xing, Jie Zhang, Di Lin, Tianwei Zhang, Ivor Tsang, Yang Liu, Qing Guo

Abstract:Large vision-language models (LVLMs) have shown remarkable capabilities in interpreting visual content. While existing works demonstrate these models' vulnerability to deliberately placed adversarial texts, such texts are often easily identifiable as anomalous. In this paper, we present the first approach to generate scene-coherent typographic adversarial attacks that mislead advanced LVLMs while maintaining visual naturalness through the capability of an LLM-based agent. Our approach addresses three critical questions: what adversarial text to generate, where to place it within the scene, and how to integrate it seamlessly. We propose a training-free, multi-modal LLM-driven scene-coherent typographic adversarial planner (SceneTAP) that employs a three-stage process: scene understanding, adversarial planning, and seamless integration. SceneTAP utilizes chain-of-thought reasoning to comprehend the scene, formulate effective adversarial text, strategically plan its placement, and provide detailed instructions for natural integration within the image. This is followed by a scene-coherent TextDiffuser that executes the attack using a local diffusion mechanism. We extend our method to real-world scenarios by printing and placing generated patches in physical environments, demonstrating its practical implications. Extensive experiments show that our scene-coherent adversarial text successfully misleads state-of-the-art LVLMs, including ChatGPT-4o, even after capturing new images of physical setups. Our evaluations demonstrate a significant increase in attack success rates while maintaining visual naturalness and contextual appropriateness. This work highlights vulnerabilities in current vision-language models to sophisticated, scene-coherent adversarial attacks and provides insights into potential defense mechanisms.
