Soochow University

Abstract:Federated learning has emerged as a promising paradigm for collaborative model training while preserving data privacy. However, recent studies have shown that it is vulnerable to various privacy attacks, such as data reconstruction attacks. In this paper, we provide a theoretical analysis of privacy leakage in federated learning from two perspectives: linear algebra and optimization theory. From the linear algebra perspective, we prove that when the Jacobian matrix of the batch data is not full rank, there exist different batches of data that produce the same model update, thereby ensuring a level of privacy. We derive a sufficient condition on the batch size to prevent data reconstruction attacks. From the optimization theory perspective, we establish an upper bound on the privacy leakage in terms of the batch size, the distortion extent, and several other factors. Our analysis provides insights into the relationship between privacy leakage and various aspects of federated learning, offering a theoretical foundation for designing privacy-preserving federated learning algorithms.




Abstract:Facial Expression Analysis remains a challenging task due to unexpected task-irrelevant noise, such as identity, head pose, and background. To address this issue, this paper proposes a novel framework, called Norface, that is unified for both Action Unit (AU) analysis and Facial Emotion Recognition (FER) tasks. Norface consists of a normalization network and a classification network. First, the carefully designed normalization network struggles to directly remove the above task-irrelevant noise, by maintaining facial expression consistency but normalizing all original images to a common identity with consistent pose, and background. Then, these additional normalized images are fed into the classification network. Due to consistent identity and other factors (e.g. head pose, background, etc.), the normalized images enable the classification network to extract useful expression information more effectively. Additionally, the classification network incorporates a Mixture of Experts to refine the latent representation, including handling the input of facial representations and the output of multiple (AU or emotion) labels. Extensive experiments validate the carefully designed framework with the insight of identity normalization. The proposed method outperforms existing SOTA methods in multiple facial expression analysis tasks, including AU detection, AU intensity estimation, and FER tasks, as well as their cross-dataset tasks. For the normalized datasets and code please visit {https://norface-fea.github.io/}.
Abstract:Occupancy prediction plays a pivotal role in autonomous driving (AD) due to the fine-grained geometric perception and general object recognition capabilities. However, existing methods often incur high computational costs, which contradicts the real-time demands of AD. To this end, we first evaluate the speed and memory usage of most public available methods, aiming to redirect the focus from solely prioritizing accuracy to also considering efficiency. We then identify a core challenge in achieving both fast and accurate performance: \textbf{the strong coupling between geometry and semantic}. To address this issue, 1) we propose a Geometric-Semantic Dual-Branch Network (GSDBN) with a hybrid BEV-Voxel representation. In the BEV branch, a BEV-level temporal fusion module and a U-Net encoder is introduced to extract dense semantic features. In the voxel branch, a large-kernel re-parameterized 3D convolution is proposed to refine sparse 3D geometry and reduce computation. Moreover, we propose a novel BEV-Voxel lifting module that projects BEV features into voxel space for feature fusion of the two branches. In addition to the network design, 2) we also propose a Geometric-Semantic Decoupled Learning (GSDL) strategy. This strategy initially learns semantics with accurate geometry using ground-truth depth, and then gradually mixes predicted depth to adapt the model to the predicted geometry. Extensive experiments on the widely-used Occ3D-nuScenes benchmark demonstrate the superiority of our method, which achieves a 39.4 mIoU with 20.0 FPS. This result is $\sim 3 \times$ faster and +1.9 mIoU higher compared to FB-OCC, the winner of CVPR2023 3D Occupancy Prediction Challenge. Our code will be made open-source.
Abstract:Artificial intelligence and machine learning frameworks have served as computationally efficient mapping between inputs and outputs for engineering problems. These mappings have enabled optimization and analysis routines that have warranted superior designs, ingenious material systems and optimized manufacturing processes. A common occurrence in such modeling endeavors is the existence of multiple source of data, each differentiated by fidelity, operating conditions, experimental conditions, and more. Data fusion frameworks have opened the possibility of combining such differentiated sources into single unified models, enabling improved accuracy and knowledge transfer. However, these frameworks encounter limitations when the different sources are heterogeneous in nature, i.e., not sharing the same input parameter space. These heterogeneous input scenarios can occur when the domains differentiated by complexity, scale, and fidelity require different parametrizations. Towards addressing this void, a heterogeneous multi-source data fusion framework is proposed based on input mapping calibration (IMC) and latent variable Gaussian process (LVGP). In the first stage, the IMC algorithm is utilized to transform the heterogeneous input parameter spaces into a unified reference parameter space. In the second stage, a multi-source data fusion model enabled by LVGP is leveraged to build a single source-aware surrogate model on the transformed reference space. The proposed framework is demonstrated and analyzed on three engineering case studies (design of cantilever beam, design of ellipsoidal void and modeling properties of Ti6Al4V alloy). The results indicate that the proposed framework provides improved predictive accuracy over a single source model and transformed but source unaware model.




Abstract:Recent advancements in Large Language Models (LLMs) have led to significant breakthroughs in various natural language processing tasks. However, generating factually consistent responses in knowledge-intensive scenarios remains a challenge due to issues such as hallucination, difficulty in acquiring long-tailed knowledge, and limited memory expansion. This paper introduces SMART, a novel multi-agent framework that leverages external knowledge to enhance the interpretability and factual consistency of LLM-generated responses. SMART comprises four specialized agents, each performing a specific sub-trajectory action to navigate complex knowledge-intensive tasks. We propose a multi-agent co-training paradigm, Long- and Short-Trajectory Learning, which ensures synergistic collaboration among agents while maintaining fine-grained execution by each agent. Extensive experiments on 5 tasks demonstrate SMART's superior performance compared to previous widely adopted methods.




Abstract:Existing Masked Image Modeling (MIM) depends on a spatial patch-based masking-reconstruction strategy to perceive objects'features from unlabeled images, which may face two limitations when applied to chest CT: 1) inefficient feature learning due to complex anatomical details presented in CT images, and 2) suboptimal knowledge transfer owing to input disparity between upstream and downstream models. To address these issues, we propose a new MIM method named Tissue-Contrastive Semi-Masked Autoencoder (TCS-MAE) for modeling chest CT images. Our method has two novel designs: 1) a tissue-based masking-reconstruction strategy to capture more fine-grained anatomical features, and 2) a dual-AE architecture with contrastive learning between the masked and original image views to bridge the gap of the upstream and downstream models. To validate our method, we systematically investigate representative contrastive, generative, and hybrid self-supervised learning methods on top of tasks involving segmenting pneumonia, mediastinal tumors, and various organs. The results demonstrate that, compared to existing methods, our TCS-MAE more effectively learns tissue-aware representations, thereby significantly enhancing segmentation performance across all tasks.

Abstract:The objective of face animation is to generate dynamic and expressive talking head videos from a single reference face, utilizing driving conditions derived from either video or audio inputs. Current approaches often require fine-tuning for specific identities and frequently fail to produce expressive videos due to the limited effectiveness of Wav2Pose modules. To facilitate the generation of one-shot and more consecutive talking head videos, we refine an existing Image2Video model by integrating a Face Locator and Motion Frame mechanism. We subsequently optimize the model using extensive human face video datasets, significantly enhancing its ability to produce high-quality and expressive talking head videos. Additionally, we develop a demo platform using the Gradio framework, which streamlines the process, enabling users to quickly create customized talking head videos.


Abstract:In this paper, we first give an introduction to the theoretical basis of the privacy-utility equilibrium in federated learning based on Bayesian privacy definitions and total variation distance privacy definitions. We then present the \textit{Learn-to-Distort-Data} framework, which provides a principled approach to navigate the privacy-utility equilibrium by explicitly modeling the distortion introduced by the privacy-preserving mechanism as a learnable variable and optimizing it jointly with the model parameters. We demonstrate the applicability of our framework to a variety of privacy-preserving mechanisms on the basis of data distortion and highlight its connections to related areas such as adversarial training, input robustness, and unlearnable examples. These connections enable leveraging techniques from these areas to design effective algorithms for privacy-utility equilibrium in federated learning under the \textit{Learn-to-Distort-Data} framework.




Abstract:Normalizing Flows (NFs) have gained popularity among deep generative models due to their ability to provide exact likelihood estimation and efficient sampling. However, a crucial limitation of NFs is their substantial memory requirements, arising from maintaining the dimension of the latent space equal to that of the input space. Multi-scale architectures bypass this limitation by progressively reducing the dimension of latent variables while ensuring reversibility. Existing multi-scale architectures split the latent variables in a simple, static manner at the channel level, compromising NFs' expressive power. To address this issue, we propose a regularized and feature-dependent $\mathtt{Shuffle}$ operation and integrate it into vanilla multi-scale architecture. This operation heuristically generates channel-wise weights and adaptively shuffles latent variables before splitting them with these weights. We observe that such operation guides the variables to evolve in the direction of entropy increase, hence we refer to NFs with the $\mathtt{Shuffle}$ operation as \emph{Entropy-Informed Weighting Channel Normalizing Flow} (EIW-Flow). Experimental results indicate that the EIW-Flow achieves state-of-the-art density estimation results and comparable sample quality on CIFAR-10, CelebA and ImageNet datasets, with negligible additional computational overhead.




Abstract:High-speed railway (HSR) communications are pivotal for ensuring rail safety, operations, maintenance, and delivering passenger information services. The high speed of trains creates rapidly time-varying wireless channels, increases the signaling overhead, and reduces the system throughput, making it difficult to meet the growing and stringent needs of HSR applications. In this article, we explore artificial intelligence (AI)-based beam-level and cell-level mobility management suitable for HSR communications, including the use cases, inputs, outputs, and key performance indicators (KPI)s of AI models. Particularly, in comparison to traditional down-sampling spatial beam measurements, we show that the compressed spatial multi-beam measurements via compressive sensing lead to improved spatial-temporal beam prediction. Moreover, we demonstrate the performance gains of AI-assisted cell handover over traditional mobile handover mechanisms. In addition, we observe that the proposed approaches to reduce the measurement overhead achieve comparable radio link failure performance with the traditional approach that requires all the beam measurements of all cells, while the former methods can save 50% beam measurement overhead.