Chao Zhang

Detecting Defective Wafers Via Modular Networks

Jan 06, 2025

Yifeng Zhang, Bryan Baker, Shi Chen, Chao Zhang, Yu Huang, Qi Zhao, Sthitie Bom

Abstract: The growing availability of sensors within semiconductor manufacturing processes makes it feasible to detect defective wafers with data-driven models. Without directly measuring the quality of semiconductor devices, these models capture the modalities between diverse sensor readings and can be used to predict key quality indicators (KQI, e.g., roughness, resistance) to detect faulty products, significantly reducing the capital and human cost in maintaining physical metrology steps. Nevertheless, existing models pay little attention to the correlations among different processes for diverse wafer products and commonly struggle with generalizability issues. To enable generic fault detection, in this work, we propose a modular network (MN) trained using time series stage-wise datasets that embodies the structure of the manufacturing process. It decomposes KQI prediction as a combination of stage modules to simulate compositional semiconductor manufacturing, universally enhancing faulty wafer detection among different wafer types and manufacturing processes. Extensive experiments demonstrate the usefulness of our approach and shed light on how the compositional design provides an interpretable interface for more practical applications.

FedRSClip: Federated Learning for Remote Sensing Scene Classification Using Vision-Language Models

Jan 05, 2025

Hui Lin, Chao Zhang, Danfeng Hong, Kexin Dong, Congcong Wen

Abstract: Remote sensing data is often distributed across multiple institutions, and due to privacy concerns and data-sharing restrictions, leveraging large-scale datasets in a centralized training framework is challenging. Federated learning offers a promising solution by enabling collaborative model training across distributed data sources without requiring data centralization. However, current Vision-Language Models (VLMs), which typically contain billions of parameters, pose significant communication challenges for traditional federated learning approaches based on model parameter updates, as they would incur substantial communication costs. In this paper, we propose FedRSCLIP, the first federated learning framework designed for remote sensing image classification based on a VLM, specifically CLIP. FedRSCLIP addresses the challenges of data heterogeneity and large-scale model transmission in federated environments by introducing Prompt Learning, which optimizes only a small set of tunable parameters. The framework introduces a dual-prompt mechanism, comprising Shared Prompts for global knowledge sharing and Private Prompts for client-specific adaptation. To maintain semantic coherence between shared and private prompts, we propose the Dual Prompt Alignment Constraint to balance global consistency and local adaptability across diverse client distributions. Additionally, to enhance cross-modal representation learning, we introduce the Cross-Modal Feature Alignment Constraint to align multimodal features between text and image prompts. To validate the effectiveness of our proposed model, we construct a Fed-RSIC dataset based on three existing remote sensing image classification datasets, specifically designed to simulate various federated learning configurations. Experimental results demonstrate the effectiveness and superiority of FedRSCLIP in remote sensing image classification.
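
The dual-prompt idea above can be illustrated with a minimal sketch of the server step: only the small shared prompt vectors are exchanged, while private prompts never leave the client. The abstract does not state how shared prompts are combined, so the FedAvg-style weighted average below is an assumption, and the function name, toy vectors, and client sizes are hypothetical.

```python
from typing import List

Vector = List[float]


def aggregate_shared_prompts(client_prompts: List[Vector],
                             client_sizes: List[int]) -> Vector:
    """Weighted average of per-client shared prompts.

    Assumption: FedAvg-style weighting by local dataset size; the paper's
    actual aggregation rule is not given in the abstract.
    """
    total = sum(client_sizes)
    dim = len(client_prompts[0])
    return [
        sum(prompt[i] * size for prompt, size in zip(client_prompts, client_sizes)) / total
        for i in range(dim)
    ]


# Hypothetical toy data: three clients, each holding a 4-dimensional shared prompt.
shared = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 2.0, 0.0],
    [2.0, 4.0, 2.0, 2.0],
]
sizes = [100, 100, 200]

# Private prompts are deliberately absent here: they stay on each client.
global_prompt = aggregate_shared_prompts(shared, sizes)
print(global_prompt)
```

The point of the sketch is the communication pattern: the payload per round is one short vector per client rather than the billions of parameters of the full VLM.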

An Engorgio Prompt Makes Large Language Model Babble on

Dec 27, 2024

Jianshuo Dong, Ziyuan Zhang, Qingjie Zhang, Han Qiu, Tianwei Zhang, Hao Wang, Hewu Li, Qi Li, Chao Zhang, Ke Xu

Abstract: Auto-regressive large language models (LLMs) have yielded impressive performance in many real-world tasks. However, the new paradigm of these LLMs also exposes novel threats. In this paper, we explore their vulnerability to inference cost attacks, where a malicious user crafts Engorgio prompts to intentionally increase the computation cost and latency of the inference process. We design Engorgio, a novel methodology, to efficiently generate adversarial Engorgio prompts to affect the target LLM's service availability. Engorgio has the following two technical contributions. (1) We employ a parameterized distribution to track LLMs' prediction trajectory. (2) Targeting the auto-regressive nature of LLMs' inference process, we propose novel loss functions to stably suppress the appearance of the <EOS> token, whose occurrence will interrupt the LLM's generation process. We conduct extensive experiments on 13 open-sourced LLMs with parameters ranging from 125M to 30B. The results show that Engorgio prompts can successfully induce LLMs to generate abnormally long outputs (i.e., roughly 2-13× longer to reach 90%+ of the output length limit) in a white-box scenario and our real-world experiment demonstrates Engorgio's threat to LLM service with limited computing resources. The code is accessible at https://github.com/jianshuod/Engorgio-prompt.
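
To make the second contribution concrete, here is a minimal sketch of one way an <EOS>-suppression objective could be measured: the mean probability assigned to the <EOS> token over all output positions. This is an illustrative stand-in, not the loss defined in the paper; the function names, the toy logits, and the three-token vocabulary are all hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one position's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_eos_probability(logits_per_position: List[List[float]], eos_id: int) -> float:
    """Average probability of <EOS> across positions.

    Assumption: minimizing this quantity is one simple way to discourage
    early termination; the paper's actual loss functions may differ.
    """
    probs = [softmax(row)[eos_id] for row in logits_per_position]
    return sum(probs) / len(probs)


# Hypothetical toy vocabulary of size 3, with <EOS> at index 2.
eos_likely = [[0.0, 0.0, 5.0], [0.0, 0.0, 5.0]]
eos_unlikely = [[5.0, 0.0, -5.0], [0.0, 5.0, -5.0]]

print(mean_eos_probability(eos_likely, eos_id=2))
print(mean_eos_probability(eos_unlikely, eos_id=2))
```

A prompt whose induced logits look like `eos_unlikely` at every step keeps generation running until the output length limit is hit, which is the behavior the attack targets.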

Space-Air-Ground Integrated Networks: Their Channel Model and Performance Analysis

Dec 21, 2024

Chao Zhang, Qingchao Li, Chao Xu, Lie-Liang Yang, Lajos Hanzo

Abstract: Given their extensive geographic coverage, low Earth orbit (LEO) satellites are envisioned to find their way into next-generation (6G) wireless communications. This paper explores space-air-ground integrated networks (SAGINs) leveraging LEOs to support terrestrial and non-terrestrial users. We first propose a practical satellite-ground channel model that incorporates five key aspects: 1) the small-scale fading characterized by the Shadowed-Rician distribution in terms of the Rician factor K, 2) the path loss effect of bending rays due to atmospheric refraction, 3) the molecular absorption modelled by the Beer-Lambert law, 4) the Doppler effects including the Earth's rotation, and 5) the impact of weather conditions according to the International Telecommunication Union Recommendations (ITU-R). Harnessing the proposed model, we analyze the long-term performance of the SAGIN considered. Explicitly, the closed-form expressions of both the outage probability and of the ergodic rates are derived. Additionally, the upper bounds of bit-error rates and of the Goodput are investigated. The numerical results yield the following insights: 1) The shadowing effect and the ratio between the line-of-sight and scattering components can be conveniently modeled by the factors of K and m in the proposed Shadowed-Rician small-scale fading model. 2) The atmospheric refraction has a modest effect on the path loss. 3) When calculating the transmission distance of waves, Earth's curvature and its geometric relationship with the satellites must be considered, particularly at small elevation angles. 4) High-frequency carriers suffer from substantial path loss. 5) The Goodput metric is eminently suitable for characterizing the performance of different coding as well as modulation methods and of the estimation error of the Doppler effects.
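
Two ingredients named in the abstract have simple textbook forms that can be worked numerically: the Beer-Lambert law for molecular absorption (transmittance exp(-k d)) and the frequency dependence of path loss. The sketch below uses the standard free-space path loss formula as a simplification; it ignores the refraction, fading, Doppler, and weather terms of the proposed model. The link distance, carrier frequencies, and absorption coefficient are hypothetical values chosen only for illustration.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def free_space_path_loss_db(distance_m: float, freq_hz: float) -> float:
    """Free-space path loss: 20 * log10(4 * pi * d * f / c)."""
    return 20.0 * math.log10(4.0 * math.pi * distance_m * freq_hz / SPEED_OF_LIGHT)


def absorption_loss_db(k_per_m: float, distance_m: float) -> float:
    """Beer-Lambert absorption expressed in dB.

    Transmittance is exp(-k * d), so the loss is -10 * log10(exp(-k * d)),
    which equals roughly 4.343 * k * d.
    """
    return -10.0 * math.log10(math.exp(-k_per_m * distance_m))


# Hypothetical link: 600 km slant range, two carrier frequencies.
distance = 600e3
for freq in (2e9, 20e9):
    print(f"{freq / 1e9:.0f} GHz: {free_space_path_loss_db(distance, freq):.1f} dB")

# Hypothetical absorption coefficient over a 100 km absorbing path.
print(f"absorption: {absorption_loss_db(1e-5, 100e3):.2f} dB")
```

Because free-space loss grows by 20 dB per decade of frequency, the tenfold step from 2 GHz to 20 GHz costs 20 dB before any absorption is counted, which is consistent with the insight that high-frequency carriers suffer substantial path loss.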

LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync

Dec 12, 2024

Chunyu Li, Chao Zhang, Weikai Xu, Jinghui Xie, Weiguo Feng, Bingyue Peng, Weiwei Xing

Abstract: We present LatentSync, an end-to-end lip sync framework based on audio conditioned latent diffusion models without any intermediate motion representation, diverging from previous diffusion-based lip sync methods based on pixel space diffusion or two-stage generation. Our framework can leverage the powerful capabilities of Stable Diffusion to directly model complex audio-visual correlations. Additionally, we found that the diffusion-based lip sync methods exhibit inferior temporal consistency due to the inconsistency in the diffusion process across different frames. We propose Temporal REPresentation Alignment (TREPA) to enhance temporal consistency while preserving lip-sync accuracy. TREPA uses temporal representations extracted by large-scale self-supervised video models to align the generated frames with the ground truth frames. Furthermore, we observe the commonly encountered SyncNet convergence issue and conduct comprehensive empirical studies, identifying key factors affecting SyncNet convergence in terms of model architecture, training hyperparameters, and data preprocessing methods. We significantly improve the accuracy of SyncNet from 91% to 94% on the HDTF test set. Since we did not change the overall training framework of SyncNet, our experience can also be applied to other lip sync and audio-driven portrait animation methods that utilize SyncNet. Based on the above innovations, our method outperforms state-of-the-art lip sync methods across various metrics on the HDTF and VoxCeleb2 datasets.

Automated Dynamic Image Analysis for Particle Size and Shape Classification in Three Dimensions

Dec 06, 2024

Sadegh Nadimi, Vasileios Angelidakis, Sadaf Maramizonouz, Chao Zhang

Abstract: We introduce OCULAR, an innovative hardware and software solution for three-dimensional dynamic image analysis of fine particles. Current state-of-the-art instruments for dynamic image analysis are largely limited to two-dimensional imaging. However, extensive literature has demonstrated that relying on a single two-dimensional projection for particle characterisation can lead to inaccuracies in many applications. Existing three-dimensional imaging technologies, such as computed tomography, laser scanning, and orthophotography, are limited to static objects. These methods are often not statistically representative and come with significant post-processing requirements, as well as the need for specialised imaging and computing resources. OCULAR addresses these challenges by providing a cost-effective solution for imaging continuous particle streams using a synchronised array of optical cameras. Particle shape characterisation is achieved through the reconstruction of their three-dimensional surfaces. This paper details the OCULAR methodology, evaluates its reproducibility, and compares its results against X-ray micro computed tomography, highlighting its potential for efficient and reliable particle analysis.

* 11 pages, 5 figures

Implicit Priors Editing in Stable Diffusion via Targeted Token Adjustment

Dec 04, 2024

Feng He, Chao Zhang, Zhixue Zhao

Abstract: Implicit assumptions and priors are often necessary in text-to-image generation tasks, especially when textual prompts lack sufficient context. However, these assumptions can sometimes reflect outdated concepts, inaccuracies, or societal bias embedded in the training data. We present Embedding-only Editing (Embedit), a method designed to efficiently adjust implicit assumptions and priors in the model without affecting its interpretation of unrelated objects or overall performance. Given a "source" prompt (e.g., "rose") that elicits an implicit assumption (e.g., rose is red) and a "destination" prompt that specifies the desired attribute (e.g., "blue rose"), Embedit fine-tunes only the word token embedding (WTE) of the target object ("rose") to optimize the last hidden state of the text encoder in Stable Diffusion, a SOTA text-to-image model. This targeted adjustment prevents unintended effects on other objects in the model's knowledge base, as the WTEs for unrelated objects and the model weights remain unchanged. Consequently, when a prompt does not contain the edited object, all representations and model outputs are identical to those of the original, unedited model. Our method is highly efficient, modifying only 768 parameters for Stable Diffusion 1.4 and 2048 for XL in a single edit, matching the WTE dimension of each respective model. This minimal scope, combined with rapid execution, makes Embedit highly practical for real-world applications. Additionally, changes are easily reversible by restoring the original WTE layers. Our experimental results demonstrate that Embedit consistently outperforms previous methods across various models, tasks, and editing scenarios (both single and sequential multiple edits), achieving at least a 6.01% improvement (from 87.17% to 93.18%).
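
The locality and reversibility claims above can be illustrated with a toy sketch in which exactly one row of an embedding table is optimized and every other row is left untouched. This is a heavy simplification: the real method optimizes the last hidden state of the Stable Diffusion text encoder, whereas the sketch assumes the "encoder" is the identity and uses a plain squared-error objective. The table contents, the 4-dimensional embeddings (768 in Stable Diffusion 1.4 per the abstract), and the function name are hypothetical.

```python
from typing import Dict, List


def edit_single_embedding(table: Dict[str, List[float]], token: str,
                          target: List[float], lr: float = 0.5,
                          steps: int = 50) -> List[float]:
    """Gradient descent on one embedding row only; returns a backup for undo.

    Assumption: the objective is 0.5 * ||row - target||^2, a stand-in for
    matching the text encoder's last hidden state.
    """
    original = list(table[token])
    row = table[token]
    for _ in range(steps):
        for i in range(len(row)):
            # Gradient of the squared error with respect to row[i].
            row[i] -= lr * (row[i] - target[i])
    return original


# Hypothetical toy table: two tokens, 4-dimensional embeddings.
table = {"rose": [1.0, 0.0, 0.0, 0.0], "tulip": [0.0, 1.0, 0.0, 0.0]}
target = [0.0, 0.0, 1.0, 0.0]  # stands in for the "blue rose" representation

backup = edit_single_embedding(table, "rose", target)
edited = list(table["rose"])
print("edited rose:", edited)
print("tulip untouched:", table["tulip"])

# Reverting the edit is a single row restore.
table["rose"] = backup
```

The number of modified parameters equals the embedding dimension, which is why a single edit touches 768 values in Stable Diffusion 1.4 and 2048 in XL.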

SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation

Nov 27, 2024

Wenyi Yu, Siyin Wang, Xiaoyu Yang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, Chao Zhang

Abstract: Full-duplex multimodal large language models (LLMs) provide a unified framework for addressing diverse speech understanding and generation tasks, enabling more natural and seamless human-machine conversations. Unlike traditional modularised conversational AI systems, which separate speech recognition, understanding, and text-to-speech generation into distinct components, multimodal LLMs operate as single end-to-end models. This streamlined design eliminates error propagation across components and fully leverages the rich non-verbal information embedded in input speech signals. We introduce SALMONN-omni, a codec-free, full-duplex speech understanding and generation model capable of simultaneously listening to its own generated speech and background sounds while speaking. To support this capability, we propose a novel duplex spoken dialogue framework incorporating a "thinking" mechanism that facilitates asynchronous text and speech generation relying on embeddings instead of codecs (quantized speech and audio tokens). Experimental results demonstrate SALMONN-omni's versatility across a broad range of streaming speech tasks, including speech recognition, speech enhancement, and spoken question answering. Additionally, SALMONN-omni excels at managing turn-taking, barge-in, and echo cancellation scenarios, establishing its potential as a robust prototype for full-duplex conversational AI systems. To the best of our knowledge, SALMONN-omni is the first codec-free model of its kind. A full technical report along with model checkpoints will be released soon.

* Technical report

Self-Generated Critiques Boost Reward Modeling for Language Models

Nov 25, 2024

Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Richard Yuanzhe Pang, Yundi Qian, Xuewei Wang, Suchin Gururangan, Chao Zhang(+3 more)

Abstract: Reward modeling is crucial for aligning large language models (LLMs) with human preferences, especially in reinforcement learning from human feedback (RLHF). However, current reward models mainly produce scalar scores and struggle to incorporate critiques in a natural language format. We hypothesize that predicting both critiques and the scalar reward would improve reward modeling ability. Motivated by this, we propose Critic-RM, a framework that improves reward models using self-generated critiques without extra supervision. Critic-RM employs a two-stage process: generating and filtering high-quality critiques, followed by joint fine-tuning on reward prediction and critique generation. Experiments across benchmarks show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges, demonstrating strong performance and data efficiency. Additional studies further validate the effectiveness of generated critiques in rectifying flawed reasoning steps, with gains of 2.5%-3.2% in reasoning accuracy.
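
The joint fine-tuning stage combines two objectives, which can be sketched as a weighted sum of a reward term and a critique-generation term. The abstract does not give the exact losses, so the sketch assumes a pairwise Bradley-Terry reward loss (a common choice for reward models) and a mean token-level negative log-likelihood for critiques; the weight `lam`, the function names, and the toy numbers are hypothetical.

```python
import math
from typing import List


def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected)).

    Assumption: the paper's reward objective may differ from this form.
    """
    return math.log1p(math.exp(-(r_chosen - r_rejected)))


def critique_nll(token_log_probs: List[float]) -> float:
    """Mean negative log-likelihood of the critique tokens."""
    return -sum(token_log_probs) / len(token_log_probs)


def joint_loss(r_chosen: float, r_rejected: float,
               token_log_probs: List[float], lam: float = 0.5) -> float:
    """Reward loss plus a weighted critique-generation loss."""
    return pairwise_reward_loss(r_chosen, r_rejected) + lam * critique_nll(token_log_probs)


# Hypothetical toy values: scalar rewards and per-token critique log-probs.
log_probs = [math.log(0.5), math.log(0.5)]
print(joint_loss(r_chosen=2.0, r_rejected=0.0, token_log_probs=log_probs))
print(joint_loss(r_chosen=0.0, r_rejected=2.0, token_log_probs=log_probs))
```

The reward term falls when the preferred response is scored higher, while the critique term keeps the model able to produce the natural-language critiques it conditions on.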

* 20 pages

ScalingNote: Scaling up Retrievers with Large Language Models for Real-World Dense Retrieval

Nov 24, 2024

Suyuan Huang, Chao Zhang, Yuanyuan Wu, Haoxin Zhang, Yuan Wang, Maolin Wang, Shaosheng Cao, Tong Xu, Xiangyu Zhao, Zengchang Qin(+5 more)

Abstract: Dense retrieval in most industries employs dual-tower architectures to retrieve query-relevant documents. Due to online deployment requirements, existing real-world dense retrieval systems mainly enhance performance by designing negative sampling strategies, overlooking the advantages of scaling up. Recently, Large Language Models (LLMs) have exhibited superior performance that can be leveraged for scaling up dense retrieval. However, scaling up retrieval models significantly increases online query latency. To address this challenge, we propose ScalingNote, a two-stage method to exploit the scaling potential of LLMs for retrieval while maintaining online query latency. The first stage is training dual towers, both initialized from the same LLM, to unlock the potential of LLMs for dense retrieval. Then, we distill only the query tower using mean squared error loss and cosine similarity to reduce online costs. Through theoretical analysis and comprehensive offline and online experiments, we show the effectiveness and efficiency of ScalingNote. Our two-stage scaling method outperforms end-to-end models and verifies the scaling law of dense retrieval with LLMs in industrial scenarios, enabling cost-effective scaling of dense retrieval systems. Our online method incorporating ScalingNote significantly enhances the relevance between retrieved documents and queries.
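
The second stage distills the query tower with mean squared error and cosine similarity, which can be sketched as a loss between a small student's query embedding and the LLM teacher's. How the two terms are combined is not stated in the abstract, so the additive form with weight `alpha` is an assumption, and the function names and toy embeddings are hypothetical.

```python
import math
from typing import List


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two embeddings of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def query_distill_loss(student: List[float], teacher: List[float],
                       alpha: float = 1.0) -> float:
    """MSE plus a cosine term that is zero when the directions match.

    Assumption: additive combination with weight alpha.
    """
    return mse(student, teacher) + alpha * (1.0 - cosine_similarity(student, teacher))


# Hypothetical toy query embeddings.
teacher_emb = [0.6, 0.8]
print(query_distill_loss([0.6, 0.8], teacher_emb))  # matching student
print(query_distill_loss([1.0, 0.0], [0.0, 1.0]))   # orthogonal student
```

Only the query tower needs to be small because it runs online for every request; document embeddings from the large tower can be computed offline.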
