Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hewu Li

Speculating LLMs' Chinese Training Data Pollution from Their Tokens

Aug 25, 2025

Qingjie Zhang, Di Wang, Haoting Qian, Liu Yan, Tianwei Zhang, Ke Xu, Qi Li, Minlie Huang, Hewu Li, Han Qiu

Figure 1 for Speculating LLMs' Chinese Training Data Pollution from Their Tokens

Figure 2 for Speculating LLMs' Chinese Training Data Pollution from Their Tokens

Figure 3 for Speculating LLMs' Chinese Training Data Pollution from Their Tokens

Figure 4 for Speculating LLMs' Chinese Training Data Pollution from Their Tokens

Abstract:Tokens are basic elements in the datasets for LLM training. It is well-known that many tokens representing Chinese phrases in the vocabulary of GPT (4o/4o-mini/o1/o3/4.5/4.1/o4-mini) are indicating contents like pornography or online gambling. Based on this observation, our goal is to locate Polluted Chinese (PoC) tokens in LLMs and study the relationship between PoC tokens' existence and training data. (1) We give a formal definition and taxonomy of PoC tokens based on the GPT's vocabulary. (2) We build a PoC token detector via fine-tuning an LLM to label PoC tokens in vocabularies by considering each token's both semantics and related contents from the search engines. (3) We study the speculation on the training data pollution via PoC tokens' appearances (token ID). Experiments on GPT and other 23 LLMs indicate that tokens widely exist while GPT's vocabulary behaves the worst: more than 23% long Chinese tokens (i.e., a token with more than two Chinese characters) are either porn or online gambling. We validate the accuracy of our speculation method on famous pre-training datasets like C4 and Pile. Then, considering GPT-4o, we speculate that the ratio of "Yui Hatano" related webpages in GPT-4o's training data is around 0.5%.

Via

Access Paper or Ask Questions

An Engorgio Prompt Makes Large Language Model Babble on

Dec 27, 2024

Jianshuo Dong, Ziyuan Zhang, Qingjie Zhang, Han Qiu, Tianwei Zhang, Hao Wang, Hewu Li, Qi Li, Chao Zhang, Ke Xu

Figure 1 for An Engorgio Prompt Makes Large Language Model Babble on

Figure 2 for An Engorgio Prompt Makes Large Language Model Babble on

Figure 3 for An Engorgio Prompt Makes Large Language Model Babble on

Figure 4 for An Engorgio Prompt Makes Large Language Model Babble on

Abstract:Auto-regressive large language models (LLMs) have yielded impressive performance in many real-world tasks. However, the new paradigm of these LLMs also exposes novel threats. In this paper, we explore their vulnerability to inference cost attacks, where a malicious user crafts Engorgio prompts to intentionally increase the computation cost and latency of the inference process. We design Engorgio, a novel methodology, to efficiently generate adversarial Engorgio prompts to affect the target LLM's service availability. Engorgio has the following two technical contributions. (1) We employ a parameterized distribution to track LLMs' prediction trajectory. (2) Targeting the auto-regressive nature of LLMs' inference process, we propose novel loss functions to stably suppress the appearance of the <EOS> token, whose occurrence will interrupt the LLM's generation process. We conduct extensive experiments on 13 open-sourced LLMs with parameters ranging from 125M to 30B. The results show that Engorgio prompts can successfully induce LLMs to generate abnormally long outputs (i.e., roughly 2-13$\times$ longer to reach 90%+ of the output length limit) in a white-box scenario and our real-world experiment demonstrates Engergio's threat to LLM service with limited computing resources. The code is accessible at https://github.com/jianshuod/Engorgio-prompt.

Via

Access Paper or Ask Questions

COSMIC: Compress Satellite Images Efficiently via Diffusion Compensation

Oct 02, 2024

Ziyuan Zhang, Han Qiu, Maosen Zhang, Jun Liu, Bin Chen, Tianwei Zhang, Hewu Li

Figure 1 for COSMIC: Compress Satellite Images Efficiently via Diffusion Compensation

Figure 2 for COSMIC: Compress Satellite Images Efficiently via Diffusion Compensation

Figure 3 for COSMIC: Compress Satellite Images Efficiently via Diffusion Compensation

Figure 4 for COSMIC: Compress Satellite Images Efficiently via Diffusion Compensation

Abstract:With the rapidly increasing number of satellites in space and their enhanced capabilities, the amount of earth observation images collected by satellites is exceeding the transmission limits of satellite-to-ground links. Although existing learned image compression solutions achieve remarkable performance by using a sophisticated encoder to extract fruitful features as compression and using a decoder to reconstruct, it is still hard to directly deploy those complex encoders on current satellites' embedded GPUs with limited computing capability and power supply to compress images in orbit. In this paper, we propose COSMIC, a simple yet effective learned compression solution to transmit satellite images. We first design a lightweight encoder (i.e. reducing FLOPs by $2.6\sim 5\times $) on satellite to achieve a high image compression ratio to save satellite-to-ground links. Then, for reconstructions on the ground, to deal with the feature extraction ability degradation due to simplifying encoders, we propose a diffusion-based model to compensate image details when decoding. Our insight is that satellite's earth observation photos are not just images but indeed multi-modal data with a nature of Text-to-Image pairing since they are collected with rich sensor data (e.g. coordinates, timestamp, etc.) that can be used as the condition for diffusion generation. Extensive experiments show that COSMIC outperforms state-of-the-art baselines on both perceptual and distortion metrics.

Via

Access Paper or Ask Questions

In-Orbit Processing or Not? Sunlight-Aware Task Scheduling for Energy-Efficient Space Edge Computing Networks

Jul 10, 2024

Weisen Liu, Zeqi Lai, Qian Wu, Hewu Li, Qi Zhang, Zonglun Li, Yuanjie Li, Jun Liu

Figure 1 for In-Orbit Processing or Not? Sunlight-Aware Task Scheduling for Energy-Efficient Space Edge Computing Networks

Figure 2 for In-Orbit Processing or Not? Sunlight-Aware Task Scheduling for Energy-Efficient Space Edge Computing Networks

Figure 3 for In-Orbit Processing or Not? Sunlight-Aware Task Scheduling for Energy-Efficient Space Edge Computing Networks

Figure 4 for In-Orbit Processing or Not? Sunlight-Aware Task Scheduling for Energy-Efficient Space Edge Computing Networks

Abstract:With the rapid evolution of space-borne capabilities, space edge computing (SEC) is becoming a new computation paradigm for future integrated space and terrestrial networks. Satellite edges adopt advanced on-board hardware, which not only enables new opportunities to perform complex intelligent tasks in orbit, but also involves new challenges due to the additional energy consumption in power-constrained space environment. In this paper, we present PHOENIX, an energy-efficient task scheduling framework for emerging SEC networks. PHOENIX exploits a key insight that in the SEC network, there always exist a number of sunlit edges which are illuminated during the entire orbital period and have sufficient energy supplement from the sun. PHOENIX accomplishes energy-efficient in-orbit computing by judiciously offloading space tasks to "sunlight-sufficient" edges or to the ground. Specifically, PHOENIX first formulates the SEC battery energy optimizing (SBEO) problem which aims at minimizing the average battery energy consumption while satisfying various task completion constraints. Then PHOENIX incorporates a sunlight-aware scheduling mechanism to solve the SBEO problem and schedule SEC tasks efficiently. Finally, we implement a PHOENIX prototype and build an SEC testbed. Extensive data-driven evaluations demonstrate that as compared to other state-of-the-art solutions, PHOENIX can effectively reduce up to 54.8% SEC battery energy consumption and prolong battery lifetime to 2.9$\times$ while still completing tasks on time.

* Accepted by IEEE INFOCOM 2024

Via

Access Paper or Ask Questions

Mind Your Heart: Stealthy Backdoor Attack on Dynamic Deep Neural Network in Edge Computing

Dec 22, 2022

Tian Dong, Ziyuan Zhang, Han Qiu, Tianwei Zhang, Hewu Li, Terry Wang

Abstract:Transforming off-the-shelf deep neural network (DNN) models into dynamic multi-exit architectures can achieve inference and transmission efficiency by fragmenting and distributing a large DNN model in edge computing scenarios (e.g., edge devices and cloud servers). In this paper, we propose a novel backdoor attack specifically on the dynamic multi-exit DNN models. Particularly, we inject a backdoor by poisoning one DNN model's shallow hidden layers targeting not this vanilla DNN model but only its dynamically deployed multi-exit architectures. Our backdoored vanilla model behaves normally on performance and cannot be activated even with the correct trigger. However, the backdoor will be activated when the victims acquire this model and transform it into a dynamic multi-exit architecture at their deployment. We conduct extensive experiments to prove the effectiveness of our attack on three structures (ResNet-56, VGG-16, and MobileNet) with four datasets (CIFAR-10, SVHN, GTSRB, and Tiny-ImageNet) and our backdoor is stealthy to evade multiple state-of-the-art backdoor detection or removal methods.

* Accepted to IEEE INFOCOM 2023

Via

Access Paper or Ask Questions

Fingerprinting Multi-exit Deep Neural Network Models via Inference Time

Oct 07, 2021

Tian Dong, Han Qiu, Tianwei Zhang, Jiwei Li, Hewu Li, Jialiang Lu

Figure 1 for Fingerprinting Multi-exit Deep Neural Network Models via Inference Time

Figure 2 for Fingerprinting Multi-exit Deep Neural Network Models via Inference Time

Figure 3 for Fingerprinting Multi-exit Deep Neural Network Models via Inference Time

Figure 4 for Fingerprinting Multi-exit Deep Neural Network Models via Inference Time

Abstract:Transforming large deep neural network (DNN) models into the multi-exit architectures can overcome the overthinking issue and distribute a large DNN model on resource-constrained scenarios (e.g. IoT frontend devices and backend servers) for inference and transmission efficiency. Nevertheless, intellectual property (IP) protection for the multi-exit models in the wild is still an unsolved challenge. Previous efforts to verify DNN model ownership mainly rely on querying the model with specific samples and checking the responses, e.g., DNN watermarking and fingerprinting. However, they are vulnerable to adversarial settings such as adversarial training and are not suitable for the IP verification for multi-exit DNN models. In this paper, we propose a novel approach to fingerprint multi-exit models via inference time rather than inference predictions. Specifically, we design an effective method to generate a set of fingerprint samples to craft the inference process with a unique and robust inference time cost as the evidence for model ownership. We conduct extensive experiments to prove the uniqueness and robustness of our method on three structures (ResNet-56, VGG-16, and MobileNet) and three datasets (CIFAR-10, CIFAR-100, and Tiny-ImageNet) under comprehensive adversarial settings.

Via

Access Paper or Ask Questions