
Cong Yu


Are Large Language Models Really Robust to Word-Level Perturbations?

Sep 20, 2023
Haoyu Wang, Guozheng Ma, Cong Yu, Ning Gui, Linrui Zhang, Zhiqi Huang, Suwei Ma, Yongzhe Chang, Sen Zhang, Li Shen, Xueqian Wang, Peilin Zhao, Dacheng Tao

The swift advancement in the scale and capabilities of Large Language Models (LLMs) positions them as promising tools for a variety of downstream tasks. Beyond pursuing better performance and avoiding harmful responses to a given prompt, ensuring that LLMs behave responsibly has drawn much attention to their robustness. However, existing evaluation methods mostly rely on traditional question-answering datasets with predefined supervised labels, which do not align with the superior generation capabilities of contemporary LLMs. To address this issue, we propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools to evaluate the robustness of LLMs, which we refer to as the Reward Model for Reasonable Robustness Evaluation (TREvaL). Our extensive empirical experiments demonstrate that TREvaL provides an accurate method for evaluating the robustness of an LLM, especially when faced with more challenging open questions. Furthermore, our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations, which are commonplace in daily language usage. Notably, we were surprised to discover that robustness tends to decrease as fine-tuning (SFT and RLHF) is conducted. The code of TREvaL is available at https://github.com/Harry-mic/TREval.
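To make the evaluation loop concrete, here is a minimal sketch of the kind of procedure the abstract describes: perturb prompts at the word level, query the model, and compare reward-model scores before and after perturbation. The perturbation operator, and the `generate_fn`/`reward_fn` callables standing in for the LLM and the pre-trained reward model, are illustrative assumptions rather than TREvaL's actual implementation.

```python
import random
from typing import Callable, Iterable


def word_level_perturb(prompt: str, swap_prob: float = 0.1, seed: int = 0) -> str:
    """Inject simple word-level noise (adjacent-character swaps) to mimic
    everyday typos; TREvaL's exact perturbation operators may differ."""
    rng = random.Random(seed)
    words = []
    for w in prompt.split():
        if len(w) > 3 and rng.random() < swap_prob:
            i = rng.randrange(len(w) - 1)
            w = w[:i] + w[i + 1] + w[i] + w[i + 2:]
        words.append(w)
    return " ".join(words)


def robustness_gap(prompts: Iterable[str],
                   generate_fn: Callable[[str], str],
                   reward_fn: Callable[[str, str], float],
                   n_perturbations: int = 5) -> float:
    """Average drop in reward-model score when prompts are perturbed:
    a larger gap suggests the LLM is less robust to word-level noise."""
    gaps = []
    for p in prompts:
        clean = reward_fn(p, generate_fn(p))
        noisy = [reward_fn(q, generate_fn(q))
                 for q in (word_level_perturb(p, seed=k) for k in range(n_perturbations))]
        gaps.append(clean - sum(noisy) / len(noisy))
    return sum(gaps) / len(gaps)
```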


Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization

Nov 03, 2022
Junru Wu, Yi Liang, Feng Han, Hassan Akbari, Zhangyang Wang, Cong Yu


Self-supervised pre-training has recently demonstrated success on large-scale multimodal data, and state-of-the-art contrastive learning methods often enforce feature consistency across cross-modality inputs, such as video/audio or video/text pairs. Despite being convenient to formulate and leverage in practice, such cross-modality alignment (CMA) is only a weak and noisy form of supervision, since two modalities can be semantically misaligned even when they are temporally aligned. For example, even in commonly adopted instructional videos, a speaker can sometimes refer to something that is not visually present in the current frame, and the semantic misalignment is only more unpredictable for raw videos from the internet. We conjecture that this causes conflicts and biases among modalities and may hence prevent CMA from scaling up to training with larger and more heterogeneous data. This paper first verifies our conjecture by observing that, even in the latest VATT pre-training using only instructional videos, there exist strong gradient conflicts between different CMA losses within the same (video, audio, text) triplet, indicating that they are a noisy source of supervision. We then propose to harmonize such gradients via two techniques: (i) cross-modality gradient realignment: modifying the different CMA loss gradients for each sample triplet so that their gradient directions are more aligned; and (ii) gradient-based curriculum learning: leveraging the gradient conflict information as an indicator of sample noisiness to develop a curriculum learning strategy that prioritizes training on less noisy sample triplets. Applying these techniques to pre-training VATT on the HowTo100M dataset, we consistently improve its performance on different downstream tasks. Moreover, we are able to scale VATT pre-training to the more complicated, non-narrative YouTube8M dataset to further improve the state of the art.
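As a rough illustration of what harmonizing conflicting gradients can look like, the sketch below applies a PCGrad-style projection when two CMA loss gradients point in opposing directions, and uses the degree of conflict as a noisiness score for curriculum ordering. This is a simplification under stated assumptions, not the paper's exact realignment rule.

```python
import torch
import torch.nn.functional as F


def realign(g1: torch.Tensor, g2: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """If two flattened CMA loss gradients conflict (negative inner product),
    project each onto the normal plane of the other (PCGrad-style)."""
    def project(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        dot = torch.dot(a, b)
        if dot < 0:
            a = a - (dot / (b.norm() ** 2 + 1e-12)) * b
        return a
    return project(g1.clone(), g2), project(g2.clone(), g1)


def conflict_score(g1: torch.Tensor, g2: torch.Tensor) -> float:
    """Higher when gradients disagree; can be used to rank sample triplets so a
    curriculum prioritizes less noisy (less conflicting) ones."""
    return float(-F.cosine_similarity(g1, g2, dim=0))
```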

* Accepted at NeurIPS 2022 

All Birds with One Stone: Multi-task Text Classification for Efficient Inference with One Forward Pass

May 22, 2022
Jiaxin Huang, Tianqi Liu, Jialu Liu, Adam D. Lelkes, Cong Yu, Jiawei Han


Multi-Task Learning (MTL) models have shown their robustness, effectiveness, and efficiency for transferring learned knowledge across tasks. In real industrial applications such as web content classification, multiple classification tasks are predicted from the same input text, such as a web article. However, at serving time, existing multitask transformer models, such as prompt- or adapter-based approaches, need to conduct N forward passes for N tasks, with O(N) computation cost. To tackle this problem, we propose a scalable method that achieves stronger performance with close to O(1) computation cost via only one forward pass. To illustrate real application usage, we release a multitask dataset on news topic and style classification. Our experiments show that our proposed method outperforms strong baselines on both the GLUE benchmark and our news dataset. Our code and dataset are publicly available at https://bit.ly/mtop-code.
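A minimal sketch of the single-pass idea: one shared encoder forward followed by N lightweight task heads, so serving cost stays near O(1) in the number of tasks. The encoder here is a caller-supplied stand-in rather than the paper's transformer backbone.

```python
import torch
from torch import nn


class OneForwardMultiTaskClassifier(nn.Module):
    """Shared encoder plus one small head per task: all N tasks are predicted
    from a single forward pass over the input text."""

    def __init__(self, encoder: nn.Module, hidden_dim: int, num_labels_per_task: list[int]):
        super().__init__()
        self.encoder = encoder
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, n) for n in num_labels_per_task)

    def forward(self, inputs: torch.Tensor) -> list[torch.Tensor]:
        h = self.encoder(inputs)                  # one shared pass, independent of N
        return [head(h) for head in self.heads]   # cheap per-task predictions
```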


Image Steganography based on Style Transfer

Mar 09, 2022
Donghui Hu, Yu Zhang, Cong Yu, Jian Wang, Yaofei Wang


Image steganography is the art and science of using images as cover for covert communication. With the development of neural networks, traditional image steganography is increasingly likely to be detected by deep learning-based steganalysis. To improve upon this, we propose an image steganography network based on style transfer, in which the embedding of secret messages is disguised as image stylization. We embed secret information while transforming the style of the content image. In latent space, the secret information is integrated into the latent representation of the cover image to generate stego images, which are indistinguishable from normal stylized images. The model is end-to-end and unsupervised, requiring no pre-training. Extensive experiments on the benchmark dataset demonstrate the reliability, quality, and security of the stego images generated by our steganographic network.
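The toy module below sketches the latent-space idea described above: a secret bit-vector is fused into the cover image's latent code before the style-transfer decoder, and a small extractor recovers it from the stego latent. The fusion rule, dimensions, and module names are illustrative assumptions, not the paper's architecture.

```python
import torch
from torch import nn


class LatentStego(nn.Module):
    """Toy hide/reveal pair operating on latent codes (hypothetical shapes)."""

    def __init__(self, latent_dim: int = 256, msg_bits: int = 64):
        super().__init__()
        self.fuse = nn.Linear(latent_dim + msg_bits, latent_dim)  # embed message into latent
        self.extract = nn.Linear(latent_dim, msg_bits)            # recover message bits

    def hide(self, z_content: torch.Tensor, msg: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([z_content, msg], dim=-1))

    def reveal(self, z_stego: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.extract(z_stego))
```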


RFMask: A Simple Baseline for Human Silhouette Segmentation with Radio Signals

Jan 25, 2022
Zhi Wu, Dongheng Zhang, Chunyang Xie, Cong Yu, Jinbo Chen, Yang Hu, Yan Chen


Human silhouette segmentation, originally defined in computer vision, has achieved promising results for understanding human activities. However, physical limitations cause existing optical-camera-based systems to suffer severe performance degradation under low illumination, smoke, and/or opaque obstructions. To overcome these limitations, in this paper we propose to utilize radio signals, which can traverse obstacles and are unaffected by lighting conditions, to achieve silhouette segmentation. The proposed RFMask framework is composed of three modules. It first transforms RF signals captured by millimeter-wave radar on two planes into the spatial domain and suppresses interference with a signal processing module. Then, it locates human reflections in the RF frames and extracts features from the surrounding signals with a human detection module. Finally, the features extracted from the RF frames are aggregated with an attention-based mask generation module. To verify the proposed framework, we collect a dataset containing 804,760 radio frames and 402,380 camera frames with human activities under various scenes. Experimental results show that the proposed framework achieves impressive human silhouette segmentation even under challenging scenarios (such as low light and occlusion) where traditional optical-camera-based methods fail. To the best of our knowledge, this is the first investigation into segmenting human silhouettes based on millimeter-wave signals. We hope that our work can serve as a baseline and inspire further research on performing vision tasks with radio signals. The dataset and code will be made public.
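For readers who prefer code to prose, this skeleton mirrors the three-stage design described above; each sub-module (signal processing, human detection, attention-based mask generation) is a caller-supplied placeholder, since the abstract does not specify the actual layers.

```python
import torch
from torch import nn


class RFMaskPipeline(nn.Module):
    """Three-stage sketch: RF returns -> detected human features -> silhouette mask."""

    def __init__(self, signal_proc: nn.Module, detector: nn.Module, mask_head: nn.Module):
        super().__init__()
        self.signal_proc = signal_proc  # map radar returns to spatial-domain frames, suppress interference
        self.detector = detector        # locate human reflections and extract surrounding features
        self.mask_head = mask_head      # attention-based aggregation into a silhouette mask

    def forward(self, horizontal_rf: torch.Tensor, vertical_rf: torch.Tensor) -> torch.Tensor:
        frames = self.signal_proc(horizontal_rf, vertical_rf)
        feats = self.detector(frames)
        return self.mask_head(feats)
```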


RFGAN: RF-Based Human Synthesis

Dec 07, 2021
Cong Yu, Zhi Wu, Dongheng Zhang, Zhi Lu, Yang Hu, Yan Chen


This paper demonstrates human synthesis based on Radio Frequency (RF) signals, leveraging the fact that RF signals can record human movements through their reflections off the human body. Different from existing RF sensing works that can only perceive humans roughly, this paper aims to generate fine-grained optical human images by introducing a novel cross-modal RFGAN model. Specifically, we first build a radio system equipped with horizontal and vertical antenna arrays to transceive RF signals. Since the reflected RF signals are processed into obscure signal projection heatmaps on the horizontal and vertical planes, we design an RF-Extractor with an RNN in RFGAN to encode and combine the RF heatmaps and obtain the human activity information. We then inject the information extracted by the RF-Extractor and RNN as a condition into the GAN using the proposed RF-based adaptive normalizations. Finally, we train the whole model in an end-to-end manner. To evaluate the proposed model, we create two cross-modal datasets (RF-Walk & RF-Activity) that contain thousands of optical human activity frames and corresponding RF signals. Experimental results show that RFGAN can generate target human activity frames using RF signals. To the best of our knowledge, this is the first work to generate optical images based on RF signals.
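The sketch below shows one plausible form of an "RF-based adaptive normalization": a SPADE/AdaIN-style conditional norm whose scale and shift are predicted from the feature produced by the RF-Extractor. The abstract does not give the exact formulation, so treat this as an assumption.

```python
import torch
from torch import nn


class RFAdaptiveNorm(nn.Module):
    """Conditional normalization: scale/shift of image features are predicted
    from an RF feature vector (an illustrative stand-in for the paper's design)."""

    def __init__(self, num_channels: int, rf_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.to_gamma = nn.Linear(rf_dim, num_channels)
        self.to_beta = nn.Linear(rf_dim, num_channels)

    def forward(self, x: torch.Tensor, rf_feat: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, H, W); rf_feat: (batch, rf_dim)
        gamma = self.to_gamma(rf_feat)[:, :, None, None]
        beta = self.to_beta(rf_feat)[:, :, None, None]
        return self.norm(x) * (1 + gamma) + beta
```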


ReasonBERT: Pre-trained to Reason with Distant Supervision

Sep 10, 2021
Xiang Deng, Yu Su, Alyssa Lees, You Wu, Cong Yu, Huan Sun


We present ReasonBERT, a pre-training method that augments language models with the ability to reason over long-range relations and multiple, possibly hybrid contexts. Unlike existing pre-training methods that only harvest learning signals from the local contexts of naturally occurring text, we propose a generalized notion of distant supervision to automatically connect multiple pieces of text and tables and create pre-training examples that require long-range reasoning. Different types of reasoning are simulated, including intersecting multiple pieces of evidence, bridging from one piece of evidence to another, and detecting unanswerable cases. We conduct a comprehensive evaluation on a variety of extractive question answering datasets, ranging from single-hop to multi-hop and from text-only to table-only to hybrid, that require various reasoning capabilities, and show that ReasonBERT achieves remarkable improvements over an array of strong baselines. Few-shot experiments further demonstrate that our pre-training method substantially improves sample efficiency.
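As a simplified illustration of the distant-supervision idea, the helper below links sentences that mention the same entity into long-range pre-training pairs. The actual method also connects tables, masks query entities, and constructs unanswerable cases; the `entities_of` extractor here is a hypothetical stand-in.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Iterable, List, Set, Tuple


def distant_pairs(sentences: List[str],
                  entities_of: Callable[[str], Iterable[str]]) -> List[Tuple[int, int, str]]:
    """Pair up sentences that share an entity, yielding (i, j, entity) triples
    that can seed long-range reasoning examples."""
    by_entity = defaultdict(list)
    for i, s in enumerate(sentences):
        for e in set(entities_of(s)):
            by_entity[e].append(i)
    pairs: Set[Tuple[int, int, str]] = set()
    for e, idxs in by_entity.items():
        for i, j in combinations(idxs, 2):
            pairs.add((i, j, e))
    return sorted(pairs)
```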

* Accepted to EMNLP'2021. Our code and pre-trained models are available at https://github.com/sunlab-osu/ReasonBERT 

Charformer: Fast Character Transformers via Gradient-based Subword Tokenization

Jul 02, 2021
Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, Donald Metzler


State-of-the-art models in natural language processing rely on separate, rigid subword tokenization algorithms, which limit their generalization ability and adaptation to new settings. In this paper, we propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model. To this end, we introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns to score them in a position-wise fashion using a block scoring network. We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates at the byte level. Via extensive experiments on English GLUE, multilingual, and noisy text datasets, we show that Charformer outperforms a series of competitive byte-level baselines while generally performing on par with, and sometimes outperforming, subword-based models. Additionally, Charformer is fast, improving the speed of both vanilla byte-level and subword-level Transformers by 28%-100% while maintaining competitive quality. We believe this work paves the way for highly performant token-free models that are trained completely end-to-end.
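A minimal GBST-flavored sketch of the block enumeration and position-wise scoring described above: candidate blocks of a few sizes are mean-pooled from character embeddings, scored, and mixed with a softmax. Offset enumeration and the downsampling step of the actual module are omitted, and the block sizes are arbitrary choices.

```python
import torch
from torch import nn
import torch.nn.functional as F


class SoftSubwordMixer(nn.Module):
    """Score candidate subword blocks position-wise and take their soft mixture."""

    def __init__(self, d_model: int, block_sizes: tuple[int, ...] = (1, 2, 4)):
        super().__init__()
        self.block_sizes = block_sizes
        self.scorer = nn.Linear(d_model, 1)  # block scoring network

    def forward(self, char_emb: torch.Tensor) -> torch.Tensor:
        # char_emb: (batch, seq_len, d_model)
        seq_len = char_emb.size(1)
        candidates = []
        for b in self.block_sizes:
            pooled = F.avg_pool1d(char_emb.transpose(1, 2), b, stride=b, ceil_mode=True)
            pooled = pooled.repeat_interleave(b, dim=2)[..., :seq_len]  # broadcast back to characters
            candidates.append(pooled.transpose(1, 2))
        cand = torch.stack(candidates, dim=2)          # (batch, seq, n_blocks, d_model)
        weights = F.softmax(self.scorer(cand), dim=2)  # position-wise block scores
        return (weights * cand).sum(dim=2)             # soft subword representation
```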


NewsEmbed: Modeling News through Pre-trained Document Representations

Jun 05, 2021
Jialu Liu, Tianqi Liu, Cong Yu


Effectively modeling text-rich, fresh content such as news articles at the document level is a challenging problem. To ensure that a content-based model generalizes well to a broad range of applications, it is critical to have a training dataset that is large beyond the scale of human labeling while achieving the desired quality. In this work, we address these two challenges by proposing a novel approach to mine semantically relevant fresh documents, together with their topic labels, with little human supervision. Meanwhile, we design a multitask model called NewsEmbed that alternately trains a contrastive learning objective and a multi-label classification objective to derive a universal document encoder. We show that the proposed approach can provide billions of high-quality organic training examples and can be naturally extended to a multilingual setting where texts in different languages are encoded in the same semantic space. We experimentally demonstrate NewsEmbed's competitive performance across multiple natural language understanding tasks, both supervised and unsupervised.
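The snippet below sketches the alternating training idea in the abstract: on even steps the encoder is trained with an in-batch contrastive loss over two views of the same document, on odd steps with a multi-label topic classification loss. The schedule, batch keys, and loss choices are assumptions for illustration, not NewsEmbed's exact recipe.

```python
import torch
from torch import nn
import torch.nn.functional as F


def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss between two views of the same documents."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)


def training_step(encoder: nn.Module, clf_head: nn.Module, batch: dict, step: int) -> torch.Tensor:
    """Alternate between the contrastive objective and multi-label classification."""
    if step % 2 == 0:
        z_a, z_b = encoder(batch["view_a"]), encoder(batch["view_b"])
        return info_nce(z_a, z_b)
    logits = clf_head(encoder(batch["docs"]))
    return F.binary_cross_entropy_with_logits(logits, batch["topic_labels"].float())
```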

* Accepted in SIGKDD 2021 