Lei Li

Generative Autoencoders as Watermark Attackers: Analyses of Vulnerabilities and Threats

Jun 02, 2023
Xuandong Zhao, Kexun Zhang, Yu-Xiang Wang, Lei Li

Invisible watermarks safeguard images' copyrights by embedding hidden messages detectable by their owners; they also help prevent the misuse of images, especially those generated by AI models. Malicious adversaries can violate these rights by removing the watermarks. To remove a watermark without damaging visual quality, the adversary must erase it while retaining the essential information in the image. This is analogous to the encoding and decoding process of generative autoencoders, especially variational autoencoders (VAEs) and diffusion models. We propose a framework that uses generative autoencoders to remove invisible watermarks and test it with VAEs and diffusion models. Our results reveal that, even without specific training, off-the-shelf Stable Diffusion effectively removes most watermarks, surpassing all current attackers. These results underscore the vulnerabilities in existing watermarking schemes and call for more robust methods for copyright protection.
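
The regeneration attack described above can be approximated with an off-the-shelf Stable Diffusion autoencoder. Below is a minimal sketch assuming the diffusers library and the public stabilityai/sd-vae-ft-mse checkpoint; the exact model, preprocessing, and resolution handling in the paper may differ.

```python
# Round-trip a (possibly watermarked) image through a Stable Diffusion VAE.
# The lossy encode/decode pass tends to erase invisible watermarks while
# keeping the visible content intact. Illustrative sketch only.
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

def regenerate(path: str) -> Image.Image:
    img = Image.open(path).convert("RGB")                        # side lengths should be multiples of 8
    x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0    # scale pixels to [-1, 1]
    x = x.permute(2, 0, 1).unsqueeze(0)                          # (1, 3, H, W)
    with torch.no_grad():
        latents = vae.encode(x).latent_dist.sample()             # compress to latent space
        recon = vae.decode(latents).sample                       # reconstruct the image
    recon = ((recon.clamp(-1, 1) + 1) * 127.5).round().to(torch.uint8)
    return Image.fromarray(recon.squeeze(0).permute(1, 2, 0).numpy())
```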

Large Language Models are not Fair Evaluators

May 29, 2023
Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, Zhifang Sui

We uncover a systematic bias in the evaluation paradigm that adopts large language models (LLMs), e.g., GPT-4, as referees to score the quality of responses generated by candidate models. We find that the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna can beat ChatGPT on 66 of 80 tested queries. To address this issue, we propose two simple yet effective calibration strategies: 1) Multiple Evidence Calibration, which requires the evaluator model to generate multiple detailed pieces of evidence before assigning ratings; 2) Balanced Position Calibration, which aggregates results across the different orders to determine the final score. Extensive experiments demonstrate that our approach successfully mitigates evaluation bias, resulting in closer alignment with human judgments. To facilitate future research on more robust large language model comparison, we integrate the techniques in the paper into an easy-to-use toolkit, FairEval, along with the human annotations: https://github.com/i-Eval/FairEval

* work in progress 
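
Of the two strategies, Balanced Position Calibration is easy to sketch: query the judge with both presentation orders and average the scores so neither response benefits from position bias. In the snippet below, ask_judge is a hypothetical stand-in for a single call to the evaluator LLM (e.g., GPT-4); it is not the released FairEval API.

```python
# Balanced Position Calibration: evaluate both orders, then average per response.
from statistics import mean

def ask_judge(question: str, first: str, second: str) -> tuple[float, float]:
    """Hypothetical helper: return (score_for_first, score_for_second) from the judge."""
    raise NotImplementedError("call your evaluator LLM here")

def balanced_position_calibration(question: str, resp_a: str, resp_b: str) -> str:
    a1, b1 = ask_judge(question, resp_a, resp_b)   # response A shown first
    b2, a2 = ask_judge(question, resp_b, resp_a)   # response B shown first
    score_a, score_b = mean([a1, a2]), mean([b1, b2])
    if score_a == score_b:
        return "tie"
    return "A" if score_a > score_b else "B"
```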

Neural Machine Translation with Dynamic Graph Convolutional Decoder

May 28, 2023
Lei Li, Kai Fan, Lingyu Yang, Hongjia Li, Chun Yuan

Prior work has demonstrated the significance of syntactic knowledge for improving neural machine translation models. However, most previous work focuses only on leveraging source-side syntax within the well-known encoder-decoder framework. In sharp contrast, this paper proposes an end-to-end translation architecture from (graph & sequence) structural inputs to (graph & sequence) outputs, where the target translation and its corresponding syntactic graph are jointly modeled and generated. We propose a customized Dynamic Spatial-Temporal Graph Convolutional Decoder (Dyn-STGCD), which consumes the source feature representations and their syntactic graph, and auto-regressively generates the target syntactic graph and tokens simultaneously. We conduct extensive experiments on five widely acknowledged translation benchmarks, verifying that our proposal achieves consistent improvements over baselines and other syntax-aware variants.
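
For intuition only: a decoder of this kind needs some form of graph convolution over the target syntactic graph generated so far at each decoding step. The layer below is a generic degree-normalized graph convolution in PyTorch, offered as an assumption about the flavor of computation involved, not the paper's actual Dyn-STGCD block.

```python
# Generic graph-convolution step: each node aggregates its neighbours' states
# (degree-normalised) before the next token/arc is predicted.
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h:   (batch, num_nodes, d_model) node states
        # adj: (batch, num_nodes, num_nodes) adjacency of the partial graph, with self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.proj(adj @ h) / deg)

layer = GraphConvLayer(d_model=512)
h = torch.randn(2, 7, 512)                      # 7 target nodes decoded so far
adj = torch.eye(7).expand(2, 7, 7).clone()      # placeholder graph: self-loops only
print(layer(h, adj).shape)                      # torch.Size([2, 7, 512])
```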

Prompt Optimization of Large Language Model for Interactive Tasks without Gradient and Demonstrations

May 24, 2023
Siqi Ouyang, Lei Li

Large language models (LLMs) have demonstrated remarkable language proficiency, but they face challenges when solving interactive tasks independently. Existing methods either rely on gradient access, which is often unavailable in state-of-the-art LLMs like GPT-4, or require diverse and high-quality in-context demonstrations. In this study, we propose LLM-PO, a novel approach that enables LLMs to address these tasks without gradient access or extensive demonstrations. The key idea is to maintain a text-based plan and ask the LLM to reflect on the pros and cons of the current plan based on experience collected with it, to update the plan, and then to collect more experience with the new plan. Experiments on HotpotQA demonstrate that LLM-PO achieves success rates higher than or on par with in-context learning (ICL) baselines while requiring lower inference cost.

* Draft. Work in Progress 
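
The plan-maintenance loop sketched in the abstract can be written down in a few lines. Here llm and run_episode are hypothetical stand-ins for the LLM API and the interactive environment, and the prompts are illustrative rather than LLM-PO's actual templates.

```python
# Gradient-free plan optimization: act with the current plan, ask the LLM to
# critique it against the collected experience, then rewrite the plan.
def llm(prompt: str) -> str:
    raise NotImplementedError("call your LLM (e.g. GPT-4) here")

def run_episode(plan: str) -> str:
    raise NotImplementedError("act in the environment following `plan`; return a trajectory log")

def optimize_plan(initial_plan: str, iterations: int = 5) -> str:
    plan = initial_plan
    for _ in range(iterations):
        experience = run_episode(plan)
        critique = llm(f"Plan:\n{plan}\n\nExperience:\n{experience}\n\n"
                       "List the pros and cons of this plan.")
        plan = llm(f"Plan:\n{plan}\n\nCritique:\n{critique}\n\n"
                   "Rewrite the plan to address the cons while keeping the pros.")
    return plan
```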

ImageNetVC: Zero-Shot Visual Commonsense Evaluation on 1000 ImageNet Categories

May 24, 2023
Heming Xia, Qingxiu Dong, Lei Li, Jingjing Xu, Ziwei Qin, Zhifang Sui

Recently, Pretrained Language Models (PLMs) have been serving as general-purpose interfaces, posing a significant demand for comprehensive visual knowledge. However, it remains unclear how well current PLMs and their visually augmented counterparts (VaLMs) can master visual commonsense knowledge. To investigate this, we propose ImageNetVC, a fine-grained, human-annotated dataset specifically designed for zero-shot visual commonsense evaluation across 1,000 ImageNet categories. Utilizing ImageNetVC, we delve into the fundamental visual commonsense knowledge of both unimodal PLMs and VaLMs, uncovering the scaling law and the influence of the backbone model on VaLMs. Furthermore, we investigate the factors affecting the visual commonsense knowledge of large-scale models, providing insights into the development of language models enriched with visual commonsense knowledge. Our code and dataset are available at https://github.com/hemingkx/ImageNetVC.
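
Zero-shot probing of this kind typically ranks candidate answers by language-model likelihood under a cloze-style prompt. The sketch below uses GPT-2 small and an invented prompt purely for illustration; the official evaluation protocol and templates are in the linked repository.

```python
# Rank candidate answers to a visual-commonsense question by the log-probability
# the LM assigns to each continuation. Model, prompt, and candidates are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    full = tok(prompt + continuation, return_tensors="pt").input_ids
    prompt_len = tok(prompt, return_tensors="pt").input_ids.size(1)
    with torch.no_grad():
        logits = model(full).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)          # predict token t+1 from prefix
    targets = full[0, 1:]
    token_lp = logprobs[torch.arange(targets.size(0)), targets]
    return token_lp[prompt_len - 1:].sum().item()                 # log p(continuation | prompt)

def answer(question: str, candidates: list[str]) -> str:
    return max(candidates, key=lambda c: continuation_logprob(question, c)).strip()

print(answer("Q: What color is a ripe lemon? A:", [" yellow", " blue", " purple"]))
```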

ALGO: Synthesizing Algorithmic Programs with Generated Oracle Verifiers

May 24, 2023
Kexun Zhang, Danqing Wang, Jingtao Xia, William Yang Wang, Lei Li

Large language models (LLMs) excel at implementing code from functionality descriptions but struggle with algorithmic problems that require not only implementation but also identification of a suitable algorithm. Moreover, LLM-generated programs lack guaranteed correctness and require human verification. To address these challenges, we propose ALGO, a framework that synthesizes Algorithmic programs with LLM-Generated Oracles to guide creation and verify correctness. ALGO first generates a probably correct but possibly slow reference oracle by prompting an LLM to exhaustively enumerate all combinations of the relevant variables. This oracle is then used to guide an arbitrary search strategy in exploring the algorithm space and to verify the synthesized algorithms. Our study shows that the LLM-generated oracles are correct in 88% of cases. With the oracles as verifiers, ALGO can be integrated with any existing code generation model in a model-agnostic manner to enhance its performance. Experiments show that, when equipped with ALGO, we achieve an 8x better one-submission pass rate over the Codex model and a 2.6x better one-submission pass rate over CodeT, the current state-of-the-art model on CodeContests. We also obtain a 1.3x better pass rate than the ChatGPT Code Interpreter on unseen problems.
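
The oracle-as-verifier idea is easy to illustrate on a toy problem (maximum subarray sum). The brute-force oracle_solve plays the role of ALGO's LLM-generated exhaustive oracle and candidate_solve the role of a synthesized efficient solution; ALGO's actual pipeline generates both with LLMs and also synthesizes the input generator.

```python
# Verify an efficient candidate against a slow-but-simple exhaustive oracle
# on randomly generated inputs.
import random

def oracle_solve(xs):                    # exhaustive reference: O(n^2), easy to trust
    return max(sum(xs[i:j + 1]) for i in range(len(xs)) for j in range(i, len(xs)))

def candidate_solve(xs):                 # efficient candidate (Kadane's algorithm)
    best = cur = xs[0]
    for x in xs[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

def verify(n_tests: int = 200) -> bool:
    for _ in range(n_tests):
        xs = [random.randint(-50, 50) for _ in range(random.randint(1, 30))]
        if candidate_solve(xs) != oracle_solve(xs):
            return False                 # mismatch: reject the candidate
    return True

print(verify())                          # True -> candidate agrees with the oracle on all samples
```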

INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback

May 23, 2023
Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao Song, Markus Freitag, William Yang Wang, Lei Li

The field of automatic evaluation of text generation has made tremendous progress in the last few years. In particular, since the advent of neural metrics like COMET, BLEURT, and SEScore2, the newest generation of metrics shows a high correlation with human judgment. Unfortunately, quality scores generated by neural metrics are not interpretable, and it is unclear which part of the generated output is being criticized by the metric. To address this limitation, we present INSTRUCTSCORE, an open-source, explainable evaluation metric for text generation. By harnessing both explicit human instruction and the implicit knowledge of GPT-4, we fine-tune a LLaMA model to create an evaluative metric that can produce a diagnostic report aligned with human judgment. We evaluate INSTRUCTSCORE on the WMT22 Zh-En translation task, where our 7B model surpasses other LLM-based baselines, including those based on the 175B GPT-3. Impressively, INSTRUCTSCORE, even without direct supervision from human-rated data, achieves performance on par with state-of-the-art metrics like COMET22, which was fine-tuned on human ratings.

* Work in progress 
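
A diagnostic metric of this kind is consumed by asking the fine-tuned evaluator for an error report and turning the listed errors into a score. In the sketch below, generate_report is a hypothetical stand-in for the fine-tuned LLaMA evaluator, and the severity weights are an assumed MQM-style convention, not the paper's exact formula.

```python
# Derive a scalar score from an explainable, error-level diagnostic report.
def generate_report(source: str, candidate: str) -> list[dict]:
    """Hypothetical: return error annotations such as
    [{"type": "mistranslation", "severity": "major", "span": "..."}]."""
    raise NotImplementedError("query the fine-tuned evaluator model here")

def diagnostic_score(source: str, candidate: str) -> float:
    penalties = {"major": 5.0, "minor": 1.0}          # assumed weights
    errors = generate_report(source, candidate)
    return -sum(penalties.get(e["severity"], 1.0) for e in errors)
```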

Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning

May 23, 2023
Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun

In-context learning (ICL) emerges as a promising capability of large language models (LLMs): provided with demonstration examples, they can perform diverse tasks. However, the underlying mechanism by which LLMs learn from the provided context remains under-explored. In this paper, we investigate the working mechanism of ICL through an information-flow lens. Our findings reveal that label words in the demonstration examples function as anchors: (1) semantic information aggregates into label-word representations during processing in the shallow layers; (2) the consolidated information in label words serves as a reference for the LLM's final predictions. Based on these insights, we introduce an anchor re-weighting method to improve ICL performance, a demonstration compression technique to speed up inference, and an analysis framework for diagnosing ICL errors in GPT2-XL. These promising applications further validate the uncovered working mechanism of ICL and pave the way for future studies.
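
The anchor phenomenon can be probed directly by reading out hidden states at the label-word positions of an ICL prompt. The sketch below uses GPT-2 small and a toy sentiment prompt for illustration; the paper's analyses target GPT2-XL and larger models.

```python
# Locate the label words in an in-context prompt and extract their hidden states,
# which the paper identifies as the anchors that the final prediction relies on.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = ("Review: great movie! Sentiment: positive\n"
          "Review: boring plot. Sentiment: negative\n"
          "Review: I loved every minute. Sentiment:")
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    hidden = model(ids, output_hidden_states=True).hidden_states   # (layers+1) x (1, seq, dim)

label_ids = {tok(" positive").input_ids[0], tok(" negative").input_ids[0]}
anchor_pos = [i for i, t in enumerate(ids[0].tolist()) if t in label_ids]
anchors = hidden[-1][0, anchor_pos]        # last-layer states at the label-word positions
print(anchors.shape)                       # (num_label_words, hidden_dim)
```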

Can Language Models Understand Physical Concepts?

May 23, 2023
Lei Li, Jingjing Xu, Qingxiu Dong, Ce Zheng, Qi Liu, Lingpeng Kong, Xu Sun

Language models (LMs) are gradually becoming general-purpose interfaces in the interactive and embodied world, where an understanding of physical concepts is an essential prerequisite. However, it is not yet clear whether LMs can understand physical concepts in the human world. To investigate this, we design a benchmark, VEC, that covers tasks on (i) Visual concepts, such as the shape and material of objects, and (ii) Embodied Concepts, learned from interaction with the world, such as the temperature of objects. Our zero- and few-shot prompting results show that the understanding of certain visual concepts emerges as LMs scale up, but there are still basic concepts to which the scaling law does not apply. For example, OPT-175B performs close to humans with a zero-shot accuracy of 85% on the material concept, yet behaves like random guessing on the mass concept. In contrast, vision-augmented LMs such as CLIP and BLIP achieve a human-level understanding of embodied concepts. Analysis indicates that the rich semantics in visual representations can serve as a valuable source of embodied knowledge. Inspired by this, we propose a distillation method to transfer embodied knowledge from VLMs to LMs, achieving a performance gain comparable to that of scaling up LM parameters 134x. Our dataset is available at https://github.com/TobiasLee/VEC.
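
One plausible form of the proposed distillation is to pull a small LM's pooled text representations toward the text embeddings of a vision-augmented teacher. The sketch below uses distilgpt2 as the student and the CLIP text encoder as the teacher with a simple MSE objective; these choices and the projection head are assumptions for illustration, not the paper's exact recipe.

```python
# Nudge a text-only student LM toward the text embeddings of a vision-trained teacher.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer, CLIPTextModel, CLIPTokenizer

student_tok = AutoTokenizer.from_pretrained("distilgpt2")
student_tok.pad_token = student_tok.eos_token           # GPT-2 tokenizers have no pad token
student = AutoModel.from_pretrained("distilgpt2")
teacher_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
teacher = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

proj = nn.Linear(student.config.hidden_size, teacher.config.hidden_size)
opt = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=1e-5)

def distill_step(sentences: list[str]) -> float:
    s_in = student_tok(sentences, return_tensors="pt", padding=True)
    mask = s_in["attention_mask"].unsqueeze(-1).float()
    s_repr = (student(**s_in).last_hidden_state * mask).sum(1) / mask.sum(1)   # masked mean pool
    with torch.no_grad():
        t_in = teacher_tok(sentences, return_tensors="pt", padding=True)
        t_repr = teacher(**t_in).pooler_output                                 # CLIP text embedding
    loss = nn.functional.mse_loss(proj(s_repr), t_repr)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(distill_step(["Lemons are sour and yellow.", "Ice feels cold to the touch."]))
```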

Learn from Mistakes through Cooperative Interaction with Study Assistant

May 23, 2023
Danqing Wang, Lei Li

Large language models have demonstrated the ability to self-reflect and refine their generations, which can further improve their performance. However, this feedback mechanism faces challenges such as the lack of a correctness guarantee and the lack of global insight into the model's weaknesses. In this paper, we propose a novel framework, Study Assistant for Large Language Model (SALAM), to aid LLMs in the reflection and refinement process. Motivated by human study assistants, this framework grades previous responses against the ground truth and collects mistakes during the training phase. During inference, it identifies common misunderstandings from the mistake collection and provides guidelines that help the model avoid similar mistakes. SALAM is a model-agnostic framework: it focuses on providing general feedback and can be adapted to any base model. Our evaluation of SALAM on two challenging benchmarks demonstrates a significant improvement over various baselines.
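
The collect-mistakes / provide-guidelines loop can be sketched as follows; base_model and assistant are hypothetical stand-ins for the LLM under study and the study-assistant model, and the prompts are illustrative rather than SALAM's actual templates.

```python
# Training phase: grade the base model against ground truth and keep its failures.
# Inference phase: summarise the failures into guidelines that precede new queries.
def base_model(prompt: str) -> str:
    raise NotImplementedError("call the base LLM here")

def assistant(prompt: str) -> str:
    raise NotImplementedError("call the study-assistant model here")

def collect_mistakes(train_set):
    mistakes = []
    for question, answer in train_set:
        prediction = base_model(question)
        if prediction.strip() != answer.strip():
            mistakes.append({"question": question, "prediction": prediction, "answer": answer})
    return mistakes

def answer_with_guidelines(question: str, mistakes: list) -> str:
    guideline = assistant("Summarise the common misunderstandings below and give guidelines "
                          "to avoid them:\n" + "\n".join(str(m) for m in mistakes[:20]))
    return base_model(f"{guideline}\n\nQuestion: {question}\nAnswer:")
```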
