Lei Li

Carnegie Mellon University

M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning

Jun 08, 2023

Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun(+2 more)

Abstract: Instruction tuning has significantly advanced large language models (LLMs) such as ChatGPT, enabling them to align with human instructions across diverse tasks. However, progress in open vision-language models (VLMs) has been limited due to the scarcity of high-quality instruction datasets. To tackle this challenge and promote research in the vision-language field, we introduce the Multi-Modal, Multilingual Instruction Tuning (M$^3$IT) dataset, designed to optimize VLM alignment with human instructions. Our M$^3$IT dataset comprises 40 carefully curated datasets, including 2.4 million instances and 400 manually written task instructions, reformatted into a vision-to-text structure. Key tasks are translated into 80 languages with an advanced translation system, ensuring broader accessibility. M$^3$IT surpasses previous datasets in task coverage, number of instructions, and instance scale. Moreover, we develop Ying-VLM, a VLM trained on our M$^3$IT dataset, showcasing its potential to answer complex questions requiring world knowledge, generalize to unseen video tasks, and comprehend unseen instructions in Chinese. We have open-sourced the dataset to encourage further research.

* Dataset (corrected URL): https://huggingface.co/datasets/MMInstruction/M3IT; Project: https://m3-it.github.io/

Generative Autoencoders as Watermark Attackers: Analyses of Vulnerabilities and Threats

Jun 02, 2023

Xuandong Zhao, Kexun Zhang, Yu-Xiang Wang, Lei Li

Abstract: Invisible watermarks safeguard images' copyrights by embedding hidden messages detectable by owners. They also prevent people from misusing images, especially those generated by AI models. Malicious adversaries can violate these rights by removing the watermarks. In order to remove watermarks without damaging the visual quality, the adversary needs to erase them while retaining the essential information in the image. This is analogous to the encoding and decoding process of generative autoencoders, especially variational autoencoders (VAEs) and diffusion models. We propose a framework using generative autoencoders to remove invisible watermarks and test it using VAEs and diffusion models. Our results reveal that, even without specific training, off-the-shelf Stable Diffusion effectively removes most watermarks, surpassing all current attackers. The result underscores the vulnerabilities in existing watermarking schemes and calls for more robust methods for copyright protection.

Large Language Models are not Fair Evaluators

May 29, 2023

Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, Zhifang Sui

Abstract: We uncover a systematic bias in the evaluation paradigm of adopting large language models (LLMs), e.g., GPT-4, as a referee to score the quality of responses generated by candidate models. We find that the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna could beat ChatGPT on 66 of 80 tested queries. To address this issue, we propose two simple yet effective calibration strategies: 1) Multiple Evidence Calibration, which requires the evaluator model to generate multiple detailed pieces of evidence before assigning ratings; 2) Balanced Position Calibration, which aggregates results across various orders to determine the final score. Extensive experiments demonstrate that our approach successfully mitigates evaluation bias, resulting in closer alignment with human judgments. To facilitate future research on more robust large language model comparison, we integrate the techniques in the paper into an easy-to-use toolkit, FairEval, along with the human annotations (https://github.com/i-Eval/FairEval).

* work in progress
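
The Balanced Position Calibration idea in the abstract above can be illustrated with a minimal sketch. The `judge` interface, the function names, and the toy position-biased judge are all assumptions made for illustration; this is not the FairEval toolkit's API. The toy judge gives a fixed bonus to whichever response appears first, so the effect of averaging over both orders is easy to see.

```python
from typing import Callable, Tuple

# Hypothetical judge interface: takes (question, first_response, second_response)
# and returns one score per response, in the order they were shown.
Judge = Callable[[str, str, str], Tuple[float, float]]


def balanced_position_scores(judge: Judge, question: str,
                             resp_a: str, resp_b: str) -> Tuple[float, float]:
    """Average each response's score over both presentation orders."""
    a_first, b_second = judge(question, resp_a, resp_b)
    b_first, a_second = judge(question, resp_b, resp_a)
    return (a_first + a_second) / 2, (b_first + b_second) / 2


def position_biased_judge(question: str, first: str, second: str) -> Tuple[float, float]:
    """Toy stand-in for an LLM judge: scores by length, plus a bonus for slot one."""
    return len(first) + 2.0, float(len(second))


if __name__ == "__main__":
    q = "What is 2 + 2?"
    a, b = "four", "4.0!"  # equal length, so the toy judge should rate them equally
    print(position_biased_judge(q, a, b))                             # first slot wins
    print(balanced_position_scores(position_biased_judge, q, a, b))   # bias cancels out
```

With a real evaluator, `judge` would wrap a model call, and the paper's approach additionally asks the evaluator to generate evidence before rating; that part is not modeled here.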

Neural Machine Translation with Dynamic Graph Convolutional Decoder

May 28, 2023

Lei Li, Kai Fan, Lingyu Yang, Hongjia Li, Chun Yuan

Abstract: Existing wisdom demonstrates the significance of syntactic knowledge for the improvement of neural machine translation models. However, most previous works merely focus on leveraging the source syntax in the well-known encoder-decoder framework. In sharp contrast, this paper proposes an end-to-end translation architecture from the (graph & sequence) structural inputs to the (graph & sequence) outputs, where the target translation and its corresponding syntactic graph are jointly modeled and generated. We propose a customized Dynamic Spatial-Temporal Graph Convolutional Decoder (Dyn-STGCD), which is designed for consuming source feature representations and their syntactic graph, and auto-regressively generating the target syntactic graph and tokens simultaneously. We conduct extensive experiments on five widely acknowledged translation benchmarks, verifying that our proposal achieves consistent improvements over baselines and other syntax-aware variants.

ImageNetVC: Zero-Shot Visual Commonsense Evaluation on 1000 ImageNet Categories

May 24, 2023

Heming Xia, Qingxiu Dong, Lei Li, Jingjing Xu, Ziwei Qin, Zhifang Sui

Abstract: Recently, Pretrained Language Models (PLMs) have been serving as general-purpose interfaces, posing a significant demand for comprehensive visual knowledge. However, it remains unclear how well current PLMs and their visually augmented counterparts (VaLMs) can master visual commonsense knowledge. To investigate this, we propose ImageNetVC, a fine-grained, human-annotated dataset specifically designed for zero-shot visual commonsense evaluation across 1,000 ImageNet categories. Utilizing ImageNetVC, we delve into the fundamental visual commonsense knowledge of both unimodal PLMs and VaLMs, uncovering the scaling law and the influence of the backbone model on VaLMs. Furthermore, we investigate the factors affecting the visual commonsense knowledge of large-scale models, providing insights into the development of language models enriched with visual commonsense knowledge. Our code and dataset are available at https://github.com/hemingkx/ImageNetVC.

ALGO: Synthesizing Algorithmic Programs with Generated Oracle Verifiers

May 24, 2023

Kexun Zhang, Danqing Wang, Jingtao Xia, William Yang Wang, Lei Li

Abstract: Large language models (LLMs) excel at implementing code from functionality descriptions, but struggle with algorithmic problems that require not only implementation but also identification of the suitable algorithm. Moreover, LLM-generated programs lack guaranteed correctness and require human verification. To address these challenges, we propose ALGO, a framework that synthesizes Algorithmic programs with LLM-Generated Oracles to guide the creation and verify their correctness. ALGO first generates a probably correct but possibly slow reference oracle by prompting an LLM to exhaustively enumerate all the combinations of relevant variables. This oracle is then utilized to guide an arbitrary search strategy in exploring the algorithm space and to verify the algorithms synthesized. Our study shows that the LLM-generated oracles are correct for 88% of the cases. With the oracles as verifiers, ALGO can be integrated with any existing code generation model in a model-agnostic manner to enhance its performance. Experiments show that when equipped with ALGO, we achieve an 8x better one-submission pass rate over the Codex model and a 2.6x better one-submission pass rate over CodeT, the current state-of-the-art model on CodeContests. We can also get 1.3x better pass rate over the ChatGPT Code Interpreter on unseen problems.
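
The oracle-as-verifier idea in the abstract above can be sketched on a small example. Everything here is an assumption chosen for illustration: the problem (maximum subarray sum), the hand-written brute-force oracle (in ALGO the oracle is generated by an LLM), the two candidate programs, and the random input generator. The sketch only shows how a slow exhaustive reference can accept or reject faster candidates.

```python
import random
from typing import Callable, List, Optional


def brute_force_oracle(nums: List[int]) -> int:
    """Slow reference: enumerate every non-empty contiguous subarray."""
    return max(sum(nums[i:j])
               for i in range(len(nums))
               for j in range(i + 1, len(nums) + 1))


def kadane(nums: List[int]) -> int:
    """Efficient candidate: linear-time maximum subarray sum."""
    best = cur = nums[0]
    for x in nums[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best


def buggy_candidate(nums: List[int]) -> int:
    """Faulty candidate: implicitly allows the empty subarray (returns 0 if all negative)."""
    best = cur = 0
    for x in nums:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best


def find_counterexample(candidate: Callable[[List[int]], int],
                        oracle: Callable[[List[int]], int],
                        trials: int = 200, seed: int = 0) -> Optional[List[int]]:
    """Return a small random input where candidate and oracle disagree, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        nums = [rng.randint(-5, 5) for _ in range(rng.randint(1, 6))]
        if candidate(nums) != oracle(nums):
            return nums
    return None


if __name__ == "__main__":
    print("kadane:", find_counterexample(kadane, brute_force_oracle))
    print("buggy:", find_counterexample(buggy_candidate, brute_force_oracle))
```

Agreement on random inputs is evidence rather than proof of correctness, which matches the abstract's framing of the oracle as "probably correct".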

Prompt Optimization of Large Language Model for Interactive Tasks without Gradient and Demonstrations

May 24, 2023

Siqi Ouyang, Lei Li

Abstract: Large language models (LLMs) have demonstrated remarkable language proficiency, but they face challenges when solving interactive tasks independently. Existing methods either rely on gradient access, which is often unavailable for state-of-the-art LLMs like GPT-4, or necessitate diverse and high-quality in-context demonstrations. In this study, we propose LLM-PO, a novel approach that enables LLMs to address these tasks without gradient access or extensive demonstrations. The key idea is to maintain a text-based plan and ask LLMs to reflect on pros and cons of the current plan based on experience collected with it, to update the plan, and to collect more experiences with the new plan. Experiments on HotpotQA demonstrate that LLM-PO achieves success rates higher than or on par with in-context learning (ICL) baselines while requiring lower inference cost.

* Draft. Work in Progress
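
The plan-reflect-update loop described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `optimize_plan`, the three callables, and the toy stand-ins are hypothetical names, and in practice `reflect` and `update` would be LLM calls and `run_episode` would interact with a real task environment.

```python
from typing import Callable, List, Tuple

Experience = Tuple[str, bool]  # (what happened, whether the episode succeeded)


def optimize_plan(
    initial_plan: str,
    run_episode: Callable[[str], Experience],
    reflect: Callable[[str, List[Experience]], str],
    update: Callable[[str, str], str],
    rounds: int = 3,
    episodes_per_round: int = 2,
) -> str:
    """Alternate between collecting experience, reflecting, and revising the plan."""
    plan = initial_plan
    for _ in range(rounds):
        experiences = [run_episode(plan) for _ in range(episodes_per_round)]
        critique = reflect(plan, experiences)  # an LLM call in practice
        plan = update(plan, critique)          # an LLM call in practice
    return plan


# Toy stand-ins so the loop runs without any model or environment.
def toy_episode(plan: str) -> Experience:
    ok = "cross-check" in plan
    return ("answered after cross-checking" if ok else "answered from one source"), ok


def toy_reflect(plan: str, experiences: List[Experience]) -> str:
    failed = any(not ok for _, ok in experiences)
    return "Con: single-source answers fail; cross-check facts." if failed else ""


def toy_update(plan: str, critique: str) -> str:
    return plan + " Then cross-check facts." if critique else plan


if __name__ == "__main__":
    print(optimize_plan("Search for the entity.", toy_episode, toy_reflect, toy_update))
```

No gradients or demonstrations appear anywhere in the loop; the only state carried between rounds is the plan text itself.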

Can Language Models Understand Physical Concepts?

May 23, 2023

Lei Li, Jingjing Xu, Qingxiu Dong, Ce Zheng, Qi Liu, Lingpeng Kong, Xu Sun

Abstract: Language models (LMs) are gradually becoming general-purpose interfaces in the interactive and embodied world, where the understanding of physical concepts is an essential prerequisite. However, it is not yet clear whether LMs can understand physical concepts in the human world. To investigate this, we design a benchmark VEC that covers the tasks of (i) Visual concepts, such as the shape and material of objects, and (ii) Embodied Concepts, learned from the interaction with the world such as the temperature of objects. Our zero (few)-shot prompting results show that the understanding of certain visual concepts emerges as LMs scale up, but there are still basic concepts to which the scaling law does not apply. For example, OPT-175B performs close to humans with a zero-shot accuracy of 85% on the material concept, yet behaves like random guessing on the mass concept. Instead, vision-augmented LMs such as CLIP and BLIP achieve a human-level understanding of embodied concepts. Analysis indicates that the rich semantics in visual representation can serve as a valuable source of embodied knowledge. Inspired by this, we propose a distillation method to transfer embodied knowledge from VLMs to LMs, achieving a performance gain comparable to that of scaling up the parameters of LMs 134x. Our dataset is available at https://github.com/TobiasLee/VEC

Learn from Mistakes through Cooperative Interaction with Study Assistant

May 23, 2023

Danqing Wang, Lei Li

Abstract: Large language models have demonstrated their ability to self-reflect and refine their generation, which can further improve their performance. However, this feedback mechanism faces challenges such as no guarantee of correctness and the lack of global insight into the model's weaknesses. In this paper, we propose a novel framework, Study Assistant for Large Language Model (SALAM), to aid LLMs in the reflection and refinement process. Motivated by the human study assistant, this framework grades previous responses with the ground truth and collects mistakes in the training phase. During inference, it identifies common misunderstandings based on the mistake collections and provides guidelines to help the model avoid similar mistakes. SALAM is a model-agnostic framework that focuses on providing general feedback and can adapt to any base model. Our evaluation of SALAM on two challenging benchmarks demonstrated a significant improvement over various baselines.

INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback

May 23, 2023

Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao Song, Markus Freitag, William Yang Wang, Lei Li

Abstract: The field of automatic evaluation of text generation has made tremendous progress in the last few years. In particular, since the advent of neural metrics, like COMET, BLEURT, and SEScore2, the newest generation of metrics show a high correlation with human judgment. Unfortunately, quality scores generated with neural metrics are not interpretable, and it is unclear which part of the generation output is criticized by the metrics. To address this limitation, we present INSTRUCTSCORE, an open-source, explainable evaluation metric for text generation. By harnessing both explicit human instruction and the implicit knowledge of GPT4, we fine-tune a LLAMA model to create an evaluative metric that can produce a diagnostic report aligned with human judgment. We evaluate INSTRUCTSCORE on the WMT22 Zh-En translation task, where our 7B model surpasses other LLM-based baselines, including those based on 175B GPT3. Impressively, our INSTRUCTSCORE, even without direct supervision from human-rated data, achieves performance levels on par with state-of-the-art metrics like COMET22, which was fine-tuned on human ratings.

* Work in progress
