Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Liang Chen

Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond

Oct 03, 2023
Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Tianyu Liu, Baobao Chang

Figure 1 for Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond

Figure 2 for Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond

Figure 3 for Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond

Figure 4 for Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond

In this study, we explore the potential of Multimodal Large Language Models (MLLMs) in improving embodied decision-making processes for agents. While Large Language Models (LLMs) have been widely used due to their advanced reasoning skills and vast world knowledge, MLLMs like GPT4-Vision offer enhanced visual understanding and reasoning capabilities. We investigate whether state-of-the-art MLLMs can handle embodied decision-making in an end-to-end manner and whether collaborations between LLMs and MLLMs can enhance decision-making. To address these questions, we introduce a new benchmark called PCA-EVAL, which evaluates embodied decision-making from the perspectives of Perception, Cognition, and Action. Additionally, we propose HOLMES, a multi-agent cooperation framework that allows LLMs to leverage MLLMs and APIs to gather multimodal information for informed decision-making. We compare end-to-end embodied decision-making and HOLMES on our benchmark and find that the GPT4-Vision model demonstrates strong end-to-end embodied decision-making abilities, outperforming GPT4-HOLMES in terms of average decision accuracy (+3%). However, this performance is exclusive to the latest GPT4-Vision model, surpassing the open-source state-of-the-art MLLM by 26%. Our results indicate that powerful MLLMs like GPT4-Vision hold promise for decision-making in embodied agents, offering new avenues for MLLM research.

* 18 pages, 10 figures

Via

Access Paper or Ask Questions

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

Oct 02, 2023
Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, Baobao Chang

Figure 1 for MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

Figure 2 for MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

Figure 3 for MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

Figure 4 for MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity. However, while LLMs can utilize extensive background knowledge and task information with in-context learning, most VLMs still struggle with understanding complex multi-modal prompts with multiple images, making VLMs less effective in downstream vision-language tasks. In this paper, we address the limitation above by 1) introducing MMICL, a new approach to allow the VLM to deal with multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the in-context learning ability of the VLM; 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks, especially for complex benchmarks, including MME and MMBench. Our analysis demonstrates that MMICL effectively tackles the challenge of complex multi-modal prompt understanding and emerges the impressive ICL ability. Furthermore, we observe that MMICL successfully alleviates language bias in VLMs, a common issue for VLMs that often leads to hallucination when faced with extensive textual context.

* Code, dataset, checkpoints, and demos are available at https://github.com/PKUnlp-icler/MIC

Via

Access Paper or Ask Questions

Meta Semantic Template for Evaluation of Large Language Models

Oct 01, 2023
Yachuan Liu, Liang Chen, Jindong Wang, Qiaozhu Mei, Xing Xie

Figure 1 for Meta Semantic Template for Evaluation of Large Language Models

Figure 2 for Meta Semantic Template for Evaluation of Large Language Models

Do large language models (LLMs) genuinely understand the semantics of the language, or just memorize the training data? The recent concern on potential data contamination of LLMs has raised awareness of the community to conduct research on LLMs evaluation. In this paper, we propose MSTemp, an approach that creates meta semantic templates to evaluate the semantic understanding ability of LLMs. The core of MSTemp is not to perform evaluation directly on existing benchmark datasets, but to generate new out-of-distribution (OOD) evaluation sets using existing datasets as seeds. Specifically, for a given sentence, MSTemp leverages another language model to generate new samples while preserving its semantics. The new samples are called semantic templates to the original sentence. Then, MSTemp generates evaluation samples via sentence parsing and random word replacement on the semantic templates. MSTemp is highly flexible, dynamic, and cost-effective. Our initial experiments show that MSTemp-generated samples can significantly reduce the performance of LLMs using existing datasets as seeds. We hope this initial work can shed light on future research of LLMs evaluation.

* Work in progress; 7 pages; more work at: https://llm-eval.github.io/

Via

Access Paper or Ask Questions

Making Large Language Models Better Reasoners with Alignment

Sep 05, 2023
Peiyi Wang, Lei Li, Liang Chen, Feifan Song, Binghuai Lin, Yunbo Cao, Tianyu Liu, Zhifang Sui

Figure 1 for Making Large Language Models Better Reasoners with Alignment

Figure 2 for Making Large Language Models Better Reasoners with Alignment

Figure 3 for Making Large Language Models Better Reasoners with Alignment

Figure 4 for Making Large Language Models Better Reasoners with Alignment

Reasoning is a cognitive process of using evidence to reach a sound conclusion. The reasoning capability is essential for large language models (LLMs) to serve as the brain of the artificial general intelligence agent. Recent studies reveal that fine-tuning LLMs on data with the chain of thought (COT) reasoning process can significantly enhance their reasoning capabilities. However, we find that the fine-tuned LLMs suffer from an \textit{Assessment Misalignment} problem, i.e., they frequently assign higher scores to subpar COTs, leading to potential limitations in their reasoning abilities. To address this problem, we introduce an \textit{Alignment Fine-Tuning (AFT)} paradigm, which involves three steps: 1) fine-tuning LLMs with COT training data; 2) generating multiple COT responses for each question, and categorizing them into positive and negative ones based on whether they achieve the correct answer; 3) calibrating the scores of positive and negative responses given by LLMs with a novel constraint alignment loss. Specifically, the constraint alignment loss has two objectives: a) Alignment, which guarantees that positive scores surpass negative scores to encourage answers with high-quality COTs; b) Constraint, which keeps the negative scores confined to a reasonable range to prevent the model degradation. Beyond just the binary positive and negative feedback, the constraint alignment loss can be seamlessly adapted to the ranking situations when ranking feedback is accessible. Furthermore, we also delve deeply into recent ranking-based alignment methods, such as DPO, RRHF, and PRO, and discover that the constraint, which has been overlooked by these approaches, is also crucial for their performance. Extensive experiments on four reasoning benchmarks with both binary and ranking feedback demonstrate the effectiveness of AFT.

* Large Language Models; Reasoning; Alignment

Via

Access Paper or Ask Questions

Domain Generalization via Rationale Invariance

Aug 22, 2023
Liang Chen, Yong Zhang, Yibing Song, Anton van den Hengel, Lingqiao Liu

Figure 1 for Domain Generalization via Rationale Invariance

Figure 2 for Domain Generalization via Rationale Invariance

Figure 3 for Domain Generalization via Rationale Invariance

Figure 4 for Domain Generalization via Rationale Invariance

This paper offers a new perspective to ease the challenge of domain generalization, which involves maintaining robust results even in unseen environments. Our design focuses on the decision-making process in the final classifier layer. Specifically, we propose treating the element-wise contributions to the final results as the rationale for making a decision and representing the rationale for each sample as a matrix. For a well-generalized model, we suggest the rationale matrices for samples belonging to the same category should be similar, indicating the model relies on domain-invariant clues to make decisions, thereby ensuring robust results. To implement this idea, we introduce a rationale invariance loss as a simple regularization technique, requiring only a few lines of code. Our experiments demonstrate that the proposed approach achieves competitive results across various datasets, despite its simplicity. Code is available at \url{https://github.com/liangchen527/RIDG}.

* Accepted in ICCV 2023

Via

Access Paper or Ask Questions

SAILOR: Structural Augmentation Based Tail Node Representation Learning

Aug 15, 2023
Jie Liao, Jintang Li, Liang Chen, Bingzhe Wu, Yatao Bian, Zibin Zheng

Figure 1 for SAILOR: Structural Augmentation Based Tail Node Representation Learning

Figure 2 for SAILOR: Structural Augmentation Based Tail Node Representation Learning

Figure 3 for SAILOR: Structural Augmentation Based Tail Node Representation Learning

Figure 4 for SAILOR: Structural Augmentation Based Tail Node Representation Learning

Graph Neural Networks (GNNs) have achieved state-of-the-art performance in representation learning for graphs recently. However, the effectiveness of GNNs, which capitalize on the key operation of message propagation, highly depends on the quality of the topology structure. Most of the graphs in real-world scenarios follow a long-tailed distribution on their node degrees, that is, a vast majority of the nodes in the graph are tail nodes with only a few connected edges. GNNs produce inferior node representations for tail nodes since they lack structural information. In the pursuit of promoting the expressiveness of GNNs for tail nodes, we explore how the deficiency of structural information deteriorates the performance of tail nodes and propose a general Structural Augmentation based taIL nOde Representation learning framework, dubbed as SAILOR, which can jointly learn to augment the graph structure and extract more informative representations for tail nodes. Extensive experiments on public benchmark datasets demonstrate that SAILOR can significantly improve the tail node representations and outperform the state-of-the-art baselines.

* Accepted by CIKM 2023; Code is available at https://github.com/Jie-Re/SAILOR

Via

Access Paper or Ask Questions

A Self-supervised SAR Image Despeckling Strategy Based on Parameter-sharing Convolutional Neural Networks

Aug 11, 2023
Liang Chen, Yifei Yin, Hao Shi, Qingqing Sheng, Wei Li

Figure 1 for A Self-supervised SAR Image Despeckling Strategy Based on Parameter-sharing Convolutional Neural Networks

Figure 2 for A Self-supervised SAR Image Despeckling Strategy Based on Parameter-sharing Convolutional Neural Networks

Figure 3 for A Self-supervised SAR Image Despeckling Strategy Based on Parameter-sharing Convolutional Neural Networks

Figure 4 for A Self-supervised SAR Image Despeckling Strategy Based on Parameter-sharing Convolutional Neural Networks

Speckle noise is generated due to the SAR imaging mechanism, which brings difficulties in SAR image interpretation. Hence, despeckling is a helpful step in SAR pre-processing. Nowadays, deep learning has been proved to be a progressive method for SAR image despeckling. Most deep learning methods for despeckling are based on supervised learning, which needs original SAR images and speckle-free SAR images to train the network. However, the speckle-free SAR images are generally not available. So, this issue was tackled by adding multiplicative noise to optical images synthetically for simulating speckled image. Therefore, there are following challenges in SAR image despeckling: (1) lack of speckle-free SAR image; (2) difficulty in keeping details such as edges and textures in heterogeneous areas. To address these issues, we propose a self-supervised SAR despeckling strategy that can be trained without speckle-free images. Firstly, the feasibility of SAR image despeckling without speckle-free images is proved theoretically. Then, the sub-sampler based on the adjacent-syntropy criteria is proposed. The training image pairs are generated by the sub-sampler from real-word SAR image to estimate the noise distribution. Furthermore, to make full use of training pairs, the parameter sharing convolutional neural networks are adopted. Finally, according to the characteristics of SAR images, a multi-feature loss function is proposed. The proposed loss function is composed of despeckling term, regular term and perception term, to constrain the gap between the generated paired images. The ability of edge and texture feature preserving is improved simultaneously. Finally, qualitative and quantitative experiments are validated on real-world SAR images, showing better performances than several advanced SAR image despeckling methods.

Via

Access Paper or Ask Questions

Deep Richardson-Lucy Deconvolution for Low-Light Image Deblurring

Aug 10, 2023
Liang Chen, Jiawei Zhang, Zhenhua Li, Yunxuan Wei, Faming Fang, Jimmy Ren, Jinshan Pan

Figure 1 for Deep Richardson-Lucy Deconvolution for Low-Light Image Deblurring

Figure 2 for Deep Richardson-Lucy Deconvolution for Low-Light Image Deblurring

Figure 3 for Deep Richardson-Lucy Deconvolution for Low-Light Image Deblurring

Figure 4 for Deep Richardson-Lucy Deconvolution for Low-Light Image Deblurring

Images taken under the low-light condition often contain blur and saturated pixels at the same time. Deblurring images with saturated pixels is quite challenging. Because of the limited dynamic range, the saturated pixels are usually clipped in the imaging process and thus cannot be modeled by the linear blur model. Previous methods use manually designed smooth functions to approximate the clipping procedure. Their deblurring processes often require empirically defined parameters, which may not be the optimal choices for different images. In this paper, we develop a data-driven approach to model the saturated pixels by a learned latent map. Based on the new model, the non-blind deblurring task can be formulated into a maximum a posterior (MAP) problem, which can be effectively solved by iteratively computing the latent map and the latent image. Specifically, the latent map is computed by learning from a map estimation network (MEN), and the latent image estimation process is implemented by a Richardson-Lucy (RL)-based updating scheme. To estimate high-quality deblurred images without amplified artifacts, we develop a prior estimation network (PEN) to obtain prior information, which is further integrated into the RL scheme. Experimental results demonstrate that the proposed method performs favorably against state-of-the-art algorithms both quantitatively and qualitatively on synthetic and real-world images.

* Accepted by IJCV

Via

Access Paper or Ask Questions

M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning

Jun 08, 2023
Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, Lingpeng Kong, Qi Liu

Figure 1 for M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning

Figure 2 for M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning

Figure 3 for M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning

Figure 4 for M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning

Instruction tuning has significantly advanced large language models (LLMs) such as ChatGPT, enabling them to align with human instructions across diverse tasks. However, progress in open vision-language models (VLMs) has been limited due to the scarcity of high-quality instruction datasets. To tackle this challenge and promote research in the vision-language field, we introduce the Multi-Modal, Multilingual Instruction Tuning (M$^3$IT) dataset, designed to optimize VLM alignment with human instructions. Our M$^3$IT dataset comprises 40 carefully curated datasets, including 2.4 million instances and 400 manually written task instructions, reformatted into a vision-to-text structure. Key tasks are translated into 80 languages with an advanced translation system, ensuring broader accessibility. M$^3$IT surpasses previous datasets regarding task coverage, instruction number and instance scale. Moreover, we develop Ying-VLM, a VLM model trained on our M$^3$IT dataset, showcasing its potential to answer complex questions requiring world knowledge, generalize to unseen video tasks, and comprehend unseen instructions in Chinese. We have open-sourced the dataset to encourage further research.

* Fix dataset url: https://huggingface.co/datasets/MMInstruction/M3IT Project: https://m3-it.github.io/

Via

Access Paper or Ask Questions