Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ji Ma

Can Machines Think Like Humans? A Behavioral Evaluation of LLM-Agents in Dictator Games

Oct 28, 2024

Ji Ma

Abstract:As Large Language Model (LLM)-based agents increasingly undertake real-world tasks and engage with human society, how well do we understand their behaviors? This study (1) investigates how LLM agents' prosocial behaviors -- a fundamental social norm -- can be induced by different personas and benchmarked against human behaviors; and (2) introduces a behavioral approach to evaluate the performance of LLM agents in complex decision-making scenarios. We explored how different personas and experimental framings affect these AI agents' altruistic behavior in dictator games and compared their behaviors within the same LLM family, across various families, and with human behaviors. Our findings reveal substantial variations and inconsistencies among LLMs and notable differences compared to human behaviors. Merely assigning a human-like identity to LLMs does not produce human-like behaviors. Despite being trained on extensive human-generated data, these AI agents cannot accurately predict human decisions. LLM agents are not able to capture the internal processes of human decision-making, and their alignment with human behavior is highly variable and dependent on specific model architectures and prompt formulations; even worse, such dependence does not follow a clear pattern.

Via

Access Paper or Ask Questions

Content-decoupled Contrastive Learning-based Implicit Degradation Modeling for Blind Image Super-Resolution

Aug 10, 2024

Jiang Yuan, Ji Ma, Bo Wang, Weiming Hu

Figure 1 for Content-decoupled Contrastive Learning-based Implicit Degradation Modeling for Blind Image Super-Resolution

Figure 2 for Content-decoupled Contrastive Learning-based Implicit Degradation Modeling for Blind Image Super-Resolution

Figure 3 for Content-decoupled Contrastive Learning-based Implicit Degradation Modeling for Blind Image Super-Resolution

Figure 4 for Content-decoupled Contrastive Learning-based Implicit Degradation Modeling for Blind Image Super-Resolution

Abstract:Implicit degradation modeling-based blind super-resolution (SR) has attracted more increasing attention in the community due to its excellent generalization to complex degradation scenarios and wide application range. How to extract more discriminative degradation representations and fully adapt them to specific image features is the key to this task. In this paper, we propose a new Content-decoupled Contrastive Learning-based blind image super-resolution (CdCL) framework following the typical blind SR pipeline. This framework introduces negative-free contrastive learning technique for the first time to model the implicit degradation representation, in which a new cyclic shift sampling strategy is designed to ensure decoupling between content features and degradation features from the data perspective, thereby improving the purity and discriminability of the learned implicit degradation space. In addition, to improve the efficiency and effectiveness of implicit degradation-based blind super-resolving, we design a detail-aware implicit degradation adaption module with lower complexity, which adapts degradation information to the specific LR image from both channel and spatial perspectives. Extensive experiments on synthetic and real data prove that the proposed CdCL comprehensively improves the quantitative and qualitative results of contrastive learning-based implicit blind SR paradigm, and achieves SOTA PSNR in this field. Even if the number of parameters is halved, our method still achieves very competitive results.

Via

Access Paper or Ask Questions

AHMF: Adaptive Hybrid-Memory-Fusion Model for Driver Attention Prediction

Jul 24, 2024

Dongyang Xu, Qingfan Wang, Ji Ma, Xiangyun Zeng, Lei Chen

Figure 1 for AHMF: Adaptive Hybrid-Memory-Fusion Model for Driver Attention Prediction

Figure 2 for AHMF: Adaptive Hybrid-Memory-Fusion Model for Driver Attention Prediction

Figure 3 for AHMF: Adaptive Hybrid-Memory-Fusion Model for Driver Attention Prediction

Figure 4 for AHMF: Adaptive Hybrid-Memory-Fusion Model for Driver Attention Prediction

Abstract:Accurate driver attention prediction can serve as a critical reference for intelligent vehicles in understanding traffic scenes and making informed driving decisions. Though existing studies on driver attention prediction improved performance by incorporating advanced saliency detection techniques, they overlooked the opportunity to achieve human-inspired prediction by analyzing driving tasks from a cognitive science perspective. During driving, drivers' working memory and long-term memory play crucial roles in scene comprehension and experience retrieval, respectively. Together, they form situational awareness, facilitating drivers to quickly understand the current traffic situation and make optimal decisions based on past driving experiences. To explicitly integrate these two types of memory, this paper proposes an Adaptive Hybrid-Memory-Fusion (AHMF) driver attention prediction model to achieve more human-like predictions. Specifically, the model first encodes information about specific hazardous stimuli in the current scene to form working memories. Then, it adaptively retrieves similar situational experiences from the long-term memory for final prediction. Utilizing domain adaptation techniques, the model performs parallel training across multiple datasets, thereby enriching the accumulated driving experience within the long-term memory module. Compared to existing models, our model demonstrates significant improvements across various metrics on multiple public datasets, proving the effectiveness of integrating hybrid memories in driver attention prediction.

Via

Access Paper or Ask Questions

CAMON: Cooperative Agents for Multi-Object Navigation with LLM-based Conversations

Jun 30, 2024

Pengying Wu, Yao Mu, Kangjie Zhou, Ji Ma, Junting Chen, Chang Liu

Figure 1 for CAMON: Cooperative Agents for Multi-Object Navigation with LLM-based Conversations

Figure 2 for CAMON: Cooperative Agents for Multi-Object Navigation with LLM-based Conversations

Abstract:Visual navigation tasks are critical for household service robots. As these tasks become increasingly complex, effective communication and collaboration among multiple robots become imperative to ensure successful completion. In recent years, large language models (LLMs) have exhibited remarkable comprehension and planning abilities in the context of embodied agents. However, their application in household scenarios, specifically in the use of multiple agents collaborating to complete complex navigation tasks through communication, remains unexplored. Therefore, this paper proposes a framework for decentralized multi-agent navigation, leveraging LLM-enabled communication and collaboration. By designing the communication-triggered dynamic leadership organization structure, we achieve faster team consensus with fewer communication instances, leading to better navigation effectiveness and collaborative exploration efficiency. With the proposed novel communication scheme, our framework promises to be conflict-free and robust in multi-object navigation tasks, even when there is a surge in team size.

* Accepted to the RSS 2024 Workshop: GROUND

Via

Access Paper or Ask Questions

C3L: Content Correlated Vision-Language Instruction Tuning Data Generation via Contrastive Learning

May 21, 2024

Ji Ma, Wei Suo, Peng Wang, Yanning Zhang

Abstract:Vision-Language Instruction Tuning (VLIT) is a critical training phase for Large Vision-Language Models (LVLMs). With the improving capabilities of open-source LVLMs, researchers have increasingly turned to generate VLIT data by using open-source LVLMs and achieved significant progress. However, such data generation approaches are bottlenecked by the following challenges: 1) Since multi-modal models tend to be influenced by prior language knowledge, directly using LVLMs to generate VLIT data would inevitably lead to low content relevance between generated data and images. 2) To improve the ability of the models to generate VLIT data, previous methods have incorporated an additional training phase to boost the generative capacity. This process hurts the generalization of the models to unseen inputs (i.e., "exposure bias" problem). In this paper, we propose a new Content Correlated VLIT data generation via Contrastive Learning (C3L). Specifically, we design a new content relevance module which enhances the content relevance between VLIT data and images by computing Image Instruction Correspondence Scores S(I2C). Moreover, a contrastive learning module is introduced to further boost the VLIT data generation capability of the LVLMs. A large number of automatic measures on four benchmarks show the effectiveness of our method.

* Accepted by IJCAI-24

Via

Access Paper or Ask Questions

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Apr 29, 2024

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma(+25 more)

Figure 1 for How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Figure 2 for How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Figure 3 for How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Figure 4 for How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Abstract:In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model -- InternViT-6B, boosting its visual understanding capabilities, and making it can be transferred and reused in different LLMs. (2) Dynamic High-Resolution: we divide images into tiles ranging from 1 to 40 of 448$\times$448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes, document images, and annotated them with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL.

* Technical report

Via

Access Paper or Ask Questions

ASPIRe: An Informative Trajectory Planner with Mutual Information Approximation for Target Search and Tracking

Mar 04, 2024

Kangjie Zhou, Pengying Wu, Yao Su, Han Gao, Ji Ma, Hangxin Liu, Chang Liu

Figure 1 for ASPIRe: An Informative Trajectory Planner with Mutual Information Approximation for Target Search and Tracking

Figure 2 for ASPIRe: An Informative Trajectory Planner with Mutual Information Approximation for Target Search and Tracking

Figure 3 for ASPIRe: An Informative Trajectory Planner with Mutual Information Approximation for Target Search and Tracking

Figure 4 for ASPIRe: An Informative Trajectory Planner with Mutual Information Approximation for Target Search and Tracking

Abstract:This paper proposes an informative trajectory planning approach, namely, \textit{adaptive particle filter tree with sigma point-based mutual information reward approximation} (ASPIRe), for mobile target search and tracking (SAT) in cluttered environments with limited sensing field of view. We develop a novel sigma point-based approximation to accurately estimate mutual information (MI) for general, non-Gaussian distributions utilizing particle representation of the belief state, while simultaneously maintaining high computational efficiency. Building upon the MI approximation, we develop the Adaptive Particle Filter Tree (APFT) approach with MI as the reward, which features belief state tree nodes for informative trajectory planning in continuous state and measurement spaces. An adaptive criterion is proposed in APFT to adjust the planning horizon based on the expected information gain. Simulations and physical experiments demonstrate that ASPIRe achieves real-time computation and outperforms benchmark methods in terms of both search efficiency and estimation accuracy.

* accepted to ICRA 2024

Via

Access Paper or Ask Questions

DOZE: A Dataset for Open-Vocabulary Zero-Shot Object Navigation in Dynamic Environments

Feb 29, 2024

Ji Ma, Hongming Dai, Yao Mu, Pengying Wu, Hao Wang, Xiaowei Chi, Yang Fei, Shanghang Zhang, Chang Liu

Abstract:Zero-Shot Object Navigation (ZSON) requires agents to autonomously locate and approach unseen objects in unfamiliar environments and has emerged as a particularly challenging task within the domain of Embodied AI. Existing datasets for developing ZSON algorithms lack consideration of dynamic obstacles, object attribute diversity, and scene texts, thus exhibiting noticeable discrepancy from real-world situations. To address these issues, we propose a Dataset for Open-Vocabulary Zero-Shot Object Navigation in Dynamic Environments (DOZE) that comprises ten high-fidelity 3D scenes with over 18k tasks, aiming to mimic complex, dynamic real-world scenarios. Specifically, DOZE scenes feature multiple moving humanoid obstacles, a wide array of open-vocabulary objects, diverse distinct-attribute objects, and valuable textual hints. Besides, different from existing datasets that only provide collision checking between the agent and static obstacles, we enhance DOZE by integrating capabilities for detecting collisions between the agent and moving obstacles. This novel functionality enables evaluation of the agents' collision avoidance abilities in dynamic environments. We test four representative ZSON methods on DOZE, revealing substantial room for improvement in existing approaches concerning navigation efficiency, safety, and object recognition accuracy. Our dataset could be found at https://DOZE-Dataset.github.io/.

Via

Access Paper or Ask Questions

VoroNav: Voronoi-based Zero-shot Object Navigation with Large Language Model

Jan 05, 2024

Pengying Wu, Yao Mu, Bingxian Wu, Yi Hou, Ji Ma, Shanghang Zhang, Chang Liu

Figure 1 for VoroNav: Voronoi-based Zero-shot Object Navigation with Large Language Model

Figure 2 for VoroNav: Voronoi-based Zero-shot Object Navigation with Large Language Model

Figure 3 for VoroNav: Voronoi-based Zero-shot Object Navigation with Large Language Model

Figure 4 for VoroNav: Voronoi-based Zero-shot Object Navigation with Large Language Model

Abstract:In the realm of household robotics, the Zero-Shot Object Navigation (ZSON) task empowers agents to adeptly traverse unfamiliar environments and locate objects from novel categories without prior explicit training. This paper introduces VoroNav, a novel semantic exploration framework that proposes the Reduced Voronoi Graph to extract exploratory paths and planning nodes from a semantic map constructed in real time. By harnessing topological and semantic information, VoroNav designs text-based descriptions of paths and images that are readily interpretable by a large language model (LLM). Our approach presents a synergy of path and farsight descriptions to represent the environmental context, enabling the LLM to apply commonsense reasoning to ascertain the optimal waypoints for navigation. Extensive evaluation on the HM3D and HSSD datasets validates that VoroNav surpasses existing ZSON benchmarks in both success rates and exploration efficiency (+2.8% Success and +3.7% SPL on HM3D, +2.6% Success and +3.8% SPL on HSSD). Additionally introduced metrics that evaluate obstacle avoidance proficiency and perceptual efficiency further corroborate the enhancements achieved by our method in ZSON planning.

Via

Access Paper or Ask Questions

OpenMSD: Towards Multilingual Scientific Documents Similarity Measurement

Sep 19, 2023

Yang Gao, Ji Ma, Ivan Korotkov, Keith Hall, Dana Alon, Don Metzler

Figure 1 for OpenMSD: Towards Multilingual Scientific Documents Similarity Measurement

Figure 2 for OpenMSD: Towards Multilingual Scientific Documents Similarity Measurement

Figure 3 for OpenMSD: Towards Multilingual Scientific Documents Similarity Measurement

Figure 4 for OpenMSD: Towards Multilingual Scientific Documents Similarity Measurement

Abstract:We develop and evaluate multilingual scientific documents similarity measurement models in this work. Such models can be used to find related works in different languages, which can help multilingual researchers find and explore papers more efficiently. We propose the first multilingual scientific documents dataset, Open-access Multilingual Scientific Documents (OpenMSD), which has 74M papers in 103 languages and 778M citation pairs. With OpenMSD, we pretrain science-specialized language models, and explore different strategies to derive "related" paper pairs to fine-tune the models, including using a mixture of citation, co-citation, and bibliographic-coupling pairs. To further improve the models' performance for non-English papers, we explore the use of generative language models to enrich the non-English papers with English summaries. This allows us to leverage the models' English capabilities to create better representations for non-English papers. Our best model significantly outperforms strong baselines by 7-16% (in mean average precision).

* Scripts for constructing the OpenMSD dataset is available at: https://github.com/google-research/google-research/tree/master/OpenMSD

Via

Access Paper or Ask Questions