Hang Yin

Do Multimodal Language Models Really Understand Direction? A Benchmark for Compass Direction Reasoning

Dec 21, 2024

Hang Yin, Zhifeng Lin, Xin Liu, Bin Sun, Kan Li

Abstract: Direction reasoning is essential for intelligent systems to understand the real world. While existing work focuses primarily on spatial reasoning, compass direction reasoning remains underexplored. To address this, we propose the Compass Direction Reasoning (CDR) benchmark, designed to evaluate the direction reasoning capabilities of multimodal language models (MLMs). CDR includes three types of images to test spatial (up, down, left, right) and compass (north, south, east, west) directions. Our evaluation reveals that most MLMs struggle with direction reasoning, often performing at random guessing levels. Experiments show that training directly with CDR data yields limited improvements, as it requires an understanding of real-world physical rules. We explore the impact of mixdata and CoT fine-tuning methods, which significantly enhance MLM performance in compass direction reasoning by incorporating diverse data and step-by-step reasoning, improving the model's ability to understand direction relationships.
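
To make the link between spatial and compass directions concrete, the following minimal Python sketch shows one rule a compass question can depend on: turning a relative direction (left, right, behind) into a compass direction given the observer's heading. The function name `relative_to_compass`, the lookup tables, and the four-direction setup are illustrative assumptions and are not taken from the CDR benchmark.

```python
# Illustrative sketch only; not code from the CDR benchmark.
# Compass directions listed clockwise, so a right turn moves one step forward.
COMPASS = ["north", "east", "south", "west"]

# Clockwise quarter-turns for each relative direction (hypothetical naming).
QUARTER_TURNS = {"front": 0, "right": 1, "behind": 2, "left": 3}


def relative_to_compass(heading: str, relative: str) -> str:
    """Return the compass direction lying `relative` to an observer facing `heading`."""
    start = COMPASS.index(heading)
    return COMPASS[(start + QUARTER_TURNS[relative]) % len(COMPASS)]


if __name__ == "__main__":
    # An observer facing east has north on their left.
    print(relative_to_compass("east", "left"))
```

A question such as "I am facing east; what lies to my left?" reduces to this kind of lookup, which is the sort of step a chain-of-thought answer would need to spell out.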

Large Language Models and Artificial Intelligence Generated Content Technologies Meet Communication Networks

Nov 12, 2024

Jie Guo, Meiting Wang, Hang Yin, Bin Song, Yuhao Chi, Fei Richard Yu, Chau Yuen

Abstract: Artificial intelligence generated content (AIGC) technologies, with a predominance of large language models (LLMs), have demonstrated remarkable performance improvements in various applications, which have attracted great interest from both academia and industry. Although some noteworthy advancements have been made in this area, a comprehensive exploration of the intricate relationship between AIGC and communication networks remains relatively limited. To address this issue, this paper conducts an exhaustive survey from dual standpoints: firstly, it scrutinizes the integration of LLMs and AIGC technologies within the domain of communication networks; secondly, it investigates how communication networks can further bolster the capabilities of LLMs and AIGC. Additionally, this research explores the promising applications along with the challenges encountered during the incorporation of these AI technologies into communication networks. Through these detailed analyses, our work aims to deepen the understanding of how LLMs and AIGC can synergize with and enhance the development of advanced intelligent communication networks, contributing to a more profound comprehension of next-generation intelligent communication networks.

* Accepted by IEEE Internet of Things Journal

SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation

Oct 10, 2024

Hang Yin, Xiuwei Xu, Zhenyu Wu, Jie Zhou, Jiwen Lu

Abstract: In this paper, we propose a new framework for zero-shot object navigation. Existing zero-shot object navigation methods prompt the LLM with the text of spatially close objects, which lacks enough scene context for in-depth reasoning. To better preserve the information of the environment and fully exploit the reasoning ability of the LLM, we propose to represent the observed scene with a 3D scene graph. The scene graph encodes the relationships between objects, groups and rooms with an LLM-friendly structure, for which we design a hierarchical chain-of-thought prompt to help the LLM reason about the goal location according to scene context by traversing the nodes and edges. Moreover, benefiting from the scene graph representation, we further design a re-perception mechanism to empower the object navigation framework with the ability to correct perception errors. We conduct extensive experiments on MP3D, HM3D and RoboTHOR environments, where SG-Nav surpasses previous state-of-the-art zero-shot methods by more than 10% SR on all benchmarks, while the decision process is explainable. To the best of our knowledge, SG-Nav is the first zero-shot method that achieves even higher performance than supervised object navigation methods on the challenging MP3D benchmark.

* Accepted to NeurIPS 2024. Project page: https://bagh2178.github.io/SG-Nav/
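
As a rough illustration of what prompting an LLM with a hierarchical scene graph could look like, the sketch below stores rooms, groups and objects as a small tree and serializes it into indented text. The `Node` class, the `describe` and `build_prompt` helpers, and the prompt wording are all hypothetical; the sketch covers only the containment hierarchy, not the object-to-object edges or the re-perception mechanism, and it is not the SG-Nav implementation.

```python
# Illustrative sketch only; the data layout and prompt wording are assumptions,
# not the SG-Nav implementation.
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    level: str  # "room", "group", or "object"
    children: list[Node] = field(default_factory=list)


def describe(node: Node, depth: int = 0) -> list[str]:
    """Serialize the tree as indented lines, one node per line."""
    lines = [f"{'  ' * depth}- {node.level}: {node.name}"]
    for child in node.children:
        lines.extend(describe(child, depth + 1))
    return lines


def build_prompt(root: Node, goal: str) -> str:
    """Wrap the serialized graph in a hypothetical step-by-step prompt."""
    graph_text = "\n".join(describe(root))
    return (
        f"Scene graph:\n{graph_text}\n"
        f"Goal object: {goal}\n"
        "Reason step by step from rooms to groups to objects, "
        "then name the node most likely to be near the goal."
    )


if __name__ == "__main__":
    table = Node("table", "object")
    dining = Node("dining set", "group", [table])
    kitchen = Node("kitchen", "room", [dining])
    print(build_prompt(kitchen, "mug"))
```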

Reducing Variance in Meta-Learning via Laplace Approximation for Regression Tasks

Oct 02, 2024

Alfredo Reichlin, Gustaf Tegnér, Miguel Vasco, Hang Yin, Mårten Björkman, Danica Kragic

Abstract: Given a finite set of sample points, meta-learning algorithms aim to learn an optimal adaptation strategy for new, unseen tasks. Often, this data can be ambiguous as it might belong to different tasks concurrently. This is particularly the case in meta-regression tasks. In such cases, the estimated adaptation strategy is subject to high variance due to the limited amount of support data for each task, which often leads to sub-optimal generalization performance. In this work, we address the problem of variance reduction in gradient-based meta-learning and formalize the class of problems prone to this, a condition we refer to as task overlap. Specifically, we propose a novel approach that reduces the variance of the gradient estimate by weighting each support point individually by the variance of its posterior over the parameters. To estimate the posterior, we utilize the Laplace approximation, which allows us to express the variance in terms of the curvature of the loss landscape of our meta-learner. Experimental results demonstrate the effectiveness of the proposed method and highlight the importance of variance reduction in meta-learning.
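
The connection between curvature and variance can be seen in a one-parameter toy example. Under a Laplace approximation, the posterior is treated as a Gaussian whose variance is the inverse of the loss curvature at the mode. The sketch below estimates that curvature by finite differences and then combines two estimates with inverse-variance weights. The inverse-variance weighting rule, the function names, and the quadratic losses are assumptions chosen for illustration and should not be read as the paper's algorithm.

```python
# Illustrative sketch only; a one-parameter toy, not the paper's algorithm.

def curvature(loss, theta: float, eps: float = 1e-3) -> float:
    """Second derivative of `loss` at `theta` by central finite differences."""
    return (loss(theta + eps) - 2.0 * loss(theta) + loss(theta - eps)) / eps**2


def laplace_variance(loss, mode: float) -> float:
    """Laplace approximation in 1D: Gaussian variance = 1 / curvature at the mode."""
    return 1.0 / curvature(loss, mode)


def precision_weighted_mean(estimates, variances) -> float:
    """Combine estimates so that low-variance ones receive more weight."""
    weights = [1.0 / v for v in variances]
    return sum(w * e for w, e in zip(weights, estimates)) / sum(weights)


if __name__ == "__main__":
    # Two hypothetical support points with quadratic negative log-posteriors.
    def sharp(t):
        return 0.5 * 4.0 * (t - 1.0) ** 2  # curvature 4, mode at 1.0

    def flat(t):
        return 0.5 * 1.0 * (t - 3.0) ** 2  # curvature 1, mode at 3.0

    variances = [laplace_variance(sharp, 1.0), laplace_variance(flat, 3.0)]
    print(variances)
    print(precision_weighted_mean([1.0, 3.0], variances))
```

The sharply curved loss yields the smaller variance, so its estimate dominates the combined value; a flat, ambiguous loss contributes less.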

Confidential Computing on nVIDIA H100 GPU: A Performance Benchmark Study

Sep 06, 2024

Jianwei Zhu, Hang Yin, Shunfan Zhou

Abstract: This report evaluates the performance impact of enabling Trusted Execution Environments (TEE) on NVIDIA H100 GPUs for large language model (LLM) inference tasks. We benchmark the overhead introduced by TEE mode across various models and token lengths, focusing on the bottleneck caused by CPU-GPU data transfers via PCIe. Our results show that while there is minimal computational overhead within the GPU, the overall performance penalty is primarily due to data transfer. For most typical LLM queries, the overhead remains below 5%, with larger models and longer sequences experiencing near-zero overhead.
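
One way to see why longer sequences would show smaller relative overhead is a toy latency model in which TEE mode slows only the CPU-GPU transfer portion of a request. The sketch below is a hypothetical model with made-up timings: the function names, the `transfer_penalty` parameter, and every number are placeholders, not measurements from the report.

```python
# Illustrative sketch only; all timings are made-up placeholders,
# not results from the report.

def overhead_percent(baseline_s: float, tee_s: float) -> float:
    """Relative slowdown of the TEE run over the baseline run, in percent."""
    return 100.0 * (tee_s - baseline_s) / baseline_s


def modeled_latency(compute_s: float, transfer_s: float, transfer_penalty: float) -> float:
    """Toy model: only the transfer part is slowed, by a factor (1 + transfer_penalty)."""
    return compute_s + transfer_s * (1.0 + transfer_penalty)


if __name__ == "__main__":
    transfer_s = 0.01  # hypothetical fixed PCIe transfer time per request
    for compute_s in (0.1, 1.0, 10.0):  # hypothetical GPU compute times
        base = modeled_latency(compute_s, transfer_s, 0.0)
        tee = modeled_latency(compute_s, transfer_s, 1.0)
        print(f"compute={compute_s:>5}s overhead={overhead_percent(base, tee):.2f}%")
```

In this model the absolute penalty stays fixed while GPU compute time grows, so the percentage overhead shrinks as requests get heavier.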

Unfolding the Literature: A Review of Robotic Cloth Manipulation

Jul 01, 2024

Alberta Longhini, Yufei Wang, Irene Garcia-Camacho, David Blanco-Mulero, Marco Moletta, Michael Welle, Guillem Alenyà, Hang Yin, Zackory Erickson, David Held (+2 more)

Abstract: The realm of textiles spans clothing, households, healthcare, sports, and industrial applications. The deformable nature of these objects poses unique challenges that prior work on rigid objects cannot fully address. The increasing interest within the community in textile perception and manipulation has led to new methods that aim to address challenges in modeling, perception, and control, resulting in significant progress. However, this progress is often tailored to one specific textile or a subcategory of these textiles. To understand what restricts these methods and hinders current approaches from generalizing to a broader range of real-world textiles, this review provides an overview of the field, focusing specifically on how and to what extent textile variations are addressed in modeling, perception, benchmarking, and manipulation of textiles. We finally conclude by identifying key open problems and outlining grand challenges that will drive future advancements in the field.

* 30 pages, 3 figures, 2 tables. Submitted to Annual Review of Control, Robotics, and Autonomous Systems

Entity Alignment with Unlabeled Dangling Cases

Mar 16, 2024

Hang Yin, Dong Ding, Liyao Xiang, Yuheng He, Yihan Wu, Xinbing Wang, Chenghu Zhou

Abstract: We investigate the entity alignment problem with unlabeled dangling cases, meaning that there are entities in the source or target graph having no counterparts in the other, and those entities remain unlabeled. The problem arises when the source and target graphs are of different scales, and it is much cheaper to label the matchable pairs than the dangling entities. To solve the issue, we propose a novel GNN-based dangling detection and entity alignment framework. While the two tasks share the same GNN and are trained together, the detected dangling entities are removed in the alignment. Our framework features an entity and relation attention mechanism designed for selective neighborhood aggregation in representation learning, as well as a positive-unlabeled learning loss for an unbiased estimation of dangling entities. Experimental results have shown that each component of our design contributes to the overall alignment performance, which is comparable or superior to baselines, even if the baselines additionally have 30% of the dangling entities labeled as training data.

* 14 pages
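
For readers unfamiliar with positive-unlabeled (PU) learning, the sketch below computes the standard unbiased PU risk estimate from classifier scores on labeled positives and on unlabeled examples. Using this particular estimator, the sigmoid surrogate loss, and the class prior value are assumptions made for illustration; the abstract does not specify the exact loss used in the paper.

```python
# Illustrative sketch only; a generic unbiased PU risk estimator,
# not necessarily the loss used in the paper.
import math
from statistics import fmean


def sigmoid_loss(score: float, label: int) -> float:
    """Smooth surrogate loss; small when sign(score) agrees with label (+1 or -1)."""
    return 1.0 / (1.0 + math.exp(label * score))


def unbiased_pu_risk(pos_scores, unl_scores, prior: float) -> float:
    """Risk = prior * R_p(+1) + R_u(-1) - prior * R_p(-1).

    `prior` is the assumed fraction of positives in the unlabeled data.
    """
    pos_as_pos = fmean([sigmoid_loss(s, +1) for s in pos_scores])
    pos_as_neg = fmean([sigmoid_loss(s, -1) for s in pos_scores])
    unl_as_neg = fmean([sigmoid_loss(s, -1) for s in unl_scores])
    return prior * pos_as_pos + unl_as_neg - prior * pos_as_neg


if __name__ == "__main__":
    # Hypothetical scores: positives score high, unlabeled examples are mixed.
    print(unbiased_pu_risk([2.0, 1.5], [1.8, -2.0, -1.0], prior=0.3))
```

The subtraction term corrects for the positives hidden inside the unlabeled set, which is what lets the estimate stand in for a risk computed with true negative labels.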

Meta Operator for Complex Query Answering on Knowledge Graphs

Mar 15, 2024

Hang Yin, Zihao Wang, Yangqiu Song

Abstract: Knowledge graphs contain informative factual knowledge but are considered incomplete. To answer complex queries under incomplete knowledge, learning-based Complex Query Answering (CQA) models are proposed to directly learn from the query-answer samples to avoid the direct traversal of incomplete graph data. Existing works formulate the training of complex query answering models as multi-task learning and require a large number of training samples. In this work, we explore the compositional structure of complex queries and argue that the different logical operator types, rather than the different complex query types, are the key to improving generalizability. Accordingly, we propose a meta-learning algorithm to learn the meta-operators with limited data and adapt them to different instances of operators under various complex queries. Empirical results show that learning meta-operators is more effective than learning original CQA or meta-CQA models.

LM2D: Lyrics- and Music-Driven Dance Synthesis

Mar 14, 2024

Wenjie Yin, Xuejiao Zhao, Yi Yu, Hang Yin, Danica Kragic, Mårten Björkman

Abstract: Dance typically involves professional choreography with complex movements that follow a musical rhythm and can also be influenced by lyrical content. The integration of lyrics, in addition to the auditory dimension, enriches the foundational tone and makes motion generation more amenable to its semantic meanings. However, existing dance synthesis methods tend to model motions only conditioned on audio signals. In this work, we make two contributions to bridge this gap. First, we propose LM2D, a novel probabilistic architecture that incorporates a multimodal diffusion model with consistency distillation, designed to create dance conditioned on both music and lyrics in one diffusion generation step. Second, we introduce the first 3D dance-motion dataset that encompasses both music and lyrics, obtained with pose estimation technologies. We evaluate our model against music-only baseline models with objective metrics and human evaluations, including dancers and choreographers. The results demonstrate LM2D is able to produce realistic and diverse dance matching both lyrics and music. A video summary can be accessed at: https://youtu.be/4XCgvYookvA.

BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation

Mar 14, 2024

Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Wensi Ai, Benjamin Martinez (+25 more)

Abstract: We present BEHAVIOR-1K, a comprehensive simulation benchmark for human-centered robotics. BEHAVIOR-1K includes two components, guided and motivated by the results of an extensive survey on "what do you want robots to do for you?". The first is the definition of 1,000 everyday activities, grounded in 50 scenes (houses, gardens, restaurants, offices, etc.) with more than 9,000 objects annotated with rich physical and semantic properties. The second is OMNIGIBSON, a novel simulation environment that supports these activities via realistic physics simulation and rendering of rigid bodies, deformable bodies, and liquids. Our experiments indicate that the activities in BEHAVIOR-1K are long-horizon and dependent on complex manipulation skills, both of which remain a challenge for even state-of-the-art robot learning solutions. To calibrate the simulation-to-reality gap of BEHAVIOR-1K, we provide an initial study on transferring solutions learned with a mobile manipulator in a simulated apartment to its real-world counterpart. We hope that BEHAVIOR-1K's human-grounded nature, diversity, and realism make it valuable for embodied AI and robot learning research. Project website: https://behavior.stanford.edu.

* A preliminary version was published at the 6th Conference on Robot Learning (CoRL 2022)
