Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

"Time": models, code, and papers

WebLINX: Real-World Website Navigation with Multi-Turn Dialogue

Feb 08, 2024
Xing Han Lù, Zdeněk Kasner, Siva Reddy

We propose the problem of conversational web navigation, where a digital agent controls a web browser and follows user instructions to solve real-world tasks in a multi-turn dialogue fashion. To support this problem, we introduce WEBLINX - a large-scale benchmark of 100K interactions across 2300 expert demonstrations of conversational web navigation. Our benchmark covers a broad range of patterns on over 150 real-world websites and can be used to train and evaluate agents in diverse scenarios. Due to the magnitude of information present, Large Language Models (LLMs) cannot process entire web pages in real-time. To solve this bottleneck, we design a retrieval-inspired model that efficiently prunes HTML pages by ranking relevant elements. We use the selected elements, along with screenshots and action history, to assess a variety of models for their ability to replicate human behavior when navigating the web. Our experiments span from small text-only to proprietary multimodal LLMs. We find that smaller finetuned decoders surpass the best zero-shot LLMs (including GPT-4V), but also larger finetuned multimodal models which were explicitly pretrained on screenshots. However, all finetuned models struggle to generalize to unseen websites. Our findings highlight the need for large multimodal models that can generalize to novel settings. Our code, data and models are available for research: https://mcgill-nlp.github.io/weblinx

Via

Access Paper or Ask Questions

SIMPL: A Simple and Efficient Multi-agent Motion Prediction Baseline for Autonomous Driving

Feb 04, 2024
Lu Zhang, Peiliang Li, Sikang Liu, Shaojie Shen

This paper presents a Simple and effIcient Motion Prediction baseLine (SIMPL) for autonomous vehicles. Unlike conventional agent-centric methods with high accuracy but repetitive computations and scene-centric methods with compromised accuracy and generalizability, SIMPL delivers real-time, accurate motion predictions for all relevant traffic participants. To achieve improvements in both accuracy and inference speed, we propose a compact and efficient global feature fusion module that performs directed message passing in a symmetric manner, enabling the network to forecast future motion for all road users in a single feed-forward pass and mitigating accuracy loss caused by viewpoint shifting. Additionally, we investigate the continuous trajectory parameterization using Bernstein basis polynomials in trajectory decoding, allowing evaluations of states and their higher-order derivatives at any desired time point, which is valuable for downstream planning tasks. As a strong baseline, SIMPL exhibits highly competitive performance on Argoverse 1 & 2 motion forecasting benchmarks compared with other state-of-the-art methods. Furthermore, its lightweight design and low inference latency make SIMPL highly extensible and promising for real-world onboard deployment. We open-source the code at https://github.com/HKUST-Aerial-Robotics/SIMPL.

* Code is available at https://github.com/HKUST-Aerial-Robotics/SIMPL

Via

Access Paper or Ask Questions

Examining Modality Incongruity in Multimodal Federated Learning for Medical Vision and Language-based Disease Detection

Feb 07, 2024
Pramit Saha, Divyanshu Mishra, Felix Wagner, Konstantinos Kamnitsas, J. Alison Noble

Multimodal Federated Learning (MMFL) utilizes multiple modalities in each client to build a more powerful Federated Learning (FL) model than its unimodal counterpart. However, the impact of missing modality in different clients, also called modality incongruity, has been greatly overlooked. This paper, for the first time, analyses the impact of modality incongruity and reveals its connection with data heterogeneity across participating clients. We particularly inspect whether incongruent MMFL with unimodal and multimodal clients is more beneficial than unimodal FL. Furthermore, we examine three potential routes of addressing this issue. Firstly, we study the effectiveness of various self-attention mechanisms towards incongruity-agnostic information fusion in MMFL. Secondly, we introduce a modality imputation network (MIN) pre-trained in a multimodal client for modality translation in unimodal clients and investigate its potential towards mitigating the missing modality problem. Thirdly, we assess the capability of client-level and server-level regularization techniques towards mitigating modality incongruity effects. Experiments are conducted under several MMFL settings on two publicly available real-world datasets, MIMIC-CXR and Open-I, with Chest X-Ray and radiology reports.

* 42 pages

Via

Access Paper or Ask Questions

Geometric Slosh-Free Tracking for Robotic Manipulators

Feb 07, 2024
Jon Arrizabalaga, Lukas Pries, Riddhiman Laha, Runkang Li, Sami Haddadin, Markus Ryll

This work focuses on the agile transportation of liquids with robotic manipulators. In contrast to existing methods that are either computationally heavy, system/container specific or dependant on a singularity-prone pendulum model, we present a real-time slosh-free tracking technique. This method solely requires the reference trajectory and the robot's kinematic constraints to output kinematically feasible joint space commands. The crucial element underlying this approach consists on mimicking the end-effector's motion through a virtual quadrotor, which is inherently slosh-free and differentially flat, thereby allowing us to calculate a slosh-free reference orientation. Through the utilization of a cascaded proportional-derivative (PD) controller, this slosh-free reference is transformed into task space acceleration commands, which, following the resolution of a Quadratic Program (QP) based on Resolved Acceleration Control (RAC), are translated into a feasible joint configuration. The validity of the proposed approach is demonstrated by simulated and real-world experiments on a 7 DoF Franka Emika Panda robot. Code: https://github.com/jonarriza96/gsft Video: https://youtu.be/4kitqYVS9n8

* This paper has been accepted for publication at the IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, May 2024. Copyright @ IEEE

Via

Access Paper or Ask Questions

Moco: A Learnable Meta Optimizer for Combinatorial Optimization

Feb 07, 2024
Tim Dernedde, Daniela Thyssens, Sören Dittrich, Maximilan Stubbemann, Lars Schmidt-Thieme

Relevant combinatorial optimization problems (COPs) are often NP-hard. While they have been tackled mainly via handcrafted heuristics in the past, advances in neural networks have motivated the development of general methods to learn heuristics from data. Many approaches utilize a neural network to directly construct a solution, but are limited in further improving based on already constructed solutions at inference time. Our approach, Moco, learns a graph neural network that updates the solution construction procedure based on features extracted from the current search state. This meta training procedure targets the overall best solution found during the search procedure given information such as the search budget. This allows Moco to adapt to varying circumstances such as different computational budgets. Moco is a fully learnable meta optimizer that does not utilize any problem specific local search or decomposition. We test Moco on the Traveling Salesman Problem (TSP) and Maximum Independent Set (MIS) and show that it outperforms other approaches on MIS and is overall competitive on the TSP, especially outperforming related approaches, partially even if they use additional local search.

* 13 pages, 3 figures

Via

Access Paper or Ask Questions

Hierarchical Tree-structured Knowledge Graph For Academic Insight Survey

Feb 07, 2024
Jinghong Li, Huy Phan, Wen Gu, Koichi Ota, Shinobu Hasegawa

Research surveys have always posed a challenge for beginner researchers who lack of research training. These researchers struggle to understand the directions within their research topic, and the discovery of new research findings within a short time. One way to provide intuitive assistance to beginner researchers is by offering relevant knowledge graphs(KG) and recommending related academic papers. However, existing navigation knowledge graphs primarily rely on keywords in the research field and often fail to present the logical hierarchy among multiple related papers clearly. Moreover, most recommendation systems for academic papers simply rely on high text similarity, which can leave researchers confused as to why a particular article is being recommended. They may lack of grasp important information about the insight connection between "Issue resolved" and "Issue finding" that they hope to obtain. To address these issues, this study aims to support research insight surveys for beginner researchers by establishing a hierarchical tree-structured knowledge graph that reflects the inheritance insight of research topics and the relevance insight among the academic papers.

* This paper will submit to '27th International Symposium on Methodologies for Intelligent Systems'(ISMIS 2024)

Via

Access Paper or Ask Questions

Robot Interaction Behavior Generation based on Social Motion Forecasting for Human-Robot Interaction

Feb 07, 2024
Esteve Valls Mascaro, Yashuai Yan, Dongheui Lee

Integrating robots into populated environments is a complex challenge that requires an understanding of human social dynamics. In this work, we propose to model social motion forecasting in a shared human-robot representation space, which facilitates us to synthesize robot motions that interact with humans in social scenarios despite not observing any robot in the motion training. We develop a transformer-based architecture called ECHO, which operates in the aforementioned shared space to predict the future motions of the agents encountered in social scenarios. Contrary to prior works, we reformulate the social motion problem as the refinement of the predicted individual motions based on the surrounding agents, which facilitates the training while allowing for single-motion forecasting when only one human is in the scene. We evaluate our model in multi-person and human-robot motion forecasting tasks and obtain state-of-the-art performance by a large margin while being efficient and performing in real-time. Additionally, our qualitative results showcase the effectiveness of our approach in generating human-robot interaction behaviors that can be controlled via text commands.

* Accepted at ICRA 2024

Via

Access Paper or Ask Questions

AINS: Affordable Indoor Navigation Solution via Line Color Identification Using Mono-Camera for Autonomous Vehicles

Feb 07, 2024
Nizamuddin Maitlo, Nooruddin Noonari, Kaleem Arshid, Naveed Ahmed, Sathishkumar Duraisamy

Recently, researchers have been exploring various ways to improve the effectiveness and efficiency of autonomous vehicles by researching new methods, especially for indoor scenarios. Autonomous Vehicles in indoor navigation systems possess many challenges especially the limited accuracy of GPS in indoor scenarios. Several, robust methods have been explored for autonomous vehicles in indoor scenarios to solve this problem, but the ineffectiveness of the proposed methods is the high deployment cost. To address the above-mentioned problems we have presented A low-cost indoor navigation method for autonomous vehicles called Affordable Indoor Navigation Solution (AINS) which is based on based on Monocular Camera. Our proposed solution is mainly based on a mono camera without relying on various huge or power-inefficient sensors to find the path, such as range finders and other navigation sensors. Our proposed method shows that we can deploy autonomous vehicles indoor navigation systems while taking into consideration the cost. We can observe that the results shown by our solution are better than existing solutions and we can reduce the estimated error and time consumption.

Via

Access Paper or Ask Questions

Riemannian Preconditioned LoRA for Fine-Tuning Foundation Models

Feb 07, 2024
Fangzhao Zhang, Mert Pilanci

In this work we study the enhancement of Low Rank Adaptation (LoRA) fine-tuning procedure by introducing a Riemannian preconditioner in its optimization step. Specifically, we introduce an $r\times r$ preconditioner in each gradient step where $r$ is the LoRA rank. This preconditioner requires a small change to existing optimizer code and creates virtually minuscule storage and runtime overhead. Our experimental results with both large language models and text-to-image diffusion models show that with our preconditioner, the convergence and reliability of SGD and AdamW can be significantly enhanced. Moreover, the training process becomes much more robust to hyperparameter choices such as learning rate. Theoretically, we show that fine-tuning a two-layer ReLU network in the convex paramaterization with our preconditioner has convergence rate independent of condition number of the data matrix. This new Riemannian preconditioner, previously explored in classic low-rank matrix recovery, is introduced to deep learning tasks for the first time in our work. We release our code at https://github.com/pilancilab/Riemannian_Preconditioned_LoRA.

Via

Access Paper or Ask Questions

A Differentiable Partially Observable Generalized Linear Model with Forward-Backward Message Passing

Feb 07, 2024
Chengrui Li, Weihan Li, Yule Wang, Anqi Wu

The partially observable generalized linear model (POGLM) is a powerful tool for understanding neural connectivity under the assumption of existing hidden neurons. With spike trains only recorded from visible neurons, existing works use variational inference to learn POGLM meanwhile presenting the difficulty of learning this latent variable model. There are two main issues: (1) the sampled Poisson hidden spike count hinders the use of the pathwise gradient estimator in VI; and (2) the existing design of the variational model is neither expressive nor time-efficient, which further affects the performance. For (1), we propose a new differentiable POGLM, which enables the pathwise gradient estimator, better than the score function gradient estimator used in existing works. For (2), we propose the forward-backward message-passing sampling scheme for the variational model. Comprehensive experiments show that our differentiable POGLMs with our forward-backward message passing produce a better performance on one synthetic and two real-world datasets. Furthermore, our new method yields more interpretable parameters, underscoring its significance in neuroscience.

Via

Access Paper or Ask Questions