Abstract: Recent advances in large vision-language models (VLMs) and large language models (LLMs) have enabled zero-shot approaches to vision-and-language navigation (VLN), where an agent follows natural-language instructions using only egocentric perception and reasoning. However, existing zero-shot methods typically construct a naive observation graph and perform per-step VLM-LLM inference on it, incurring high latency and computational costs that limit real-time deployment. To address this, we present SFCo-Nav, an efficient zero-shot VLN framework inspired by the principle of slow-fast cognitive collaboration. SFCo-Nav integrates three key modules: 1) a slow LLM-based planner that produces a strategic chain of subgoals, each linked to an imagined object graph; 2) a fast reactive navigator that constructs the object graph and executes subgoals in real time; and 3) a lightweight asynchronous slow-fast bridge that aligns the structured, attributed imagined and perceived graphs to estimate navigation confidence, triggering the slow LLM planner only when necessary. To the best of our knowledge, SFCo-Nav is the first slow-fast collaborative zero-shot VLN system that triggers the LLM asynchronously based on internal confidence. Evaluated on the public R2R and REVERIE benchmarks, SFCo-Nav matches or exceeds prior state-of-the-art zero-shot VLN success rates while cutting total token consumption per trajectory by over 50% and running more than 3.5 times faster. Finally, we demonstrate SFCo-Nav on a legged robot in a hotel suite, showcasing its efficiency and practicality in indoor environments.
Abstract: Collecting Indoor Environmental Quality (IEQ) data from an occupant's immediate surroundings can provide personalized insights into healthy environmental conditions aligned with occupant preferences, but effective sensor placement for data accuracy and reliability has not been thoroughly explored. This paper examines various positions of IEQ multi-sensing devices at individual workstations in typical office settings, aiming to identify sensor placements that most accurately reflect the environmental conditions experienced by occupants. We evaluated five distinct positions close to an occupant (above and below the monitor, the right side of the desk, the ceiling, and the chair backrest), two orientations, and three desk locations characterized by different lighting levels and thermal and airflow conditions. Data on temperature, humidity, carbon dioxide (CO2), particulate matter (PM1, PM2.5, PM10), illuminance, and sound were collected over a 2-week longitudinal experiment, followed by short-term experiments simulating common pollution events such as coughing and sneezing. Principal Component Analysis, Spearman's rank correlation, R2, and Mean Absolute Error were applied to identify the position and orientation that capture the most information and best match breathing-zone measurements. We found that the position above the monitor, facing the occupant, best captures the IEQ conditions experienced by the occupant.