Abstract:This work investigates how semantics influence localisation performance and robustness in a learned self-supervised, contrastive semantic localisation framework. After training a localisation network on both original and perturbed maps, we conduct a thorough post-hoc introspection analysis to probe whether the model filters environmental noise and prioritises distinctive landmarks over routine clutter. We validate various interpretability methods and present a comparative reliability analysis. Integrated gradients and attention weights consistently emerge as the most reliable probes of learned behaviour. A semantic class ablation further reveals an implicit weighting in which frequent objects are often down-weighted. Overall, the results indicate that the model learns noise-robust, semantically salient relations that define a place, thereby enabling explainable registration under challenging visual and structural variations.
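As an illustration of one of the probes named above, the following is a minimal sketch of integrated gradients using a midpoint Riemann sum. It is not the authors' implementation: `grad_fn`, the baseline choice, and the number of steps are assumptions made for this example, and a real model would supply its gradient through automatic differentiation.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate integrated gradients along the straight path baseline -> x.

    grad_fn:  hypothetical callable returning dF/dx at a given input.
    baseline: reference input (an all-zeros baseline is a common assumption).
    """
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    # Midpoint rule: sample the interpolation coefficient in (0, 1).
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    # Average gradient along the path, scaled by the input difference.
    return (x - baseline) * grads.mean(axis=0)

# Toy model F(x) = sum(x^2), whose gradient is 2x.
toy_grad = lambda z: 2.0 * z
attributions = integrated_gradients(toy_grad, [1.0, 2.0], [0.0, 0.0])
```

A useful sanity check is completeness: the attributions should sum to the difference between the model output at the input and at the baseline.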
Abstract:This work explores the application of ensemble modeling to the multidimensional regression problem of trajectory prediction for vehicles in urban environments. As newer and bigger state-of-the-art prediction models for autonomous driving continue to emerge, an important open challenge is the problem of how to combine the strengths of these big models without the need for costly re-training. We show how, perhaps surprisingly, combining state-of-the-art deep learning models out-of-the-box (without retraining or fine-tuning) with a simple confidence-weighted average method can enhance the overall prediction. Indeed, while combining trajectory prediction models is not straightforward, this simple approach enhances performance by 10% over the best prediction model, especially in the long-tailed metrics. We show that this performance improvement holds on both the NuScenes and Argoverse datasets, and that these improvements are made across the dataset distribution. The code for our work is open source.
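A confidence-weighted average of this kind can be sketched as follows. This is a simplified illustration, not the released code: it assumes each model contributes a single trajectory with one scalar confidence, whereas real predictors are typically multi-modal, and the function name and array shapes are hypothetical.

```python
import numpy as np

def confidence_weighted_average(trajectories, confidences):
    """Fuse per-model trajectory predictions with a confidence-weighted mean.

    trajectories: shape (n_models, n_steps, 2), one (x, y) path per model.
    confidences:  shape (n_models,), non-negative scores (assumed input).
    """
    trajectories = np.asarray(trajectories, dtype=float)
    confidences = np.asarray(confidences, dtype=float)
    total = confidences.sum()
    if total <= 0:
        raise ValueError("confidences must sum to a positive value")
    weights = confidences / total
    # Contract the model axis: weighted sum over models at each time step.
    return np.tensordot(weights, trajectories, axes=1)

# Two models that disagree on the lateral offset; the first is trusted 3:1.
paths = [[[0.0, 0.0], [1.0, 0.0]],
         [[0.0, 2.0], [1.0, 2.0]]]
fused = confidence_weighted_average(paths, [3.0, 1.0])
```

No retraining is involved: the fusion only reads the outputs and confidences of models that are already trained.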
Abstract:Developing 3D semantic occupancy prediction models often relies on dense 3D annotations for supervised learning, a process that is both labor- and resource-intensive, underscoring the need for label-efficient or even label-free approaches. To address this, we introduce MinkOcc, a multi-modal 3D semantic occupancy prediction framework for cameras and LiDARs that uses a two-step semi-supervised training procedure. First, a small dataset with explicit 3D annotations warm-starts the training process; then, supervision continues with simpler-to-annotate accumulated LiDAR sweeps and images, semantically labelled through vision foundation models. MinkOcc effectively utilizes these sensor-rich supervisory cues and reduces reliance on manual labeling by 90% while maintaining competitive accuracy. In addition, the proposed model incorporates information from LiDAR and camera data through early fusion and leverages sparse convolution networks for real-time prediction. With its efficiency in both supervision and computation, we aim to extend MinkOcc beyond curated datasets, enabling broader real-world deployment of 3D semantic occupancy prediction in autonomous driving.
Abstract:This paper proposes a task-oriented co-design framework that integrates communication, computing, and control to address the key challenges of bandwidth limitations, noise interference, and latency in mission-critical industrial Cyber-Physical Systems (CPS). To improve communication efficiency and robustness, we design a task-oriented Joint Source-Channel Coding (JSCC) using Information Bottleneck (IB) to enhance data transmission efficiency by prioritizing task-specific information. To mitigate the perceived End-to-End (E2E) delays, we develop a Delay-Aware Trajectory-Guided Control Prediction (DTCP) strategy that integrates trajectory planning with control prediction, predicting commands based on E2E delay. Moreover, the DTCP is co-designed with task-oriented JSCC, focusing on transmitting task-specific information for timely and reliable autonomous driving. Experimental results in the CARLA simulator demonstrate that, under an E2E delay of 1 second (20 time slots), the proposed framework achieves a driving score of 48.12, which is 31.59 points higher than using Better Portable Graphics (BPG) while reducing bandwidth usage by 99.19%.
Abstract:For several tasks, ranging from manipulation to inspection, it is beneficial for robots to localize a target object in their surroundings. In this paper, we propose an approach that utilizes coarse point clouds obtained from miniaturized VL53L5CX Time-of-Flight (ToF) sensors (tiny lidars) to localize a target object in the robot's workspace. We first conduct an experimental campaign to calibrate the dependency of sensor readings on relative range and orientation to targets. We then propose a probabilistic sensor model that is validated in an object pose estimation task using a Particle Filter (PF). The results show that the proposed sensor model improves the performance of the localization of the target object with respect to two baselines: one that assumes measurements are free from uncertainty and one in which the confidence is provided by the sensor datasheet.
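The role of a calibrated sensor model inside a particle filter can be sketched as below. This is an assumption-laden toy, not the paper's model: the noise here depends on range only through a made-up linear `range_sigma`, whereas the abstract describes a dependency on both range and orientation, and all parameter values are hypothetical.

```python
import numpy as np

def range_sigma(r, base=0.01, slope=0.05):
    # Hypothetical noise model: standard deviation grows linearly with range.
    return base + slope * r

def pf_update(particles, weights, measured_range, sensor_pos):
    """Measurement update: reweight particles by a Gaussian range likelihood."""
    predicted = np.linalg.norm(particles - sensor_pos, axis=1)
    sigma = range_sigma(predicted)
    likelihood = np.exp(-0.5 * ((measured_range - predicted) / sigma) ** 2) / sigma
    weights = weights * likelihood
    total = weights.sum()
    if total == 0.0:
        # Every particle is inconsistent with the reading; fall back to uniform.
        return np.full(len(weights), 1.0 / len(weights))
    return weights / total

def resample(particles, weights, rng):
    """Systematic resampling: draw particles in proportion to their weights."""
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    cumulative = np.cumsum(weights)
    cumulative[-1] = 1.0  # guard against floating-point round-off
    idx = np.searchsorted(cumulative, positions)
    return particles[idx], np.full(n, 1.0 / n)

# Two candidate target positions; the sensor at the origin reads a range of 1 m.
particles = np.array([[1.0, 0.0], [2.0, 0.0]])
weights = pf_update(particles, np.array([0.5, 0.5]), 1.0, np.zeros(2))
resampled, uniform = resample(particles, weights, np.random.default_rng(0))
```

Replacing `range_sigma` with a constant, or with a datasheet value, would correspond loosely to the two baselines mentioned in the abstract.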
Abstract:This study explores the potential of off-the-shelf Vision-Language Models (VLMs) for high-level robot planning in the context of autonomous navigation. While most existing learning-based approaches to path planning require extensive task-specific training or fine-tuning, we demonstrate how such training can be avoided for most practical cases. To do this, we introduce Select2Plan (S2P), a novel training-free framework for high-level robot planning which completely eliminates the need for fine-tuning or specialised training. By leveraging structured Visual Question-Answering (VQA) and In-Context Learning (ICL), our approach drastically reduces the need for data collection, requiring a fraction of the task-specific data typically used by trained models, or even relying only on online data. Our method facilitates the effective use of a generally trained VLM in a flexible and cost-efficient way, and does not require additional sensing except for a simple monocular camera. We demonstrate its adaptability across various scene types, context sources, and sensing setups. We evaluate our approach in two distinct scenarios: traditional First-Person View (FPV) and infrastructure-driven Third-Person View (TPV) navigation, demonstrating the flexibility and simplicity of our method. Our technique enhances the navigational capabilities of a baseline VLM by approximately 50% in the TPV scenario, and is comparable to trained models in the FPV one, with as few as 20 demonstrations.




Abstract:This paper proposes a method for on-demand scenario generation in simulation, grounded in real-world data. Evaluating the behaviour of Autonomous Vehicles (AVs) in both safety-critical and regular scenarios is essential for assessing their robustness before real-world deployment. By integrating scenarios derived from real-world datasets into the simulation, we enhance the plausibility and validity of testing sets. This work introduces a novel approach that employs temporal scene graphs to capture evolving spatiotemporal relationships among scene entities from a real-world dataset, enabling the generation of dynamic scenarios in simulation through Graph Neural Networks (GNNs). User-defined action and criticality conditioning are used to ensure flexible, tailored scenario creation. Our model significantly outperforms the benchmarks in accurately predicting links corresponding to the requested scenarios. We further evaluate the validity and compatibility of our generated scenarios in an off-the-shelf simulator.




Abstract:This study explores the intersection of neural networks and classical robotics algorithms through the Neural Algorithmic Reasoning (NAR) framework, which allows neural networks to be trained to reason like classical robotics algorithms by learning to execute them. Algorithms are integral to robotics and safety-critical applications because their performance is predictable and consistent, grounded in logical and mathematical principles. In contrast, while neural networks are highly adaptable, handling complex, high-dimensional data and generalising across tasks, they often lack interpretability and transparency in their internal computations. We propose a Graph Neural Network (GNN)-based learning framework, NAR-*ICP, which learns the intermediate algorithmic steps of classical ICP-based pointcloud registration algorithms, and extend the CLRS Algorithmic Reasoning Benchmark with classical robotics perception algorithms. We evaluate our approach across diverse datasets, from real-world to synthetic, demonstrating its flexibility in handling complex and noisy inputs, along with its potential to be used as part of a larger learning system. Our results indicate that our method achieves superior performance across all benchmarks and datasets, consistently surpassing even the algorithms it has been trained on, further demonstrating its ability to generalise beyond the capabilities of traditional algorithms.
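For context, the intermediate steps of a classical point-to-point ICP iteration (correspondence search followed by a closed-form rigid alignment) can be sketched as below. This is a textbook-style illustration of the classical algorithm only, not NAR-*ICP; the function name and the brute-force nearest-neighbour search are choices made for this example.

```python
import numpy as np

def icp_step(source, target):
    """One point-to-point ICP iteration on (N, d) and (M, d) point arrays."""
    # Step 1: correspondences, nearest target point for each source point.
    d2 = ((source[:, None, :] - target[None, :, :]) ** 2).sum(axis=2)
    matched = target[d2.argmin(axis=1)]
    # Step 2: closed-form rigid alignment of source onto matched (SVD / Kabsch).
    mu_s, mu_t = source.mean(axis=0), matched.mean(axis=0)
    H = (source - mu_s).T @ (matched - mu_t)
    U, _, Vt = np.linalg.svd(H)
    D = np.eye(source.shape[1])
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    D[-1, -1] = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ D @ U.T
    t = mu_t - R @ mu_s
    return R, t, matched

# Small known motion, so nearest neighbours are the true correspondences.
theta = 0.05
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
t_true = np.array([0.1, -0.05])
source = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 3.0], [4.0, 4.0]])
target = source @ R_true.T + t_true
R_est, t_est, _ = icp_step(source, target)
```

In full ICP these two steps are repeated, applying `source @ R.T + t` after each iteration, until the alignment stops changing.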
Abstract:This paper investigates the utility of Neural Radiance Fields (NeRF) models in extending the regions of operation of a mobile robot controlled by Image-Based Visual Servoing (IBVS) via static CCTV cameras. Using NeRF as a 3D-representation prior, the robot's footprint may be extrapolated geometrically and used to train a CNN-based network to extract it online from the robot's appearance alone. The resulting footprint gives a tighter bound than a robot-wide bounding box, allowing the robot's controller to prescribe more optimal trajectories and expand its safe operational floor area.




Abstract:We describe a challenging robotics deployment in a complex ecosystem to monitor a rich plant community. The study site is dominated by dynamic grassland vegetation and is thus visually ambiguous and liable to drastic appearance change over the course of a day and especially through the growing season. This dynamism and complexity in appearance seriously impact the stability of the robotics platform, as localisation is a foundational part of that control loop, and so routes must be carefully taught and retaught until autonomy is robust and repeatable. Our system is demonstrated over a 6-week period monitoring the response of grass species to experimental climate change manipulations. We also discuss the applicability of our pipeline to monitor biodiversity in other complex natural settings.