Gabriel Villalonga

Addressing the Waypoint-Action Gap in End-to-End Autonomous Driving via Vehicle Motion Models

Feb 09, 2026

Jorge Daniel Rodríguez-Vidal, Gabriel Villalonga, Diego Porres, Antonio M. López Peña

Abstract: End-to-End Autonomous Driving (E2E-AD) systems are typically grouped by the nature of their outputs: (i) waypoint-based models that predict a future trajectory, and (ii) action-based models that directly output throttle, steer and brake. Most recent benchmark protocols and training pipelines are waypoint-based, which makes action-based policies harder to train and compare, slowing their progress. To bridge this waypoint-action gap, we propose a novel, differentiable vehicle-model framework that rolls out predicted action sequences to their corresponding ego-frame waypoint trajectories while supervising in waypoint space. Our approach enables action-based architectures to be trained and evaluated, for the first time, within waypoint-based benchmarks without modifying the underlying evaluation protocol. We extensively evaluate our framework across multiple challenging benchmarks and observe consistent improvements over the baselines. In particular, on NAVSIM navhard our approach achieves state-of-the-art performance. Our code will be made publicly available upon acceptance.

* 8 pages, 3 figures
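
The rollout idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a kinematic bicycle model, and the function names (`rollout_actions`, `waypoint_l1_loss`) and all parameters (`dt`, `wheelbase`, `max_accel`, `max_brake`, `max_steer`) are hypothetical values chosen for illustration. It is written in plain Python for readability; the same arithmetic expressed in an autodiff framework would let a waypoint-space loss propagate gradients back to the predicted actions.

```python
import math

def rollout_actions(actions, v0=0.0, dt=0.5, wheelbase=2.9,
                    max_accel=3.0, max_brake=6.0,
                    max_steer=math.radians(35)):
    """Roll out (throttle, steer, brake) actions into ego-frame (x, y) waypoints.

    Assumed kinematic bicycle model; throttle and brake lie in [0, 1],
    steer in [-1, 1]. All physical constants are illustrative guesses.
    """
    x, y, yaw, v = 0.0, 0.0, 0.0, v0  # ego frame: start at the origin, heading +x
    waypoints = []
    for throttle, steer, brake in actions:
        # Map normalized pedals to a longitudinal acceleration.
        accel = throttle * max_accel - brake * max_brake
        # Update speed first (no reversing), then heading, then position.
        v = max(0.0, v + accel * dt)
        yaw += v / wheelbase * math.tan(steer * max_steer) * dt
        x += v * math.cos(yaw) * dt
        y += v * math.sin(yaw) * dt
        waypoints.append((x, y))
    return waypoints

def waypoint_l1_loss(pred, target):
    """Mean L1 distance between predicted and target waypoints."""
    total = sum(abs(px - tx) + abs(py - ty)
                for (px, py), (tx, ty) in zip(pred, target))
    return total / len(pred)

# Usage: full throttle, no steering, no brake, starting from rest.
straight = rollout_actions([(1.0, 0.0, 0.0)] * 2)
```

Supervising in waypoint space then amounts to comparing the rolled-out trajectory with the ground-truth waypoints, for example via `waypoint_l1_loss(rollout_actions(actions), target_waypoints)`.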

Guiding Attention in End-to-End Driving Models

Apr 30, 2024

Diego Porres, Yi Xiao, Gabriel Villalonga, Alexandre Levy, Antonio M. López

Abstract: Vision-based end-to-end driving models trained by imitation learning can lead to affordable solutions for autonomous driving. However, training well-performing models usually requires a huge amount of data, and the resulting models still lack explicit and intuitive activation maps that reveal their inner workings while driving. In this paper, we study how to guide the attention of these models to improve their driving quality and obtain more intuitive activation maps by adding a loss term during training using salient semantic maps. In contrast to previous work, our method neither requires these salient semantic maps to be available at test time nor requires modifying the architecture of the model to which it is applied. We perform tests using both perfect and noisy salient semantic maps, the latter inspired by possible errors encountered with real data, with encouraging results in both cases. Using CIL++ as a representative state-of-the-art model and the CARLA simulator with its standard benchmarks, we conduct experiments that show the effectiveness of our method in training better autonomous driving models, especially when data and computational resources are scarce.

* Accepted for publication at the 35th IEEE Intelligent Vehicles Symposium (IV 2024)

Co-Training for Unsupervised Domain Adaptation of Semantic Segmentation Models

May 31, 2022

Jose L. Gómez, Gabriel Villalonga, Antonio M. López

Abstract: Semantic image segmentation is addressed by training deep models. Since supervised training is burdened by the cost of human-based image labeling, using synthetic images with automatically generated ground truth together with unlabeled real-world images is a promising alternative. This requires addressing an unsupervised domain adaptation (UDA) problem. In this paper, we propose a new co-training process for synth-to-real UDA of semantic segmentation models. First, we design a self-training procedure which provides two initial models. Then, we keep training these models in a collaborative manner to obtain the final model. The overall process treats the deep models as black boxes and drives their collaboration at the level of pseudo-labeled target images, i.e., it requires neither modifying loss functions nor explicit feature alignment. We test our proposal on standard synthetic and real-world datasets. Our co-training shows improvements of 15-20 percentage points of mIoU over baselines, thus establishing new state-of-the-art results.

Co-training for Deep Object Detection: Comparing Single-modal and Multi-modal Approaches

Apr 23, 2021

Jose L. Gómez, Gabriel Villalonga, Antonio M. López

Abstract: Top-performing computer vision models are powered by convolutional neural networks (CNNs). Training an accurate CNN highly depends on both the raw sensor data and their associated ground truth (GT). Collecting such GT is usually done through human labeling, which is time-consuming and does not scale as we wish. This data labeling bottleneck may be intensified by domain shifts among image sensors, which could force per-sensor data labeling. In this paper, we focus on the use of co-training, a semi-supervised learning (SSL) method, for obtaining self-labeled object bounding boxes (BBs), i.e., the GT to train deep object detectors. In particular, we assess the effectiveness of multi-modal co-training by relying on two different views of an image, namely, appearance (RGB) and estimated depth (D). Moreover, we compare appearance-based single-modal co-training with multi-modal co-training. Our results suggest that in a standard SSL setting (no domain shift, few human-labeled data) and under virtual-to-real domain shift (many virtual-world labeled data, no human-labeled data) multi-modal co-training outperforms single-modal. In the latter case, when GAN-based domain translation is performed, both co-training modalities are on par, at least when using an off-the-shelf depth estimation model not specifically trained on the translated images.

Co-training for On-board Deep Object Detection

Aug 12, 2020

Gabriel Villalonga, Antonio M. Lopez

Abstract: Providing ground truth supervision to train visual models has been a bottleneck over the years, exacerbated by domain shifts which degrade the performance of such models. This was the case when visual tasks relied on handcrafted features and shallow machine learning and, despite the unprecedented performance gains of deep learning, the problem remains open within that paradigm due to its data-hungry nature. The best-performing deep vision-based object detectors are trained in a supervised manner by relying on human-labeled bounding boxes which localize class instances (i.e., objects) within the training images. Thus, object detection is one of the tasks for which human labeling is a major bottleneck. In this paper, we assess co-training as a semi-supervised learning method for self-labeling objects in unlabeled images, thereby reducing the human-labeling effort for developing deep object detectors. Our study pays special attention to a scenario involving domain shift; in particular, when we have automatically generated virtual-world images with object bounding boxes and we have real-world images which are unlabeled. Moreover, we are particularly interested in using co-training for deep object detection in the context of driver assistance systems and/or self-driving vehicles. Thus, using well-established datasets and protocols for object detection in these application contexts, we show that co-training is a paradigm worth pursuing for alleviating object labeling, working both alone and together with task-agnostic domain adaptation.

Temporal Coherence for Active Learning in Videos

Aug 30, 2019

Javad Zolfaghari Bengar, Abel Gonzalez-Garcia, Gabriel Villalonga, Bogdan Raducanu, Hamed H. Aghdam, Mikhail Mozerov, Antonio M. Lopez, Joost van de Weijer

Abstract: Autonomous driving systems require huge amounts of data to train. Manual annotation of this data is time-consuming and prohibitively expensive since it involves human resources. Therefore, active learning emerged as an alternative to ease this effort and to make data annotation more manageable. In this paper, we introduce a novel active learning approach for object detection in videos by exploiting temporal coherence. Our active learning criterion is based on the estimated number of errors in terms of false positives and false negatives. The detections obtained by the object detector are used to define the nodes of a graph and tracked forward and backward to temporally link the nodes. Minimizing an energy function defined on this graphical model provides estimates of both false positives and false negatives. Additionally, we introduce a synthetic video dataset, called SYNTHIA-AL, specially designed to evaluate active learning for video object detection in road scenes. Finally, we show that our approach outperforms active learning baselines tested on two datasets.

* Accepted at ICCVW 2019 (CVRSUAD-Road Scene Understanding and Autonomous Driving)
