In the rapidly evolving landscape of autonomous driving, the capability to accurately predict future events and assess their implications is paramount for both safety and efficiency, critically aiding the decision-making process. World models have emerged as a transformative approach, enabling autonomous driving systems to synthesize and interpret vast amounts of sensor data, thereby predicting potential future scenarios and compensating for information gaps. This paper provides an initial review of the current state and prospective advancements of world models in autonomous driving, spanning their theoretical underpinnings, practical applications, and the ongoing research efforts aimed at overcoming existing limitations. Highlighting the significant role of world models in advancing autonomous driving technologies, this survey aspires to serve as a foundational reference for the research community, facilitating swift access to and comprehension of this burgeoning field, and inspiring continued innovation and exploration.
In autonomous vehicle (AV) technology, the ability to accurately predict the movements of surrounding vehicles is paramount for ensuring safety and operational efficiency. Incorporating human decision-making insights enables AVs to more effectively anticipate the potential actions of other vehicles, significantly improving prediction accuracy and responsiveness in dynamic environments. This paper introduces the Human-Like Trajectory Prediction (HLTP) model, which adopts a teacher-student knowledge distillation framework inspired by human cognitive processes. The HLTP model incorporates a sophisticated teacher-student knowledge distillation framework. The "teacher" model, equipped with an adaptive visual sector, mimics the visual processing of the human brain, particularly the functions of the occipital and temporal lobes. The "student" model focuses on real-time interaction and decision-making, drawing parallels to prefrontal and parietal cortex functions. This approach allows for dynamic adaptation to changing driving scenarios, capturing essential perceptual cues for accurate prediction. Evaluated using the Macao Connected and Autonomous Driving (MoCAD) dataset, along with the NGSIM and HighD benchmarks, HLTP demonstrates superior performance compared to existing models, particularly in challenging environments with incomplete data. The project page is available at Github.
Predicting the trajectories of surrounding agents is still considered one of the most challenging tasks for autonomous driving. In this paper, we introduce a multi-modal trajectory prediction framework based on the transformer network. The semantic maps of each agent are used as inputs to convolutional networks to automatically derive relevant contextual information. A novel auxiliary loss that penalizes unfeasible off-road predictions is also proposed in this study. Experiments on the Lyft l5kit dataset show that the proposed model achieves state-of-the-art performance, substantially improving the accuracy and feasibility of the prediction outcomes.
In the burgeoning field of autonomous vehicles (AVs), trajectory prediction remains a formidable challenge, especially in mixed autonomy environments. Traditional approaches often rely on computational methods such as time-series analysis. Our research diverges significantly by adopting an interdisciplinary approach that integrates principles of human cognition and observational behavior into trajectory prediction models for AVs. We introduce a novel "adaptive visual sector" mechanism that mimics the dynamic allocation of attention human drivers exhibit based on factors like spatial orientation, proximity, and driving speed. Additionally, we develop a "dynamic traffic graph" using Convolutional Neural Networks (CNN) and Graph Attention Networks (GAT) to capture spatio-temporal dependencies among agents. Benchmark tests on the NGSIM, HighD, and MoCAD datasets reveal that our model (GAVA) outperforms state-of-the-art baselines by at least 15.2%, 19.4%, and 12.0%, respectively. Our findings underscore the potential of leveraging human cognition principles to enhance the proficiency and adaptability of trajectory prediction algorithms in AVs. The code for the proposed model is available at our Github.
The ability to accurately predict the trajectory of surrounding vehicles is a critical hurdle to overcome on the journey to fully autonomous vehicles. To address this challenge, we pioneer a novel behavior-aware trajectory prediction model (BAT) that incorporates insights and findings from traffic psychology, human behavior, and decision-making. Our model consists of behavior-aware, interaction-aware, priority-aware, and position-aware modules that perceive and understand the underlying interactions and account for uncertainty and variability in prediction, enabling higher-level learning and flexibility without rigid categorization of driving behavior. Importantly, this approach eliminates the need for manual labeling in the training process and addresses the challenges of non-continuous behavior labeling and the selection of appropriate time windows. We evaluate BAT's performance across the Next Generation Simulation (NGSIM), Highway Drone (HighD), Roundabout Drone (RounD), and Macao Connected Autonomous Driving (MoCAD) datasets, showcasing its superiority over prevailing state-of-the-art (SOTA) benchmarks in terms of prediction accuracy and efficiency. Remarkably, even when trained on reduced portions of the training data (25%), our model outperforms most of the baselines, demonstrating its robustness and efficiency in predicting vehicle trajectories, and the potential to reduce the amount of data required to train autonomous vehicles, especially in corner cases. In conclusion, the behavior-aware model represents a significant advancement in the development of autonomous vehicles capable of predicting trajectories with the same level of proficiency as human drivers. The project page is available at https://github.com/Petrichor625/BATraj-Behavior-aware-Model.
In the field of autonomous vehicles (AVs), accurately discerning commander intent and executing linguistic commands within a visual context presents a significant challenge. This paper introduces a sophisticated encoder-decoder framework, developed to address visual grounding in AVs.Our Context-Aware Visual Grounding (CAVG) model is an advanced system that integrates five core encoders-Text, Image, Context, and Cross-Modal-with a Multimodal decoder. This integration enables the CAVG model to adeptly capture contextual semantics and to learn human emotional features, augmented by state-of-the-art Large Language Models (LLMs) including GPT-4. The architecture of CAVG is reinforced by the implementation of multi-head cross-modal attention mechanisms and a Region-Specific Dynamic (RSD) layer for attention modulation. This architectural design enables the model to efficiently process and interpret a range of cross-modal inputs, yielding a comprehensive understanding of the correlation between verbal commands and corresponding visual scenes. Empirical evaluations on the Talk2Car dataset, a real-world benchmark, demonstrate that CAVG establishes new standards in prediction accuracy and operational efficiency. Notably, the model exhibits exceptional performance even with limited training data, ranging from 50% to 75% of the full dataset. This feature highlights its effectiveness and potential for deployment in practical AV applications. Moreover, CAVG has shown remarkable robustness and adaptability in challenging scenarios, including long-text command interpretation, low-light conditions, ambiguous command contexts, inclement weather conditions, and densely populated urban environments. The code for the proposed model is available at our Github.
Anomaly detection in images plays a significant role for many applications across all industries, such as disease diagnosis in healthcare or quality assurance in manufacturing. Manual inspection of images, when extended over a monotonously repetitive period of time is very time consuming and can lead to anomalies being overlooked.Artificial neural networks have proven themselves very successful on simple, repetitive tasks, in some cases even outperforming humans. Therefore, in this paper we investigate different methods of deep learning, including supervised and unsupervised learning, for anomaly detection applied to a quality assurance use case. We utilize the MVTec anomaly dataset and develop three different models, a CNN for supervised anomaly detection, KD-CAE for autoencoder anomaly detection, NI-CAE for noise induced anomaly detection and a DCGAN for generating reconstructed images. By experiments, we found that KD-CAE performs better on the anomaly datasets compared to CNN and NI-CAE, with NI-CAE performing the best on the Transistor dataset. We also implemented a DCGAN for the creation of new training data but due to computational limitation and lack of extrapolating the mechanics of AnoGAN, we restricted ourselves just to the generation of GAN based images. We conclude that unsupervised methods are more powerful for anomaly detection in images, especially in a setting where only a small amount of anomalous data is available, or the data is unlabeled.
Inefficient traffic signal control methods may cause numerous problems, such as traffic congestion and waste of energy. Reinforcement learning (RL) is a trending data-driven approach for adaptive traffic signal control in complex urban traffic networks. Although the development of deep neural networks (DNN) further enhances its learning capability, there are still some challenges in applying deep RLs to transportation networks with multiple signalized intersections, including non-stationarity environment, exploration-exploitation dilemma, multi-agent training schemes, continuous action spaces, etc. In order to address these issues, this paper first proposes a multi-agent deep deterministic policy gradient (MADDPG) method by extending the actor-critic policy gradient algorithms. MADDPG has a centralized learning and decentralized execution paradigm in which critics use additional information to streamline the training process, while actors act on their own local observations. The model is evaluated via simulation on the Simulation of Urban MObility (SUMO) platform. Model comparison results show the efficiency of the proposed algorithm in controlling traffic lights.
Inefficient traffic control may cause numerous problems such as traffic congestion and energy waste. This paper proposes a novel multi-agent reinforcement learning method, named KS-DDPG (Knowledge Sharing Deep Deterministic Policy Gradient) to achieve optimal control by enhancing the cooperation between traffic signals. By introducing the knowledge-sharing enabled communication protocol, each agent can access to the collective representation of the traffic environment collected by all agents. The proposed method is evaluated through two experiments respectively using synthetic and real-world datasets. The comparison with state-of-the-art reinforcement learning-based and conventional transportation methods demonstrate the proposed KS-DDPG has significant efficiency in controlling large-scale transportation networks and coping with fluctuations in traffic flow. In addition, the introduced communication mechanism has also been proven to speed up the convergence of the model without significantly increasing the computational burden.