Abstract:Accurate motion forecasting is critical for safe and efficient autonomous driving, enabling vehicles to predict future trajectories and make informed decisions in complex traffic scenarios. Most of the current designs of motion prediction models are based on the major representation of lane centerlines, which limits their capability to capture critical road environments and traffic rules and constraints. In this work, we propose an enhanced motion forecasting model informed by multiple vector map elements, including lane boundaries and road edges, that facilitates a richer and more complete representation of driving environments. An effective feature fusion strategy is developed to merge information in different vector map components, where the model learns holistic information on road structures and their interactions with agents. Since encoding more information about the road environment increases memory usage and is computationally expensive, we developed an effective pruning mechanism that filters the most relevant map connections to the target agent, ensuring computational efficiency while maintaining essential spatial and semantic relationships for accurate trajectory prediction. Overcoming the limitations of lane centerline-based models, our method provides a more informative and efficient representation of the driving environment and advances the state of the art for autonomous vehicle motion forecasting. We verify our approach with extensive experiments on the Argoverse 2 motion forecasting dataset, where our method maintains competitiveness on AV2 while achieving improved performance. Index Terms-Autonomous driving, trajectory prediction, vector map elements, road topology, connection pruning, Argoverse 2.
Abstract:Self-supervised learning can resolve numerous image or linguistic processing problems when effectively trained. This study investigated simple yet efficient methods for adaping previously learned foundation models for open-vocabulary semantic segmentation tasks. Our research proposed "Beyond-Labels," a lightweight transformer-based fusion module that uses a handful of image segmentation data to fuse frozen image representations with language concepts. Furthermore, we efficiently captured positional information in images using Fourier embeddings, thus improving the generalization across various image sizes. Extensive ablation tests were performed to investigate the important components of our proposed method; when tested against the common benchmark PASCAL-5i, it demonstrated superior performance despite being trained on frozen image and language characteristics.