Abstract:Datasets pertaining to autonomous vehicles (AVs) hold significant promise for a range of research fields, including artificial intelligence (AI), autonomous driving, and transportation engineering. Nonetheless, these datasets often encounter challenges related to the states of traffic signals, such as missing or inaccurate data. Such issues can compromise the reliability of the datasets and adversely affect the performance of models developed using them. This research introduces a fully automated approach designed to tackle these issues by utilizing available vehicle trajectory data alongside knowledge from the transportation domain to effectively impute and rectify traffic signal information within the Waymo Open Motion Dataset (WOMD). The proposed method is robust and flexible, capable of handling diverse intersection geometries and traffic signal configurations in real-world scenarios. Comprehensive validations have been conducted on the entire WOMD, focusing on over 360,000 relevant scenarios involving traffic signals, out of a total of 530,000 real-world driving scenarios. In the original dataset, 71.7% of traffic signal states are either missing or unknown, all of which were successfully imputed by our proposed method. Furthermore, in the absence of ground-truth signal states, the accuracy of our approach is evaluated based on the rate of red-light violations among vehicle trajectories. Results show that our method reduces the estimated red-light running rate from 15.7% in the original data to 2.9%, thereby demonstrating its efficacy in rectifying data inaccuracies. This paper significantly enhances the quality of AV datasets, contributing to the wider AI and AV research communities and benefiting various downstream applications. The code and improved traffic signal data are open-sourced at https://github.com/michigan-traffic-lab/WOMD-Traffic-Signal-Data-Improvement