There are increasing concerns about malicious attacks on autonomous vehicles. In particular, inaudible voice command attacks pose a significant threat as voice commands become available in autonomous driving systems. How to empirically defend against these inaudible attacks remains an open question. Previous research investigates utilizing deep learning-based multimodal fusion for defense, without considering the model uncertainty in trustworthiness. As deep learning has been applied to increasingly sensitive tasks, uncertainty measurement is crucial in helping improve model robustness, especially in mission-critical scenarios. In this paper, we propose the Multimodal Fusion Framework (MFF) as an intelligent security system to defend against inaudible voice command attacks. MFF fuses heterogeneous audio-vision modalities using VGG family neural networks and achieves the detection accuracy of 92.25% in the comparative fusion method empirical study. Additionally, extensive experiments on audio-vision tasks reveal the model's uncertainty. Using Expected Calibration Errors, we measure calibration errors and Monte-Carlo Dropout to estimate the predictive distribution for the proposed models. Our findings show empirically to train robust multimodal models, improve standard accuracy and provide a further step toward interpretability. Finally, we discuss the pros and cons of our approach and its applicability for Advanced Driver Assistance Systems.
With recent advances in autonomous driving, Voice Control Systems have become increasingly adopted as human-vehicle interaction methods. This technology enables drivers to use voice commands to control the vehicle and will be soon available in Advanced Driver Assistance Systems (ADAS). Prior work has shown that Siri, Alexa and Cortana, are highly vulnerable to inaudible command attacks. This could be extended to ADAS in real-world applications and such inaudible command threat is difficult to detect due to microphone nonlinearities. In this paper, we aim to develop a more practical solution by using camera views to defend against inaudible command attacks where ADAS are capable of detecting their environment via multi-sensors. To this end, we propose a novel multimodal deep learning classification system to defend against inaudible command attacks. Our experimental results confirm the feasibility of the proposed defense methods and the best classification accuracy reaches 89.2%. Code is available at https://github.com/ITSEG-MQ/Sensor-Fusion-Against-VoiceCommand-Attacks.