For autonomous vehicles to navigate safely in urban environments, they must be able to predict the possible future behaviors of surrounding vehicles so that dangerous situations can be avoided in advance. Behavior anticipation relies mainly on two tightly linked cues: the recent motions of surrounding agents and scene information. The configuration of the agents can reveal which parts of the scene are important, while the scene structure determines which agents are influential. To better capture this correlation, we apply multi-head attention to a joint agent and map context. Moreover, to account for the uncertainty of the future, we use an efficient multi-modal probabilistic trajectory prediction model that learns to extract different joint context features and generates diverse possible trajectories accordingly in a single forward pass. Results on the publicly available nuScenes dataset show that our model matches the performance of existing methods and generates diverse possible future trajectories that comply with the scene structure. Most importantly, visualizing the attention maps reveals some of the underlying prediction logic of our approach, which increases its interpretability and its reliability for real-world deployment.