Online videos have been generated at an unprecedented rate in recent years. As a result, generating personalized recommendations from this large volume of videos has become increasingly challenging. In this paper, we propose extracting non-textual content from the videos themselves to enhance personalized video recommendation. This change in content type leads us to study three issues. The first issue is which non-textual contents are helpful. Because users are attracted to different aspects of videos, we extract, encode, and transform multiple audio and visual features to represent video content in a recommender system for the first time. The second issue is how to use the non-textual content to generate accurate personalized recommendations. We reproduce existing methods and find that they do not perform well with non-textual content because of a mismatch between the features and the learning methods. To address this problem, we propose a new method. Our experiments show that the proposed method is more accurate than the existing methods whether the video content features are non-textual or textual.