Heart rate (HR) is an important physiological signal that reflects the physical and emotional activities of humans. Traditional HR measurements are mainly based on contact monitors, which are inconvenient and may cause discomfort for the subjects. Recently, methods have been proposed for remote HR estimation from face videos. However, most of the existing methods focus on well-controlled scenarios, their generalization ability into less-constrained scenarios are not known. At the same time, lacking large-scale databases has limited the use of deep representation learning methods in remote HR estimation. In this paper, we introduce a large-scale multi-modal HR database (named as VIPL-HR), which contains 2,451 visible light videos (VIS) and 752 near-infrared (NIR) videos of 107 subjects. Our VIPL-HR database also contains various variations such as head movements, illumination variations, and acquisition device changes. We also learn a deep HR estimator (named as RhythmNet) with the proposed spatial-temporal representation, which achieves promising results on both the public-domain and our VIPL-HR HR estimation databases. We would like to put the VIPL-HR database into the public domain.