Imagining multiple consecutive frames from a single snapshot is challenging, since it is difficult to simultaneously predict diverse motions from one image and faithfully generate novel frames without visual distortions. In this work, we leverage an unsupervised variational model to learn rich motion patterns in the form of long-term bi-directional flow fields, and apply the predicted flows to generate high-quality video sequences. In contrast to the state-of-the-art approach, our method does not require external flow supervision for learning. This is achieved through a novel module that predicts bi-directional flows from a single image. In addition, with a bi-directional flow consistency check, our method can handle occlusion and warping artifacts in a principled manner. Our method can be trained end-to-end on arbitrarily sampled natural video clips, and it is able to capture multi-modal motion uncertainty and synthesize photo-realistic novel sequences. Quantitative and qualitative evaluations on synthetic and real-world datasets demonstrate the effectiveness of the proposed approach over state-of-the-art methods.
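
To make the idea of a bi-directional flow consistency check concrete, the following is a minimal sketch of one common forward-backward formulation, not necessarily the exact check used in this work. It assumes dense flows stored as NumPy arrays of shape (H, W, 2) holding (dx, dy) displacements, nearest-neighbor lookup instead of bilinear sampling, and a fixed pixel tolerance; the function name `occlusion_mask` and the parameter `tol` are hypothetical and chosen for illustration only.

```python
import numpy as np


def occlusion_mask(flow_fw, flow_bw, tol=1.0):
    """Flag pixels whose forward flow is not undone by the backward flow.

    flow_fw: (H, W, 2) displacements (dx, dy) from frame A to frame B.
    flow_bw: (H, W, 2) displacements (dx, dy) from frame B to frame A.
    tol:     assumed round-trip error tolerance, in pixels.
    Returns a boolean (H, W) mask, True where the check fails.
    """
    h, w = flow_fw.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]

    # Where each pixel of frame A lands in frame B (nearest-neighbor rounding).
    tx = np.rint(xs + flow_fw[..., 0]).astype(int)
    ty = np.rint(ys + flow_fw[..., 1]).astype(int)
    inside = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)

    # Clamp only so that indexing is safe; out-of-frame pixels are flagged below.
    tx_safe = np.clip(tx, 0, w - 1)
    ty_safe = np.clip(ty, 0, h - 1)
    bw_at_target = flow_bw[ty_safe, tx_safe]

    # For a consistent pixel, going forward and then backward returns to the
    # starting point, so the two displacements should approximately cancel.
    round_trip_error = np.linalg.norm(flow_fw + bw_at_target, axis=-1)
    return (round_trip_error > tol) | ~inside


# Toy example: every pixel moves one pixel to the right, and the backward
# flow moves every pixel one pixel to the left.
fw = np.zeros((4, 5, 2))
fw[..., 0] = 1.0
bw = np.zeros((4, 5, 2))
bw[..., 0] = -1.0
mask = occlusion_mask(fw, bw)
```

In this toy example only the rightmost column is flagged, because those pixels leave the frame. In a synthesis pipeline, a mask of this kind would typically be used to down-weight or in-paint warped pixels that have no reliable correspondence.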