Federated Learning (FL) is capable of leveraging massively distributed private data, e.g., on mobile phones and IoT devices, to collaboratively train a shared machine learning model with the help of a cloud server. However, its iterative training process results in intolerable communication latency, and causes huge burdens on the backbone network. Thus, reducing the communication overhead is critical to implement FL in practice. Meanwhile, the model performance degradation due to the unique non-IID data distribution at different devices is another big issue for FL. In this paper, by introducing the mobile edge computing platform as an intermediary structure, we propose a hierarchical FL architecture to reduce the communication rounds between users and the cloud. In particular, a Hierarchical Federated Averaging (HierFAVG) algorithm is proposed, which allows multiple local aggregations at each edge server before one global aggregation at the cloud. We establish the convergence of HierFAVG for both convex and non-convex objective functions with non-IID user data. It is demonstrated that HierFAVG can reach a desired model performance with less communication, and outperform the traditional Federated Averaging algorithm.