While training an end-to-end navigation network in the real world is usually costly, simulation provides a safe and cheap environment for this training stage. However, training neural network models in simulation raises the problem of how to effectively transfer the model from simulation to the real world (sim-to-real). We regard the environment representation as a crucial element in this transfer process. In this work, we propose a visual information pyramid (VIP) theory to systematically investigate a practical environment representation. Based on this theory, we establish a representation that synthesizes spatial and semantic information. Specifically, the spatial information is represented by a noise-model-assisted depth image, while the semantic information is expressed as a categorized detection image. To evaluate the effectiveness of this representation, we first extract different representations from the same dataset collected from expert operations, then use them to train the same or very similar neural networks, and finally evaluate the trained networks in simulated and real-world navigation tasks. Results suggest that our proposed environment representation performs best compared with representations commonly used in the literature. With only one hour of training data collected in simulation, the network model trained with our representation can successfully navigate the robot in various scenarios with obstacles. Furthermore, we analyze the feature maps to investigate this effectiveness through the network's internal responses, which could be instructive for future research on end-to-end navigation.