The task of radio map estimation aims to generate a dense representation of electromagnetic spectrum quantities, such as the received signal strength at each grid point within a geographic region, based on measurements from a subset of spatially distributed nodes (represented as pixels). Recently, deep vision models such as the U-Net have been adapted to radio map estimation, whose effectiveness can be guaranteed with sufficient spatial observations (typically 0.01% to 1% of pixels) in each map, to model local dependency of observed signal power. However, such a setting of sufficient measurements can be less practical in real-world scenarios, where extreme sparsity in spatial sampling can be widely encountered. To address this challenge, we propose RadioFormer, a novel multiple-granularity transformer designed to handle the constraints posed by spatial sparse observations. Our RadioFormer, through a dual-stream self-attention (DSA) module, can respectively discover the correlation of pixel-wise observed signal power and also learn patch-wise buildings' geometries in a style of multiple granularities, which are integrated into multi-scale representations of radio maps by a cross stream cross-attention (CCA) module. Extensive experiments on the public RadioMapSeer dataset demonstrate that RadioFormer outperforms state-of-the-art methods in radio map estimation while maintaining the lowest computational cost. Furthermore, the proposed approach exhibits exceptional generalization capabilities and robust zero-shot performance, underscoring its potential to advance radio map estimation in a more practical setting with very limited observation nodes.