Radio map estimation (RME), which predicts wireless signal metrics at unmeasured locations from sparse measurements, has attracted growing attention as a key enabler of intelligent wireless networks. Most existing RME techniques employ grid-based strategies to process sparse measurements, where the pursuit of accuracy incurs significant computational cost and leaves little flexibility for off-grid prediction. In contrast, grid-free approaches directly exploit coordinate features to capture location-specific spatial dependencies, enabling signal prediction at arbitrary locations without relying on predefined grids. However, current grid-free approaches demand substantial preprocessing overhead to construct the spatial representation, leading to high complexity and limited adaptability. To address these limitations, this paper proposes a novel grid-free, cross-attention-based transformer model for RME. We introduce a lightweight spatial embedding module that incorporates environmental knowledge into high-dimensional feature construction. A cross-attention transformer then models the spatial correlation between target and measurement points. Simulation results demonstrate that the proposed method reduces the root-mean-square error (RMSE) by up to 6%, outperforming both grid-based and grid-free baselines.
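To make the cross-attention idea concrete, the following is a minimal sketch of how a target location can attend over sparse measurements. It is not the proposed model: the sin/cos coordinate embedding `fourier_embed`, the identity query/key projections, and the sample coordinates and signal values are all assumptions made for illustration, and the environment-aware embedding and learned transformer layers are omitted.

```python
import math

def fourier_embed(xy, freqs=(1.0, 2.0, 4.0)):
    """Hypothetical coordinate embedding: sin/cos features of (x, y).
    Stands in for a learned spatial embedding, which is not reproduced here."""
    feats = []
    for c in xy:
        for f in freqs:
            feats.append(math.sin(f * c))
            feats.append(math.cos(f * c))
    return feats

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_predict(targets, meas_xy, meas_val):
    """Single-head cross-attention with identity projections (no learned weights).
    Queries come from target points; keys and values come from measurements."""
    keys = [fourier_embed(p) for p in meas_xy]
    d = len(keys[0])
    preds = []
    for t in targets:
        q = fourier_embed(t)
        # Scaled dot-product score between the target and every measurement.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Prediction is the attention-weighted average of the measured values.
        preds.append(sum(w * v for w, v in zip(weights, meas_val)))
    return preds

# Illustrative data (assumed): three measurements in dBm, two off-grid targets.
meas_xy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
meas_val = [-60.0, -70.0, -80.0]
preds = cross_attention_predict([(0.1, 0.1), (0.9, 0.0)], meas_xy, meas_val)
```

Because the targets enter only as queries, any coordinate can be evaluated without a predefined grid. With untrained identity projections this reduces to a kernel-weighted average of the measurements; a trained model would replace the embedding and projections with learned ones.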