In this paper, we build a two-stage Convolutional Neural Network (CNN) architecture that constructs inter- and intra-frame representations from an arbitrary number of images captured under different light directions, enabling accurate normal estimation for non-Lambertian objects. We experimentally investigate numerous network design alternatives to identify the best scheme for deploying inter-frame and intra-frame feature extraction modules for the photometric stereo problem. Moreover, we propose to use the easily obtained object mask to eliminate adverse interference from invalid background regions in intra-frame spatial convolutions, thereby effectively improving the accuracy of normal estimation for surfaces made of dark materials or affected by cast shadows. Experimental results demonstrate that the proposed masked two-stage photometric stereo CNN model (MT-PS-CNN) performs favorably against state-of-the-art photometric stereo techniques in terms of both accuracy and efficiency. In addition, the proposed method is capable of predicting accurate and richly detailed surface normals for non-Lambertian objects of complex geometry, and it remains stable for inputs captured under both sparse and dense lighting distributions.
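The role of the object mask in intra-frame spatial filtering can be illustrated with a minimal NumPy sketch. It is not the MT-PS-CNN implementation: the fixed 3x3 sum filter stands in for a learned spatial convolution, the max over frames is one assumed order-agnostic way to accept an arbitrary number of images, and the function names `box_filter3`, `masked_spatial_filter`, and `aggregate_frames` are hypothetical.

```python
import numpy as np


def box_filter3(x):
    """3x3 sum filter with zero padding (stand-in for a learned spatial convolution)."""
    h, w = x.shape
    padded = np.pad(x, 1)  # constant zero padding by default
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out


def masked_spatial_filter(frames, mask):
    """Filter each frame after zeroing background pixels.

    frames: array of shape (N, H, W), one image per light direction.
    mask:   array of shape (H, W), 1 on the object and 0 on the background.
    The output is masked again so background positions stay at zero.
    """
    m = mask.astype(float)
    return np.stack([box_filter3(f * m) * m for f in frames])


def aggregate_frames(features):
    """Assumed aggregation: element-wise max over the frame axis.

    The result has shape (H, W) regardless of N and does not depend on frame order.
    """
    return features.max(axis=0)
```

With the mask applied, the response at an object pixel near the silhouette depends only on neighboring object pixels, so bright or noisy background values cannot leak into it. Without the mask, the same filter would mix background intensities into every boundary pixel.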