Recently, coded masks have been used to demonstrate FlatCam, a thin form-factor lensless camera in which a mask is placed immediately on top of a bare image sensor. In this paper, we present an imaging model and an algorithm to jointly estimate the depth and intensity of a scene from one or more FlatCams. We use a light field representation to model the mapping of a 3D scene onto the sensor, in which light rays from different depths yield different modulation patterns. We present a greedy depth pursuit algorithm that searches the 3D volume and estimates the depth and intensity of each pixel within the camera field of view. We present simulation results that analyze the performance of the proposed model and algorithm under different FlatCam settings.
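
To make the idea of a greedy search over depth concrete, the following is a minimal toy sketch and not the algorithm developed in this paper. It assumes a matching-pursuit-style selection rule with one depth per pixel, a hypothetical dictionary `A` whose column `p * n_depths + d` holds the sensor response to a unit-intensity source at pixel `p` and candidate depth `d`, and orthonormal columns so that the toy problem is well posed. A real FlatCam system matrix is determined by the mask pattern and is not orthonormal. The function name `greedy_depth_pursuit` and all parameters are illustrative.

```python
import numpy as np


def greedy_depth_pursuit(y, A, n_pixels, n_depths):
    """Toy greedy search: assign each pixel one depth and one intensity.

    Assumption: A[:, p * n_depths + d] is the sensor response to a
    unit-intensity source at pixel p and candidate depth d.
    """
    depth = np.full(n_pixels, -1)
    intensity = np.zeros(n_pixels)
    available = np.ones(n_pixels * n_depths, dtype=bool)
    support = []
    coeffs = np.zeros(0)
    residual = y.copy()
    norms = np.linalg.norm(A, axis=0)

    for _ in range(n_pixels):
        # Score every (pixel, depth) atom by its correlation with the residual.
        score = np.abs(A.T @ residual) / norms
        score[~available] = -np.inf
        best = int(np.argmax(score))
        support.append(best)

        # A pixel can sit at only one depth: retire all of its atoms.
        p = best // n_depths
        available[p * n_depths:(p + 1) * n_depths] = False

        # Re-fit intensities on the selected atoms and update the residual.
        coeffs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coeffs

    for atom, c in zip(support, coeffs):
        p, d = divmod(atom, n_depths)
        depth[p] = d
        intensity[p] = c
    return depth, intensity


# Synthetic, noiseless example with made-up sizes.
rng = np.random.default_rng(0)
n_pixels, n_depths, n_sensor = 8, 3, 64

# Orthonormal stand-in for the depth-dependent modulation patterns.
A, _ = np.linalg.qr(rng.standard_normal((n_sensor, n_pixels * n_depths)))

true_depth = rng.integers(0, n_depths, size=n_pixels)
true_intensity = rng.uniform(0.5, 1.5, size=n_pixels)
y = A[:, np.arange(n_pixels) * n_depths + true_depth] @ true_intensity

est_depth, est_intensity = greedy_depth_pursuit(y, A, n_pixels, n_depths)
```

In this idealized setting the depth-dependent patterns are perfectly distinguishable, so the greedy selection identifies the correct depth for every pixel. With a realistic, correlated system matrix and sensor noise, the selection step can make mistakes, and that is the regime a practical depth pursuit algorithm has to handle.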