We present a novel approach to reconstructing large or featureless scenes. Given a set of partial reconstructions that result from camera tracking interruptions during the scan of such a scene, our method jointly estimates the camera poses and the room layout. Unlike existing methods, which rely on feature point matching to localize the camera, we exploit the 3D "box" structure of a typical room layout that satisfies the Manhattan World assumption. We first estimate a local layout for each partial scan separately and then combine these local layouts into a globally aligned layout with loop closure. We validate our method quantitatively and qualitatively on real and synthetic scenes of various sizes and complexities. The evaluations and comparisons show that our method is more effective and more accurate than the methods it is compared against.