The time-domain audio separation network (TasNet) has achieved remarkable performance in blind source separation (BSS). When the signals are captured by a microphone array, spatial information is also available, but its use in assisting BSS remains to be explored. In this paper, we study a two-stage framework that iteratively refines the estimated signals by combining multi-channel convolutional TasNets (MC-Conv-TasNets) with classic minimum variance distortionless response (MVDR) beamformers. The first stage uses Beam-TasNet to generate estimated single-speaker signals, while the second stage performs guided source separation that takes the first-stage output as additional input. The design of the whole framework, as well as of each stage, follows the principle of ``multi-channel input, multi-channel multi-source output'' (MIMMO), which facilitates iterative signal refinement. Experimental results on the spatialized WSJ0-2MIX dataset show that the proposed framework achieves a signal-to-distortion ratio (SDR) of 20.7 dB, exceeding the baseline Beam-TasNet by 4.2 dB and narrowing the gap to the oracle signal-based MVDR to 2.9 dB.
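To make the beamforming step concrete, the following is a minimal NumPy sketch of how an MVDR beamformer can be derived from an estimated multi-channel source signal. It is an illustration under stated assumptions, not the implementation used in this work: it assumes the widely used reference-microphone formulation $w = \frac{\Phi_N^{-1}\Phi_S}{\mathrm{tr}(\Phi_N^{-1}\Phi_S)}\,u$, time-averaged spatial covariance matrices, and a small diagonal loading term. The function names, tensor layout, and loading value are hypothetical choices made for this example.

```python
import numpy as np


def spatial_covariance(x):
    """Time-averaged spatial covariance of an STFT tensor.

    x: complex array of shape (F, T, C) = (frequency, frame, channel).
    Returns an array of shape (F, C, C).
    """
    return np.einsum("ftc,ftd->fcd", x, x.conj()) / x.shape[1]


def mvdr_weights(phi_s, phi_n, ref=0, loading=1e-6):
    """MVDR weights w = (Phi_N^{-1} Phi_S / tr(Phi_N^{-1} Phi_S)) u_ref.

    phi_s, phi_n: target and interference covariances, shape (F, C, C).
    ref: reference microphone index (selects the one-hot vector u_ref).
    loading: relative diagonal loading (an assumed value, for stability).
    Returns weights of shape (F, C).
    """
    n_ch = phi_n.shape[-1]
    scale = np.trace(phi_n, axis1=1, axis2=2).real / n_ch
    phi_n = phi_n + loading * scale[:, None, None] * np.eye(n_ch)
    numerator = np.linalg.solve(phi_n, phi_s)            # Phi_N^{-1} Phi_S
    denominator = np.trace(numerator, axis1=1, axis2=2)  # tr(Phi_N^{-1} Phi_S)
    return numerator[:, :, ref] / denominator[:, None]


def beamform(weights, x):
    """Apply y(f, t) = w(f)^H x(f, t); x has shape (F, T, C)."""
    return np.einsum("fc,ftc->ft", weights.conj(), x)


def mvdr_from_estimate(mixture, estimate, ref=0):
    """Beamform the mixture using covariances built from a source estimate.

    mixture:  multi-channel mixture STFT, shape (F, T, C).
    estimate: multi-channel estimate of one source, shape (F, T, C),
              e.g. the output of a separation network (assumption).
    Everything in the mixture that is not the estimate is treated as
    interference. Returns (enhanced signal of shape (F, T), weights).
    """
    phi_s = spatial_covariance(estimate)
    phi_n = spatial_covariance(mixture - estimate)
    weights = mvdr_weights(phi_s, phi_n, ref)
    return beamform(weights, mixture), weights
```

Because the filter needs a multi-channel estimate of each source to form the covariances, a stage that outputs multi-channel, multi-source signals can feed this step directly, and the beamformed result can in turn be passed to a later stage. In this sketch the distortionless property holds exactly only when the target covariance is rank one; with network estimates it holds approximately.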