Joint image filters use a guidance image as a prior and transfer its structural details to a target image in order to suppress noise or enhance spatial resolution. Existing methods rely on either explicit filter constructions or hand-designed objective functions, which makes it difficult to understand, improve, and accelerate these filters within a coherent framework. In this paper, we propose a learning-based approach for constructing joint filters based on Convolutional Neural Networks. In contrast to existing methods that consider only the guidance image, the proposed algorithm selectively transfers salient structures that are consistent with both the guidance and target images. We show that a model trained on one type of data, e.g., RGB and depth images, generalizes well to other modalities, e.g., flash/non-flash and RGB/NIR images. We validate the effectiveness of the proposed joint filter through extensive experimental comparisons with state-of-the-art methods.
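
For readers unfamiliar with the explicit filter constructions mentioned above, the following is a minimal sketch of one classical example, a joint (cross) bilateral filter, written with NumPy. It is not the CNN-based method proposed in this paper; the function name `joint_bilateral_filter` and the parameters `radius`, `sigma_s`, and `sigma_r` are illustrative assumptions. Note that the range weights depend only on the guidance image, which is the limitation the abstract attributes to existing methods.

```python
import numpy as np


def joint_bilateral_filter(target, guidance, radius=2, sigma_s=2.0, sigma_r=0.1):
    """Smooth `target` with weights derived from `guidance` (2-D float arrays).

    Illustrative sketch only: a classical explicit joint filter, not the
    learned filter proposed in the paper.
    """
    assert target.shape == guidance.shape
    h, w = target.shape
    # Reflect-pad so every pixel has a full (2*radius+1)^2 neighbourhood.
    t_pad = np.pad(target, radius, mode="reflect")
    g_pad = np.pad(guidance, radius, mode="reflect")
    num = np.zeros((h, w), dtype=np.float64)
    den = np.zeros((h, w), dtype=np.float64)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Neighbour values at offset (dy, dx) for every pixel at once.
            t_shift = t_pad[radius + dy:radius + dy + h, radius + dx:radius + dx + w]
            g_shift = g_pad[radius + dy:radius + dy + h, radius + dx:radius + dx + w]
            # Spatial weight: Gaussian in pixel distance.
            w_s = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma_s ** 2))
            # Range weight: Gaussian in guidance intensity difference,
            # so averaging does not cross edges present in the guidance.
            w_r = np.exp(-((guidance - g_shift) ** 2) / (2.0 * sigma_r ** 2))
            weight = w_s * w_r
            num += weight * t_shift
            den += weight
    # The centre pixel always contributes weight 1, so `den` is never zero.
    return num / den
```

As a usage example, calling `joint_bilateral_filter(noisy_depth, gray_rgb)` on two aligned single-channel arrays (both names hypothetical) would smooth the depth map while keeping edges that appear in the grayscale image.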