While image registration has been studied in the remote sensing community for decades, registering multimodal data [e.g., optical, light detection and ranging (LiDAR), synthetic aperture radar (SAR), and map data] remains a challenging problem because of the significant nonlinear intensity differences between such data. To address this problem, we present a novel, fast, and robust matching framework that integrates local descriptors for multimodal registration. In the proposed framework, a local descriptor [such as the Histogram of Oriented Gradients (HOG), Local Self-Similarity, or Speeded-Up Robust Features] is first extracted at each pixel to form a pixel-wise feature representation of an image. We then define a similarity measure on this feature representation in the frequency domain using the Fast Fourier Transform (FFT), and apply a template matching scheme to detect control points between images. We also propose a novel pixel-wise feature representation based on the orientated gradients of images, named channel features of orientated gradients (CFOG). This feature is an extension of the pixel-wise HOG descriptor and outperforms it in both matching performance and computational efficiency. The major advantages of the proposed framework are (1) the representation of structural similarity through pixel-wise feature description and (2) high computational efficiency due to the use of the FFT. Moreover, we design an automatic registration system for very large multimodal images based on the proposed framework. Experimental results obtained on many different types of multimodal images show that the proposed framework achieves superior matching performance with respect to state-of-the-art methods, and demonstrate the effectiveness of the designed system, which shows very good potential for large-size image registration in real applications.
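The matching pipeline summarized above (pixel-wise features, a similarity measure evaluated in the frequency domain, then template matching) can be illustrated with a minimal NumPy sketch. It is not the CFOG descriptor or the exact similarity measure of the proposed framework: the feature is a simplified one-hot orientation channel map, the similarity is plain cross-correlation summed over channels, and the names `orientation_channels` and `fft_match` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np


def orientation_channels(img, n_bins=8):
    """Simplified pixel-wise feature (an assumption, not CFOG itself):
    each pixel's gradient magnitude is assigned to one unsigned-orientation bin,
    then the per-pixel feature vector is L2-normalized."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    # Unsigned orientation in [0, pi) so that contrast reversal maps to the same bin.
    ang = np.mod(np.arctan2(gy, gx), np.pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    feat = np.zeros(img.shape + (n_bins,))
    rows, cols = np.indices(img.shape)
    feat[rows, cols, bins] = mag
    norm = np.linalg.norm(feat, axis=2, keepdims=True)
    return feat / (norm + 1e-8)


def fft_match(search_feat, template_feat):
    """Find the template's top-left offset inside the search feature map by
    maximizing the channel-summed cross-correlation, computed with the FFT."""
    H, W, _ = search_feat.shape
    h, w, _ = template_feat.shape
    # Zero-pad the template to the search size so both spectra have equal shape.
    padded = np.zeros_like(search_feat)
    padded[:h, :w, :] = template_feat
    Fs = np.fft.fft2(search_feat, axes=(0, 1))
    Ft = np.fft.fft2(padded, axes=(0, 1))
    # Correlation theorem: corr[v] = sum_x T[x] * S[x + v], per channel.
    corr = np.fft.ifft2(Fs * np.conj(Ft), axes=(0, 1)).real.sum(axis=2)
    # Keep only offsets where the template lies fully inside the search area
    # (no circular wrap-around).
    valid = corr[: H - h + 1, : W - w + 1]
    r, c = np.unravel_index(np.argmax(valid), valid.shape)
    return int(r), int(c)


# Usage: a synthetic "multimodal" pair in which the second image has inverted contrast.
rng = np.random.default_rng(0)
image_a = rng.random((64, 64))
image_b = 1.0 - image_a  # crude stand-in for a nonlinear intensity difference
feat_a = orientation_channels(image_a)
feat_b = orientation_channels(image_b)
template = feat_b[20:36, 30:46]  # 16 x 16 template window taken at offset (20, 30)
offset = fft_match(feat_a, template)
print(offset)  # (20, 30)
```

Because the features are computed once per image and the correlation for all candidate offsets is obtained from a single pair of forward transforms and one inverse transform per channel, the cost does not grow with the number of offsets tested, which is the efficiency argument behind evaluating the similarity in the frequency domain.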