Blind modulation classification is an important step in implementing cognitive radio networks. The multiple-input multiple-output (MIMO) technique is widely used in military and civil communication systems. Because prior information about channel parameters is lacking and signals overlap in MIMO systems, traditional likelihood-based and feature-based approaches cannot be applied directly in these scenarios. Hence, in this paper, to resolve the problem of blind modulation classification in MIMO systems, a time-frequency analysis method based on the windowed short-time Fourier transform is used to analyse the time-frequency characteristics of time-domain modulated signals. The extracted time-frequency characteristics are then converted into RGB spectrogram images, and a convolutional neural network based on transfer learning is applied to classify the modulation types from these images. Finally, a decision fusion module fuses the classification results of all the receive antennas. Through simulations, we analyse the classification performance at different signal-to-noise ratios (SNRs). The results indicate that, for the single-input single-output (SISO) network, our proposed scheme achieves average classification accuracies of 92.37% and 99.12% at SNRs of -4 dB and 10 dB, respectively. For the MIMO network, our scheme achieves average classification accuracies of 80.42% and 87.92% at -4 dB and 10 dB, respectively. The proposed scheme outperforms existing classification methods based on baseband signals.
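
As a rough illustration of the processing chain described above, the following NumPy-only sketch computes a windowed short-time Fourier transform of one received stream, maps it to an RGB image, and fuses per-antenna decisions. The window length, hop size, colour ramp, and majority-vote fusion rule are assumptions made for illustration only, since the abstract does not specify them, and the transfer-learning CNN is omitted entirely. The function names are hypothetical.

```python
import numpy as np

def stft_magnitude_db(x, win_len=64, hop=16):
    """Hann-windowed STFT of a complex baseband signal, in dB (freq x time).

    win_len and hop are illustrative values, not taken from the paper.
    """
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + win_len] * window for i in range(n_frames)]
    )
    # Centre zero frequency so the spectrogram is symmetric around DC.
    spec = np.fft.fftshift(np.fft.fft(frames, axis=1), axes=1)
    return 20.0 * np.log10(np.abs(spec).T + 1e-12)

def to_rgb_image(spec_db):
    """Normalise to [0, 1] and apply a simple blue-green-red ramp.

    This ramp is a placeholder for whatever colormap is actually used.
    """
    lo, hi = spec_db.min(), spec_db.max()
    s = (spec_db - lo) / (hi - lo + 1e-12)
    r = np.clip(2.0 * s - 1.0, 0.0, 1.0)   # strong bins tend to red
    b = np.clip(1.0 - 2.0 * s, 0.0, 1.0)   # weak bins tend to blue
    g = 1.0 - r - b                        # mid-range bins tend to green
    return (np.stack([r, g, b], axis=-1) * 255).astype(np.uint8)

def fuse_decisions(labels):
    """Majority vote over per-antenna class indices (assumed fusion rule).

    Ties are resolved in favour of the smallest class index.
    """
    counts = np.bincount(np.asarray(labels))
    return int(np.argmax(counts))

# Toy QPSK stream: 128 symbols, 8 samples per symbol, light noise.
rng = np.random.default_rng(0)
symbols = np.exp(1j * (np.pi / 4 + np.pi / 2 * rng.integers(0, 4, 128)))
signal = np.repeat(symbols, 8)
signal = signal + 0.1 * (rng.standard_normal(1024) + 1j * rng.standard_normal(1024))

image = to_rgb_image(stft_magnitude_db(signal))
```

In a MIMO receiver, one such image would be produced per receive antenna, each would be classified separately, and the resulting class indices would be passed to `fuse_decisions`.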