Image inpainting is an important task in computer vision. As increasingly capable methods have been proposed, inpainted images have come ever closer to reality. However, the reconstructed texture and structure still fall short of human visual expectations. Although ever-larger models have recently been proposed thanks to advances in computer hardware, we aim to build a model suitable for individuals or small institutions. We therefore propose a lightweight model that combines a specialized transformer with a traditional convolutional neural network (CNN). Furthermore, we observe that most prior work considers only the three primary colors (RGB) of inpainted images; we argue this is insufficient and propose a new loss function to enhance color details. Extensive experiments on widely used datasets (Places2 and CelebA) validate the efficacy of our proposed model compared with other state-of-the-art methods.

Index Terms - HSV color space, image inpainting, joint attention mechanism, stripe window, vision transformer