The large-scale pretrained model CLIP, trained on 400 million image-text pairs, offers a promising paradigm for tackling vision tasks, albeit at the image level. Later works, such as DenseCLIP and LSeg, extend this paradigm to dense prediction, including semantic segmentation, and have achieved excellent results. However, the above methods either rely on CLIP-pretrained visual backbones or use none-pretrained but heavy backbones such as Swin, while falling ineffective when applied to lightweight backbones. The reason for this is that the lightweitht networks, feature extraction ability of which are relatively limited, meet difficulty embedding the image feature aligned with text embeddings perfectly. In this work, we present a new feature fusion module which tackles this problem and enables language-guided paradigm to be applied to lightweight networks. Specifically, the module is a parallel design of CNN and transformer with a two-way bridge in between, where CNN extracts spatial information and visual context of the feature map from the image encoder, and the transformer propagates text embeddings from the text encoder forward. The core of the module is the bidirectional fusion of visual and text feature across the bridge which prompts their proximity and alignment in embedding space. The module is model-agnostic, which can not only make language-guided lightweight semantic segmentation practical, but also fully exploit the pretrained knowledge of language priors and achieve better performance than previous SOTA work, such as DenseCLIP, whatever the vision backbone is. Extensive experiments have been conducted to demonstrate the superiority of our method.
Semantic segmentation of drone images is critical to many aerial vision tasks as it provides essential semantic details that can compensate for the lack of depth information from monocular cameras. However, maintaining high accuracy of semantic segmentation models for drones requires diverse, large-scale, and high-resolution datasets, which are rare in the field of aerial image processing. Existing datasets are typically small and focus primarily on urban scenes, neglecting rural and industrial areas. Models trained on such datasets are not sufficiently equipped to handle the variety of inputs seen in drone imagery. In the VDD-Varied Drone Dataset, we offer a large-scale and densely labeled dataset comprising 400 high-resolution images that feature carefully chosen scenes, camera angles, and varied light and weather conditions. Furthermore, we have adapted existing drone datasets to conform to our annotation standards and integrated them with VDD to create a dataset 1.5 times the size of fine annotation of Cityscapes. We have developed a novel DeepLabT model, which combines CNN and Transformer backbones, to provide a reliable baseline for semantic segmentation in drone imagery. Our experiments indicate that DeepLabT performs admirably on VDD and other drone datasets. We expect that our dataset will generate considerable interest in drone image segmentation and serve as a foundation for other drone vision tasks. VDD is freely available on our website at https://vddvdd.com .