Referring Expression Segmentation (RES) is an emerging task in computer vision that segments target instances in images based on text descriptions. However, its development is hampered by the high cost of segmentation labels. To address this issue, we propose a new learning task for RES called Omni-supervised Referring Expression Segmentation (Omni-RES), which aims to make full use of unlabeled, fully labeled and weakly labeled data, e.g., referring points or grounding boxes, for efficient RES training. To accomplish this task, we also propose a novel yet strong baseline method for Omni-RES based on the recently popular teacher-student learning, where the weak labels are not directly transformed into supervision signals but are used as a yardstick to select and refine high-quality pseudo-masks for teacher-student learning. To validate the proposed Omni-RES method, we apply it to a set of state-of-the-art RES models and conduct extensive experiments on multiple RES datasets. The experimental results demonstrate the clear advantages of Omni-RES over the fully supervised and semi-supervised training schemes. For instance, with only 10% fully labeled data, Omni-RES can help the base model achieve 100% of the fully supervised performance, and it also outperforms the semi-supervised alternative by a large margin, e.g., +14.93% on RefCOCO and +14.95% on RefCOCO+. More importantly, Omni-RES also enables the use of large-scale vision-language datasets like Visual Genome to facilitate low-cost RES training, and achieves new SOTA performance on RES, e.g., 80.66 on RefCOCO.
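To make the idea of using a weak label as a yardstick concrete, the following is a minimal sketch, not the method of the paper. It assumes the weak label is a grounding box and applies a hypothetical rule: a teacher pseudo-mask is kept only if its tight bounding box overlaps the weak box with an IoU above a threshold, and a kept mask is refined by clearing the pixels outside the box. The function names, the IoU criterion and the threshold value of 0.5 are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]            # binary mask, mask[y][x] in {0, 1}
Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), inclusive pixel coordinates


def mask_to_box(mask: Mask) -> Optional[Box]:
    """Tightest box around the foreground pixels, or None for an empty mask."""
    points = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not points:
        return None
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two inclusive pixel boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0]) + 1
    ih = min(a[3], b[3]) - max(a[1], b[1]) + 1
    inter = max(iw, 0) * max(ih, 0)
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)


def select_and_refine(pseudo_mask: Mask, weak_box: Box,
                      iou_threshold: float = 0.5) -> Optional[Mask]:
    """Hypothetical rule: keep the teacher mask only if it agrees with the
    weak box, then clip the kept mask to that box."""
    pred_box = mask_to_box(pseudo_mask)
    if pred_box is None or box_iou(pred_box, weak_box) < iou_threshold:
        return None  # rejected: this sample gives no pseudo-mask to the student
    x0, y0, x1, y1 = weak_box
    return [
        [v if (x0 <= x <= x1 and y0 <= y <= y1) else 0 for x, v in enumerate(row)]
        for y, row in enumerate(pseudo_mask)
    ]
```

In this sketch the weak box never serves as a training target itself. It only decides whether a teacher prediction is trusted and trims it, which mirrors the select-and-refine role described above.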
Parameter-efficient transfer learning (PETL) is an emerging research topic aimed at inexpensively adapting large-scale pre-trained models to downstream tasks. Recent advances have achieved great success in saving storage costs for various vision tasks by updating or injecting a small number of parameters instead of full fine-tuning. However, we notice that most existing PETL methods still incur non-negligible latency during inference. In this paper, we propose a parameter-efficient and computationally friendly adapter for giant vision models, called RepAdapter. Specifically, we prove that the adaptation modules, even with a complex structure, can be seamlessly integrated into most giant vision models via structural re-parameterization. This property makes RepAdapter zero-cost during inference. In addition to computational efficiency, RepAdapter is more effective and lightweight than existing PETL methods due to its sparse structure and our careful deployment. To validate RepAdapter, we conduct extensive experiments on 27 benchmark datasets covering three vision tasks, i.e., image classification, video classification and semantic segmentation. Experimental results show the superior performance and efficiency of RepAdapter over the state-of-the-art PETL methods. For instance, by updating only 0.6% of the parameters, we can improve the performance of ViT from 38.8 to 55.1 on Sun397. Its generalizability is also validated on a range of vision models, i.e., ViT, CLIP, Swin-Transformer and ConvNeXt. Our source code is released at https://github.com/luogen1996/RepAdapter.
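The zero-cost property rests on a simple algebraic fact: a purely linear adapter placed before a linear layer can be folded into that layer's weights. The sketch below illustrates this with a simplified stand-in, a residual adapter h = x + x W_a + b_a followed by a frozen layer y = h W + b. This is an assumption for illustration and not the actual RepAdapter structure. The merged parameters are W' = (I + W_a) W and b' = b_a W + b, and all function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add_bias(x: Matrix, bias: Vector) -> Matrix:
    """Add a bias vector to every row of x."""
    return [[v + bias[j] for j, v in enumerate(row)] for row in x]


def adapter_then_linear(x: Matrix, w_a: Matrix, b_a: Vector,
                        w: Matrix, b: Vector) -> Matrix:
    """Training-time path: residual linear adapter, then the frozen layer."""
    ax = add_bias(matmul(x, w_a), b_a)
    h = [[xv + av for xv, av in zip(xr, ar)] for xr, ar in zip(x, ax)]
    return add_bias(matmul(h, w), b)


def merge(w_a: Matrix, b_a: Vector, w: Matrix, b: Vector) -> Tuple[Matrix, Vector]:
    """Fold the adapter into the layer: W' = (I + W_a) W, b' = b_a W + b."""
    d = len(w_a)
    i_plus_wa = [[w_a[i][j] + (1.0 if i == j else 0.0) for j in range(d)]
                 for i in range(d)]
    w_merged = matmul(i_plus_wa, w)
    b_merged = [bw + bv for bw, bv in zip(matmul([b_a], w)[0], b)]
    return w_merged, b_merged
```

After merging, inference uses only `add_bias(matmul(x, w_merged), b_merged)`, which has the same shape and cost as the original layer. The fold is exact only because no nonlinearity sits between the adapter and the layer it is merged into.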