Query-based object detectors have made significant advancements since the publication of DETR. However, most existing methods still rely on multi-stage encoders and decoders, or a combination of both. Despite achieving high accuracy, the multi-stage paradigm (typically consisting of 6 stages) suffers from issues such as heavy computational burden, prompting us to reconsider its necessity. In this paper, we explore multiple techniques to enhance query-based detectors and, based on these findings, propose a novel model called GOLO (Global Once and Local Once), which follows a two-stage decoding paradigm. Compared to other mainstream query-based models with multi-stage decoders, our model employs fewer decoder stages while still achieving considerable performance. Experimental results on the COCO dataset demonstrate the effectiveness of our approach.
Despite recent advances in semantic segmentation, an inevitable challenge is the performance degradation caused by the domain shift in real application. Current dominant approach to solve this problem is unsupervised domain adaptation (UDA). However, the absence of labeled target data in UDA is overly restrictive and limits performance. To overcome this limitation, a more practical scenario called semi-supervised domain adaptation (SSDA) has been proposed. Existing SSDA methods are derived from the UDA paradigm and primarily focus on leveraging the unlabeled target data and source data. In this paper, we highlight the significance of exploiting the intra-domain information between the limited labeled target data and unlabeled target data, as it greatly benefits domain adaptation. Instead of solely using the scarce labeled data for supervision, we propose a novel SSDA framework that incorporates both inter-domain mixing and intra-domain mixing, where inter-domain mixing mitigates the source-target domain gap and intra-domain mixing enriches the available target domain information. By simultaneously learning from inter-domain mixing and intra-domain mixing, the network can capture more domain-invariant features and promote its performance on the target domain. We also explore different domain mixing operations to better exploit the target domain information. Comprehensive experiments conducted on the GTA5toCityscapes and SYNTHIA2Cityscapes benchmarks demonstrate the effectiveness of our method, surpassing previous methods by a large margin.