Abstract: Learning adaptive visuomotor policies for embodied agents remains a formidable challenge, particularly under cross-embodiment variations such as diverse sensor configurations and dynamic properties. Conventional learning approaches often struggle to separate task-relevant features from domain-specific variations (e.g., lighting, field-of-view, and rotation), leading to poor sample efficiency and catastrophic failure in unseen environments. To bridge this gap, we propose ContrAstive Prompt Orchestration (CAPO), a novel approach for learning visuomotor policies that combines contrastive prompt learning with adaptive prompt orchestration. For prompt learning, we devise a hybrid contrastive learning strategy that integrates visual, temporal action, and text objectives to establish a pool of learnable prompts, where each prompt induces a visual representation encapsulating fine-grained domain factors. Building on these learned prompts, we introduce an adaptive prompt orchestration mechanism that dynamically aggregates them conditioned on current observations. This enables the agent to construct state representations adaptively by identifying the dominant domain factors at each step. Consequently, policy optimization is shielded from irrelevant interference, which mitigates the common issue of overfitting to source domains. Extensive experiments demonstrate that CAPO significantly outperforms state-of-the-art baselines in sample efficiency and asymptotic performance. Crucially, it exhibits superior zero-shot adaptation across unseen target domains characterized by drastic environmental shifts (e.g., illumination) and physical shifts (e.g., field-of-view and rotation), validating its effectiveness as a viable solution for cross-embodiment visuomotor policy adaptation.
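
To illustrate the orchestration step described above, here is a minimal sketch. It is not CAPO's actual implementation, and the abstract does not specify the aggregation rule. The sketch assumes that each prompt in the pool has a key vector and an induced feature vector, that prompts are scored by a dot product with the current observation embedding, and that they are blended with softmax weights. The names `orchestrate`, `prompt_keys`, and `prompt_values`, and all numbers, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def orchestrate(obs_embedding, prompt_keys, prompt_values):
    """Blend prompt-induced features, weighted by similarity to the observation.

    Assumption: dot-product scoring plus softmax weighting stands in for the
    (unspecified) observation-conditioned aggregation in the abstract.
    """
    scores = [sum(o * k for o, k in zip(obs_embedding, key)) for key in prompt_keys]
    weights = softmax(scores)
    dim = len(prompt_values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, prompt_values)) for i in range(dim)]
    return weights, blended


# Toy pool of three prompts (e.g., one per hypothetical domain factor).
obs = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

weights, state = orchestrate(obs, keys, values)
print(weights)  # the first prompt, whose key aligns with obs, gets the largest weight
```

In this toy setup, the prompt whose key best matches the observation dominates the blended state, which is one simple way to read "identifying dominant domain factors" from the current observation.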




Abstract: Post-flood building damage assessment is critical for rapid response and post-disaster reconstruction planning. Current research does not account, in neural network design, for the ways in which the requirements of disaster assessment (DA) differ from those of change detection (CD). This paper focuses on two key differences: 1) building change features in DA satellite images are more subtle than in CD; 2) DA datasets face more severe data scarcity and label imbalance. To address these issues on the model architecture side, the paper benchmarks attention mechanisms on post-flood DA tasks and introduces Simple Prior Attention UNet (SPAUNet) to enhance the model's ability to recognize subtle changes. On the semi-supervised learning (SSL) side, the paper constructs four different combinations of image-level label category reference distributions for consistency training. Experimental results on the flood events of the xBD dataset show that SPAUNet performs strongly in supervised learning experiments, achieving a recall of 79.10% and an F1 score of 71.32% for the damaged class and outperforming CD methods. These results indicate the necessity of DA task-oriented model design. The SSL experiments demonstrate the positive impact of image-level consistency regularization on the model. Using pseudo-labels to form the reference distribution for consistency training yields the best results, demonstrating the potential of using the category distribution of a large amount of unlabeled data for SSL. This paper clarifies the differences between DA and CD tasks and offers a preliminary exploration of model design strategies based on prior attention mechanisms and image-level consistency regularization, establishing new benchmark methods for the post-flood DA task.
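
The abstract reports recall and F1 but not precision. As a small worked example, the sketch below solves the standard F1 definition, F1 = 2PR / (P + R), for the precision P. It assumes that both reported figures refer to the same class and the same evaluation split, which the abstract does not state; the result is an implied value, not one reported by the authors. The function name `implied_precision` is hypothetical.

```python
def implied_precision(recall, f1):
    """Solve F1 = 2PR / (P + R) for P, giving P = F1 * R / (2R - F1)."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("no valid precision exists for these recall/F1 values")
    return f1 * recall / denom


# Figures from the abstract: recall 79.10%, F1 71.32%.
recall, f1 = 0.7910, 0.7132
precision = implied_precision(recall, f1)
print(f"{precision:.4f}")  # roughly 0.65

# Sanity check: plugging the precision back in reproduces the reported F1.
f1_check = 2 * precision * recall / (precision + recall)
```

Under these assumptions, a recall well above the F1 score implies a precision noticeably below it, meaning the model leans toward finding damaged buildings at the cost of some false positives.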