Abstract: Edge-based representations are fundamental cues for visual understanding, a principle rooted in early vision research and still central today. We extend this principle to vision-language alignment, showing that isolating and aligning structural cues across modalities can greatly benefit fine-tuning on long, detail-rich captions, with a specific focus on improving cross-modal retrieval. We introduce StruXLIP, a fine-tuning alignment paradigm that extracts edge maps (e.g., with the Canny detector), treating them as proxies for the visual structure of an image, and filters the corresponding captions to emphasize structural cues, making them "structure-centric". Fine-tuning augments the standard alignment loss with three structure-centric losses: (i) aligning edge maps with structural text, (ii) matching local edge regions to textual chunks, and (iii) connecting edge maps to color images to prevent representation drift. From a theoretical standpoint, while standard CLIP maximizes the mutual information between visual and textual embeddings, StruXLIP additionally maximizes the mutual information between multimodal structural representations. This auxiliary objective is intrinsically harder to optimize, guiding the model toward more robust and semantically stable minima and thereby enhancing vision-language alignment. Beyond outperforming current competitors on cross-modal retrieval in both general and specialized domains, our method serves as a general boosting recipe that can be integrated into future approaches in a plug-and-play manner. Code and pretrained models are publicly available at: https://github.com/intelligolabs/StruXLIP.
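
To make the recipe concrete, here is a minimal sketch of the core idea; it is not the authors' released code. Canny edge maps stand in for visual structure, and the standard CLIP contrastive loss is augmented with structure-centric terms. The encoder interfaces, the weighting `lambda_struct`, and the Canny thresholds are illustrative assumptions; the region-to-chunk term (ii) is omitted because it would require local features.

```python
# Minimal sketch (not the authors' code) of the StruXLIP idea: augment the
# standard CLIP contrastive loss with auxiliary losses that align Canny edge
# maps with "structure-centric" captions and with the original color images.
# Loss weighting `lambda_struct` and the Canny thresholds are assumptions.
import cv2
import numpy as np
import torch
import torch.nn.functional as F

def canny_edge_map(image_bgr: np.ndarray) -> np.ndarray:
    """Edge map used as a proxy for the image's visual structure."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, threshold1=100, threshold2=200)
    return cv2.cvtColor(edges, cv2.COLOR_GRAY2BGR)  # replicate to 3 channels

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric CLIP-style contrastive loss between two embedding batches."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def struxlip_loss(img_emb, txt_emb, edge_emb, struct_txt_emb, lambda_struct=0.5):
    """Standard image-text alignment plus two structure-centric terms:
    (i) edge map <-> structural text, and (iii) edge map <-> color image
    to prevent representation drift. The local region-to-chunk term (ii)
    needs patch-level features and is omitted from this sketch."""
    loss = info_nce(img_emb, txt_emb)                          # standard CLIP term
    loss += lambda_struct * info_nce(edge_emb, struct_txt_emb) # (i)
    loss += lambda_struct * info_nce(edge_emb, img_emb)        # (iii)
    return loss
```

In this sketch the edge maps and structure-centric captions would presumably be encoded with the same (or tied) image and text towers used for the color images and full captions, so the auxiliary terms shape the shared embedding space rather than a separate one.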




Abstract: Mobile manipulators are known for their superior mobility over manipulators on fixed bases, offering promising applications in smart industry and housekeeping scenarios. However, the dynamic coupling between the mobile base and the manipulator poses challenges for the mobile manipulator's physical interaction tasks. Current methods suffer from complex modeling processes and poor transferability. To address these issues, this article presents a novel dynamic model of a manipulator on a mobile base that requires only the manipulator dynamics and the kinematic information of the mobile base. Building on this dynamic model, an uncertainty-and-disturbance-estimator-based (UDE-based) dynamic motion/force control scheme is proposed for the mobile manipulator, compensating for the dynamic coupling and other unmodeled uncertainties. Passivity and stability analyses justify the proposed control law. Simulation and experimental results on our mobile manipulator platform demonstrate the feasibility and effectiveness of the proposed methodology.
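
As a rough illustration of the UDE mechanism described above, the following 1-DOF sketch, which is not the paper's implementation, estimates the lumped uncertainty (standing in here for the base-arm dynamic coupling and other unmodeled effects) by low-pass filtering the mismatch between the nominal model and the applied torque, then cancels it in the control law. The nominal mass, PD gains, and filter time constant `T_f` are illustrative assumptions.

```python
# Minimal 1-DOF sketch (not the paper's implementation) of a UDE-based
# controller: the lumped disturbance d in m*qdd = tau + d is estimated by
# low-pass filtering the residual d = m*qdd - tau, then cancelled in the
# control law. Nominal mass, gains, and T_f are illustrative assumptions.
import numpy as np

class UDEController1DOF:
    def __init__(self, m_nom=1.0, kp=25.0, kd=10.0, T_f=0.02, dt=0.001):
        self.m, self.kp, self.kd = m_nom, kp, kd
        self.T_f, self.dt = T_f, dt
        self.d_hat = 0.0      # current disturbance estimate
        self.tau_prev = 0.0   # torque applied at the previous step

    def update(self, q, qd, qdd_meas, q_ref, qd_ref, qdd_ref):
        # UDE step: with true dynamics m*qdd = tau + d, the residual
        # m*qdd_meas - tau_prev equals d; a first-order low-pass filter
        # (unity DC gain, time constant T_f) smooths it into d_hat.
        residual = self.m * qdd_meas - self.tau_prev
        self.d_hat += (self.dt / self.T_f) * (residual - self.d_hat)

        # Nominal feedforward + PD feedback, with the estimate cancelled.
        e, ed = q_ref - q, qd_ref - qd
        tau = self.m * qdd_ref + self.kp * e + self.kd * ed - self.d_hat
        self.tau_prev = tau
        return tau
```

The filter time constant trades estimation bandwidth against noise amplification: a smaller `T_f` tracks fast disturbances, such as coupling forces from an accelerating base, but passes more acceleration-measurement noise into the torque command.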