Title: Improving Multimodal Contrastive Learning of Sentence Embeddings with Object-Phrase Alignment

Aug 01, 2025

Kaiyan Zhao, Zhongtao Miao, Yoshimasa Tsuruoka

Abstract: Multimodal sentence embedding models typically leverage image-caption pairs in addition to textual data during training. However, such pairs often contain noise, including redundant or irrelevant information on either the image or caption side. To mitigate this issue, we propose MCSEO, a method that enhances multimodal sentence embeddings by incorporating fine-grained object-phrase alignment alongside traditional image-caption alignment. Specifically, MCSEO utilizes existing segmentation and object detection models to extract accurate object-phrase pairs, which are then used to optimize a contrastive learning objective tailored to object-phrase correspondence. Experimental results on semantic textual similarity (STS) tasks across different backbone models demonstrate that MCSEO consistently outperforms strong baselines, highlighting the significance of precise object-phrase alignment in multimodal representation learning.

* Work in progress
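
The abstract does not spell out the exact training objective, so the following is only an illustrative sketch of how an object-phrase contrastive term could sit alongside an image-caption term. It assumes a standard InfoNCE-style loss with in-batch negatives and cosine similarity. The function names (`info_nce`, `combined_loss`), the temperature of 0.05, and the `weight` parameter are hypothetical choices for illustration, not details taken from MCSEO. Embeddings are plain lists of floats so the example runs without any dependencies.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    anchors[i] is expected to match positives[i]; every other entry of
    `positives` serves as an in-batch negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def combined_loss(caption_emb, image_emb, phrase_emb, object_emb, weight=1.0):
    """Assumed combination: image-caption term plus a weighted object-phrase term."""
    sentence_term = info_nce(caption_emb, image_emb)
    object_term = info_nce(phrase_emb, object_emb)
    return sentence_term + weight * object_term


# Toy batch: two phrases and two object embeddings in a shared 2-D space.
phrases = [[1.0, 0.0], [0.0, 1.0]]
objects_aligned = [[0.9, 0.1], [0.1, 0.9]]
objects_shuffled = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = info_nce(phrases, objects_aligned)
shuffled_loss = info_nce(phrases, objects_shuffled)
```

In this toy batch, `aligned_loss` comes out far smaller than `shuffled_loss`, which is the behaviour a contrastive objective relies on: correctly paired object and phrase embeddings are pulled together, while mismatched pairs are penalised.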
