Title: TFANet: Three-Stage Image-Text Feature Alignment Network for Robust Referring Image Segmentation

Sep 16, 2025

Qianqi Lu, Yuxiang Xie, Jing Zhang, Shiwei Zou, Yan Chen, Xidao Luan

Abstract: Referring Image Segmentation (RIS) is a task that segments image regions based on language expressions, requiring fine-grained alignment between two modalities. However, existing methods often struggle with multimodal misalignment and language semantic loss, especially in complex scenes containing multiple visually similar objects, where uniquely described targets are frequently mislocalized or incompletely segmented. To tackle these challenges, this paper proposes TFANet, a Three-stage Image-Text Feature Alignment Network that systematically enhances multimodal alignment through a hierarchical framework comprising three stages: Knowledge Plus Stage (KPS), Knowledge Fusion Stage (KFS), and Knowledge Intensification Stage (KIS). In the first stage, we design the Multiscale Linear Cross-Attention Module (MLAM), which facilitates bidirectional semantic exchange between visual features and textual representations across multiple scales. This establishes rich and efficient alignment between image regions and different granularities of linguistic descriptions. Subsequently, the KFS further strengthens feature alignment through the Cross-modal Feature Scanning Module (CFSM), which applies multimodal selective scanning to capture long-range dependencies and construct a unified multimodal representation. This is essential for modeling long-range cross-modal dependencies and enhancing alignment accuracy in complex scenes. Finally, in the KIS, we propose the Word-level Linguistic Feature-guided Semantic Deepening Module (WFDM) to compensate for semantic degradation introduced in earlier stages.
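
The abstract names a linear cross-attention module with bidirectional exchange between visual and textual features but does not give its equations. The block below is only a minimal sketch of generic bidirectional linear cross-attention applied at several scales, not the authors' MLAM. The `elu(x) + 1` feature map, the residual connections, the function names, and all tensor shapes are assumptions chosen for illustration.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1: a positive feature map often used in linear attention
    # (an assumption here; the abstract does not specify one).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_cross_attention(queries, keys, values, eps=1e-6):
    """queries: (n_q, d), keys: (n_k, d), values: (n_k, d_v) -> (n_q, d_v)."""
    q = feature_map(queries)
    k = feature_map(keys)
    # Summarize keys and values once; the (n_q, n_k) attention matrix
    # is never materialized.
    kv = k.T @ values          # (d, d_v)
    k_sum = k.sum(axis=0)      # (d,)
    numer = q @ kv             # (n_q, d_v)
    denom = q @ k_sum + eps    # (n_q,)
    return numer / denom[:, None]


def bidirectional_exchange(visual, text):
    # Visual tokens query the text, and text tokens query the image.
    # The residual additions are a hypothetical design choice.
    visual_out = visual + linear_cross_attention(visual, text, text)
    text_out = text + linear_cross_attention(text, visual, visual)
    return visual_out, text_out


rng = np.random.default_rng(0)
dim = 32
text = rng.standard_normal((12, dim))  # 12 word tokens (hypothetical)

# Hypothetical feature-map resolutions standing in for "multiple scales".
for height, width in [(32, 32), (16, 16), (8, 8)]:
    visual = rng.standard_normal((height * width, dim))
    v_out, t_out = bidirectional_exchange(visual, text)
    print((height, width), v_out.shape, t_out.shape)
```

Computing `k.T @ values` before multiplying by the queries keeps the cost linear in the number of visual tokens, which is the usual motivation for linear attention on high-resolution feature maps.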
