Abstract:Text-Based Person Search (TBPS) aims to retrieve pedestrian images from large galleries using natural language descriptions. This task, essential for public safety applications, is hindered by cross-modal discrepancies and ambiguous user queries. We introduce CONQUER, a two-stage framework designed to address these challenges by enhancing cross-modal alignment during training and adaptively refining queries at inference. During training, CONQUER employs multi-granularity encoding, complementary pair mining, and context-guided optimal matching based on Optimal Transport to learn robust embeddings. At inference, a plug-and-play query enhancement module refines vague or incomplete queries via anchor selection and attribute-driven enrichment, without requiring retraining of the backbone. Extensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid demonstrate that CONQUER consistently outperforms strong baselines in both Rank-1 accuracy and mAP, yielding notable improvements in cross-domain and incomplete-query scenarios. These results highlight CONQUER as a practical and effective solution for real-world TBPS deployment. Source code is available at https://github.com/zqxie77/CONQUER.
Abstract:The success of CLIP has driven substantial progress in text-video retrieval. However, current methods often suffer from "blind" feature interaction, where the model struggles to discern key visual information from background noise due to the sparsity of textual queries. To bridge this gap, we draw inspiration from human cognitive behavior and propose the Human Vision-Driven (HVD) model. Our framework establishes a coarse-to-fine alignment mechanism comprising two key components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM mimics the human macro-perception ability by selecting key frames to eliminate temporal redundancy. Subsequently, PFCM simulates micro-perception by aggregating patch features into salient visual entities through an advanced attention mechanism, enabling precise entity-level matching. Extensive experiments on five benchmarks demonstrate that HVD not only captures human-like visual focus but also achieves state-of-the-art performance.
Abstract:Video-text retrieval (VTR) aims to locate relevant videos using natural language queries. Current methods, often based on pre-trained models like CLIP, are hindered by video's inherent redundancy and their reliance on coarse, final-layer features, limiting matching accuracy. To address this, we introduce the HVP-Net (Hierarchical Visual Perception Network), a framework that mines richer video semantics by extracting and refining features from multiple intermediate layers of a vision encoder. Our approach progressively distills salient visual concepts from raw patch-tokens at different semantic levels, mitigating redundancy while preserving crucial details for alignment. This results in a more robust video representation, leading to new state-of-the-art performance on challenging benchmarks including MSRVTT, DiDeMo, and ActivityNet. Our work validates the effectiveness of exploiting hierarchical features for advancing video-text retrieval. Our codes are available at https://github.com/boyun-zhang/HVP-Net.
Abstract:Text-based person search (TBPS) enables the retrieval of person images from large-scale databases using natural language descriptions, offering critical value in surveillance applications. However, a major challenge lies in the labor-intensive process of obtaining high-quality textual annotations, which limits scalability and practical deployment. To address this, we introduce two complementary modules: Multi-Turn Text Generation (MTG) and Multi-Turn Text Interaction (MTI). MTG generates rich pseudo-labels through simulated dialogues with MLLMs, producing fine-grained and diverse visual descriptions without manual supervision. MTI refines user queries at inference time through dynamic, dialogue-based reasoning, enabling the system to interpret and resolve vague, incomplete, or ambiguous descriptions - characteristics often seen in real-world search scenarios. Together, MTG and MTI form a unified and annotation-free framework that significantly improves retrieval accuracy, robustness, and usability. Extensive evaluations demonstrate that our method achieves competitive or superior results while eliminating the need for manual captions, paving the way for scalable and practical deployment of TBPS systems.
Abstract:Text-to-image person search aims to identify an individual based on a text description. To reduce data collection costs, large-scale text-image datasets are created from co-occurrence pairs found online. However, this can introduce noise, particularly mismatched pairs, which degrade retrieval performance. Existing methods often focus on negative samples, amplifying this noise. To address these issues, we propose the Dynamic Uncertainty and Relational Alignment (DURA) framework, which includes the Key Feature Selector (KFS) and a new loss function, Dynamic Softmax Hinge Loss (DSH-Loss). KFS captures and models noise uncertainty, improving retrieval reliability. The bidirectional evidence from cross-modal similarity is modeled as a Dirichlet distribution, enhancing adaptability to noisy data. DSH adjusts the difficulty of negative samples to improve robustness in noisy environments. Our experiments on three datasets show that the method offers strong noise resistance and improves retrieval performance in both low- and high-noise scenarios.