Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jingyu Li

Composed Multi-modal Retrieval: A Survey of Approaches and Applications

Mar 03, 2025

Kun Zhang, Jingyu Li, Zhe Li, Jingjing Zhang

Figure 1 for Composed Multi-modal Retrieval: A Survey of Approaches and Applications

Figure 2 for Composed Multi-modal Retrieval: A Survey of Approaches and Applications

Figure 3 for Composed Multi-modal Retrieval: A Survey of Approaches and Applications

Figure 4 for Composed Multi-modal Retrieval: A Survey of Approaches and Applications

Abstract:With the rapid growth of multi-modal data from social media, short video platforms, and e-commerce, content-based retrieval has become essential for efficiently searching and utilizing heterogeneous information. Over time, retrieval techniques have evolved from Unimodal Retrieval (UR) to Cross-modal Retrieval (CR) and, more recently, to Composed Multi-modal Retrieval (CMR). CMR enables users to retrieve images or videos by integrating a reference visual input with textual modifications, enhancing search flexibility and precision. This paper provides a comprehensive review of CMR, covering its fundamental challenges, technical advancements, and categorization into supervised, zero-shot, and semi-supervised learning paradigms. We discuss key research directions, including data augmentation, model architecture, and loss optimization in supervised CMR, as well as transformation frameworks and external knowledge integration in zero-shot CMR. Additionally, we highlight the application potential of CMR in composed image retrieval, video retrieval, and person retrieval, which have significant implications for e-commerce, online search, and public security. Given its ability to refine and personalize search experiences, CMR is poised to become a pivotal technology in next-generation retrieval systems. A curated list of related works and resources is available at: https://github.com/kkzhang95/Awesome-Composed-Multi-modal-Retrieval

Via

Access Paper or Ask Questions

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Jan 21, 2025

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang(+25 more)

Abstract:This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc. (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.

Via

Access Paper or Ask Questions

Probing Speaker-specific Features in Speaker Representations

Jan 09, 2025

Aemon Yat Fei Chiu, Paco Kei Ching Fung, Roger Tsz Yeung Li, Jingyu Li, Tan Lee

Figure 1 for Probing Speaker-specific Features in Speaker Representations

Figure 2 for Probing Speaker-specific Features in Speaker Representations

Figure 3 for Probing Speaker-specific Features in Speaker Representations

Figure 4 for Probing Speaker-specific Features in Speaker Representations

Abstract:This study explores speaker-specific features encoded in speaker embeddings and intermediate layers of speech self-supervised learning (SSL) models. By utilising a probing method, we analyse features such as pitch, tempo, and energy across prominent speaker embedding models and speech SSL models, including HuBERT, WavLM, and Wav2vec 2.0. The results reveal that speaker embeddings like CAM++ excel in energy classification, while speech SSL models demonstrate superior performance across multiple features due to their hierarchical feature encoding. Intermediate layers effectively capture a mix of acoustic and para-linguistic information, with deeper layers refining these representations. This investigation provides insights into model design and highlights the potential of these representations for downstream applications, such as speaker verification and text-to-speech synthesis, while laying the groundwork for exploring additional features and advanced probing methods.

Via

Access Paper or Ask Questions

Contrastive Representation for Interactive Recommendation

Dec 24, 2024

Jingyu Li, Zhiyong Feng, Dongxiao He, Hongqi Chen, Qinghang Gao, Guoli Wu

Figure 1 for Contrastive Representation for Interactive Recommendation

Figure 2 for Contrastive Representation for Interactive Recommendation

Figure 3 for Contrastive Representation for Interactive Recommendation

Figure 4 for Contrastive Representation for Interactive Recommendation

Abstract:Interactive Recommendation (IR) has gained significant attention recently for its capability to quickly capture dynamic interest and optimize both short and long term objectives. IR agents are typically implemented through Deep Reinforcement Learning (DRL), because DRL is inherently compatible with the dynamic nature of IR. However, DRL is currently not perfect for IR. Due to the large action space and sample inefficiency problem, training DRL recommender agents is challenging. The key point is that useful features cannot be extracted as high-quality representations for the recommender agent to optimize its policy. To tackle this problem, we propose Contrastive Representation for Interactive Recommendation (CRIR). CRIR efficiently extracts latent, high-level preference ranking features from explicit interaction, and leverages the features to enhance users' representation. Specifically, the CRIR provides representation through one representation network, and refines it through our proposed Preference Ranking Contrastive Learning (PRCL). The key insight of PRCL is that it can perform contrastive learning without relying on computations involving high-level representations or large potential action sets. Furthermore, we also propose a data exploiting mechanism and an agent training mechanism to better adapt CRIR to the DRL backbone. Extensive experiments have been carried out to show our method's superior improvement on the sample efficiency while training an DRL-based IR agent.

* AAAI-2025 Accepted paper

Via

Access Paper or Ask Questions

FairSort: Learning to Fair Rank for Personalized Recommendations in Two-Sided Platforms

Nov 30, 2024

Guoli Wu, Zhiyong Feng, Shizhan Chen, Hongyue Wu, Xiao Xue, Jianmao Xiao, Guodong Fan, Hongqi Chen, Jingyu Li

Figure 1 for FairSort: Learning to Fair Rank for Personalized Recommendations in Two-Sided Platforms

Figure 2 for FairSort: Learning to Fair Rank for Personalized Recommendations in Two-Sided Platforms

Figure 3 for FairSort: Learning to Fair Rank for Personalized Recommendations in Two-Sided Platforms

Figure 4 for FairSort: Learning to Fair Rank for Personalized Recommendations in Two-Sided Platforms

Abstract:Traditional recommendation systems focus on maximizing user satisfaction by suggesting their favorite items. This user-centric approach may lead to unfair exposure distribution among the providers. On the contrary, a provider-centric design might become unfair to the users. Therefore, this paper proposes a re-ranking model FairSort\footnote{\textbf{Reproducibility:}The code and datasets are available at \url{https://github.com/13543024276/FairSort}} to find a trade-off solution among user-side fairness, provider-side fairness, and personalized recommendations utility. Previous works habitually treat this issue as a knapsack problem, incorporating both-side fairness as constraints. In this paper, we adopt a novel perspective, treating each recommendation list as a runway rather than a knapsack. In this perspective, each item on the runway gains a velocity and runs within a specific time, achieving re-ranking for both-side fairness. Meanwhile, we ensure the Minimum Utility Guarantee for personalized recommendations by designing a Binary Search approach. This can provide more reliable recommendations compared to the conventional greedy strategy based on the knapsack problem. We further broaden the applicability of FairSort, designing two versions for online and offline recommendation scenarios. Theoretical analysis and extensive experiments on real-world datasets indicate that FairSort can ensure more reliable personalized recommendations while considering fairness for both the provider and user.

Via

Access Paper or Ask Questions

An Investigation of Reprogramming for Cross-Language Adaptation in Speaker Verification Systems

Nov 18, 2024

Jingyu Li, Aemon Yat Fei Chiu, Tan Lee

Figure 1 for An Investigation of Reprogramming for Cross-Language Adaptation in Speaker Verification Systems

Figure 2 for An Investigation of Reprogramming for Cross-Language Adaptation in Speaker Verification Systems

Figure 3 for An Investigation of Reprogramming for Cross-Language Adaptation in Speaker Verification Systems

Figure 4 for An Investigation of Reprogramming for Cross-Language Adaptation in Speaker Verification Systems

Abstract:Language mismatch is among the most common and challenging domain mismatches in deploying speaker verification (SV) systems. Adversarial reprogramming has shown promising results in cross-language adaptation for SV. The reprogramming is implemented by padding learnable parameters on the two sides of input speech signals. In this paper, we investigate the relationship between the number of padded parameters and the performance of the reprogrammed models. Sufficient experiments are conducted with different scales of SV models and datasets. The results demonstrate that reprogramming consistently improves the performance of cross-language SV, while the improvement is saturated or even degraded when using larger padding lengths. The performance is mainly determined by the capacity of the original SV models instead of the number of padded parameters. The SV models with larger scales have higher upper bounds in performance and can endure longer padding without performance degradation.

* Accepted by ISCSLP 2024

Via

Access Paper or Ask Questions

Dual-path Collaborative Generation Network for Emotional Video Captioning

Aug 06, 2024

Cheng Ye, Weidong Chen, Jingyu Li, Lei Zhang, Zhendong Mao

Figure 1 for Dual-path Collaborative Generation Network for Emotional Video Captioning

Figure 2 for Dual-path Collaborative Generation Network for Emotional Video Captioning

Figure 3 for Dual-path Collaborative Generation Network for Emotional Video Captioning

Figure 4 for Dual-path Collaborative Generation Network for Emotional Video Captioning

Abstract:Emotional Video Captioning is an emerging task that aims to describe factual content with the intrinsic emotions expressed in videos. The essential of the EVC task is to effectively perceive subtle and ambiguous visual emotional cues during the caption generation, which is neglected by the traditional video captioning. Existing emotional video captioning methods perceive global visual emotional cues at first, and then combine them with the video features to guide the emotional caption generation, which neglects two characteristics of the EVC task. Firstly, their methods neglect the dynamic subtle changes in the intrinsic emotions of the video, which makes it difficult to meet the needs of common scenes with diverse and changeable emotions. Secondly, as their methods incorporate emotional cues into each step, the guidance role of emotion is overemphasized, which makes factual content more or less ignored during generation. To this end, we propose a dual-path collaborative generation network, which dynamically perceives visual emotional cues evolutions while generating emotional captions by collaborative learning. Specifically, in the dynamic emotion perception path, we propose a dynamic emotion evolution module, which first aggregates visual features and historical caption features to summarize the global visual emotional cues, and then dynamically selects emotional cues required to be re-composed at each stage. Besides, in the adaptive caption generation path, to balance the description of factual content and emotional cues, we propose an emotion adaptive decoder. Thus, our methods can generate emotion-related words at the necessary time step, and our caption generation balances the guidance of factual content and emotional cues well. Extensive experiments on three challenging datasets demonstrate the superiority of our approach and each proposed module.

* Acccepted by ACM Multimedia 2024, oral

Via

Access Paper or Ask Questions

DSNet: A Novel Way to Use Atrous Convolutions in Semantic Segmentation

Jun 06, 2024

Zilu Guo, Liuyang Bian, Xuan Huang, Hu Wei, Jingyu Li, Huasheng Ni

Figure 1 for DSNet: A Novel Way to Use Atrous Convolutions in Semantic Segmentation

Figure 2 for DSNet: A Novel Way to Use Atrous Convolutions in Semantic Segmentation

Figure 3 for DSNet: A Novel Way to Use Atrous Convolutions in Semantic Segmentation

Figure 4 for DSNet: A Novel Way to Use Atrous Convolutions in Semantic Segmentation

Abstract:Atrous convolutions are employed as a method to increase the receptive field in semantic segmentation tasks. However, in previous works of semantic segmentation, it was rarely employed in the shallow layers of the model. We revisit the design of atrous convolutions in modern convolutional neural networks (CNNs), and demonstrate that the concept of using large kernels to apply atrous convolutions could be a more powerful paradigm. We propose three guidelines to apply atrous convolutions more efficiently. Following these guidelines, we propose DSNet, a Dual-Branch CNN architecture, which incorporates atrous convolutions in the shallow layers of the model architecture, as well as pretraining the nearly entire encoder on ImageNet to achieve better performance. To demonstrate the effectiveness of our approach, our models achieve a new state-of-the-art trade-off between accuracy and speed on ADE20K, Cityscapes and BDD datasets. Specifically, DSNet achieves 40.0% mIOU with inference speed of 179.2 FPS on ADE20K, and 80.4% mIOU with speed of 81.9 FPS on Cityscapes. Source code and models are available at Github: https://github.com/takaniwa/DSNet.

Via

Access Paper or Ask Questions

Label-efficient Multi-organ Segmentation Method with Diffusion Model

Feb 23, 2024

Yongzhi Huang, Jinxin Zhu, Haseeb Hassan, Liyilei Su, Jingyu Li, Binding Huang

Figure 1 for Label-efficient Multi-organ Segmentation Method with Diffusion Model

Figure 2 for Label-efficient Multi-organ Segmentation Method with Diffusion Model

Figure 3 for Label-efficient Multi-organ Segmentation Method with Diffusion Model

Figure 4 for Label-efficient Multi-organ Segmentation Method with Diffusion Model

Abstract:Accurate segmentation of multiple organs in Computed Tomography (CT) images plays a vital role in computer-aided diagnosis systems. Various supervised-learning approaches have been proposed recently. However, these methods heavily depend on a large amount of high-quality labeled data, which is expensive to obtain in practice. In this study, we present a label-efficient learning approach using a pre-trained diffusion model for multi-organ segmentation tasks in CT images. First, a denoising diffusion model was trained using unlabeled CT data, generating additional two-dimensional (2D) CT images. Then the pre-trained denoising diffusion network was transferred to the downstream multi-organ segmentation task, effectively creating a semi-supervised learning model that requires only a small amount of labeled data. Furthermore, linear classification and fine-tuning decoder strategies were employed to enhance the network's segmentation performance. Our generative model at 256x256 resolution achieves impressive performance in terms of Fr\'echet inception distance, spatial Fr\'echet inception distance, and F1-score, with values of 11.32, 46.93, and 73.1\%, respectively. These results affirm the diffusion model's ability to generate diverse and realistic 2D CT images. Additionally, our method achieves competitive multi-organ segmentation performance compared to state-of-the-art methods on the FLARE 2022 dataset, particularly in limited labeled data scenarios. Remarkably, even with only 1\% and 10\% labeled data, our method achieves Dice similarity coefficients (DSCs) of 71.56\% and 78.51\% after fine-tuning, respectively. The method achieves a DSC score of 51.81\% using just four labeled CT scans. These results demonstrate the efficacy of our approach in overcoming the limitations of supervised learning heavily reliant on large-scale labeled data.

Via

Access Paper or Ask Questions

Creating Personalized Synthetic Voices from Articulation Impaired Speech Using Augmented Reconstruction Loss

Jan 08, 2024

Yusheng Tian, Jingyu Li, Tan Lee

Abstract:This research is about the creation of personalized synthetic voices for head and neck cancer survivors. It is focused particularly on tongue cancer patients whose speech might exhibit severe articulation impairment. Our goal is to restore normal articulation in the synthesized speech, while maximally preserving the target speaker's individuality in terms of both the voice timbre and speaking style. This is formulated as a task of learning from noisy labels. We propose to augment the commonly used speech reconstruction loss with two additional terms. The first term constitutes a regularization loss that mitigates the impact of distorted articulation in the training speech. The second term is a consistency loss that encourages correct articulation in the generated speech. These additional loss terms are obtained from frame-level articulation scores of original and generated speech, which are derived using a separately trained phone classifier. Experimental results on a real case of tongue cancer patient confirm that the synthetic voice achieves comparable articulation quality to unimpaired natural speech, while effectively maintaining the target speaker's individuality. Audio samples are available at https://myspeechproject.github.io/ArticulationRepair/.

* Accepted to ICASSP 2024

Via

Access Paper or Ask Questions