Xinyu Li

GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning

Dec 10, 2024

Yicheng Wang, Zhikang Zhang, Jue Wang, David Fan, Zhenlin Xu, Linda Liu, Xiang Hao, Vimal Bhat, Xinyu Li

Abstract: In various video-language learning tasks, the challenge of achieving cross-modality alignment with multi-grained data persists. We propose a method to tackle this challenge from two crucial perspectives: data and modeling. Given the absence of a multi-grained video-text pretraining dataset, we introduce a Granularity EXpansion (GEX) method with Integration and Compression operations to expand the granularity of a single-grained dataset. To better model multi-grained data, we introduce an Iterative Approximation Module (IAM), which embeds multi-grained videos and texts into a unified, low-dimensional semantic space while preserving essential information for cross-modal alignment. Furthermore, GEXIA is highly scalable with no restrictions on the number of video-text granularities for alignment. We evaluate our work on three categories of video tasks across seven benchmark datasets, showcasing state-of-the-art or comparable performance. Remarkably, our model excels in tasks involving long-form video understanding, even though the pretraining dataset only contains short video clips.

Intrinsic Wrapped Gaussian Process Regression Modeling for Manifold-valued Response Variable

Nov 28, 2024

Zhanfeng Wang, Xinyu Li, Jian Qing Shi

Abstract: In this paper, we propose a novel intrinsic wrapped Gaussian process regression model for a response variable measured on a Riemannian manifold. We apply the parallel transport operator to define an intrinsic covariance structure, addressing a critical aspect of constructing a well-defined Gaussian process regression model. We show that the posterior distribution of the regression function is invariant to the choice of orthonormal frames for the coordinate representations of the covariance function. This method can be applied to data situated not only on Euclidean submanifolds but also on manifolds without a natural ambient space. The asymptotic properties for estimating the posterior distribution are established. Numerical studies, including simulation and real-world examples, indicate that the proposed method delivers strong performance.

Video Token Merging for Long-form Video Understanding

Oct 31, 2024

Seon-Ho Lee, Jue Wang, Zhikang Zhang, David Fan, Xinyu Li

Abstract: As the scale of data and models for video understanding rapidly expands, handling long-form video input in transformer-based models presents a practical challenge. Rather than resorting to input sampling or token dropping, which may result in information loss, token merging shows promising results when used in collaboration with transformers. However, the application of token merging for long-form video processing is not trivial. We begin with the premise that token merging should not rely solely on the similarity of video tokens; the saliency of tokens should also be considered. To address this, we explore various video token merging strategies for long-form video classification, starting with a simple extension of image token merging, moving to region-concentrated merging, and finally proposing a learnable video token merging (VTM) algorithm that dynamically merges tokens based on their saliency. Extensive experimental results show that we achieve better or comparable performance on the LVU, COIN, and Breakfast datasets. Moreover, our approach significantly reduces memory costs by 84% and boosts throughput by approximately 6.89 times compared to baseline algorithms.

* 21 pages, NeurIPS 2024
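
The similarity-only merging that the abstract above takes as its starting point can be sketched in a few lines. This is a minimal illustration, not the VTM algorithm from the paper: it greedily merges the most similar pair of token vectors by cosine similarity and ignores saliency entirely. The function names and the size-weighted averaging are assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def merge_tokens(tokens, num_merges):
    """Greedily merge the most similar token pair, num_merges times.

    Hypothetical helper: each entry of the result is (vector, count), where
    count is how many original tokens were averaged into that vector.
    """
    merged = [(list(t), 1) for t in tokens]
    for _ in range(num_merges):
        if len(merged) < 2:
            break
        # Find the pair with the highest cosine similarity.
        best = None
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                score = cosine(merged[i][0], merged[j][0])
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        (vec_i, cnt_i), (vec_j, cnt_j) = merged[i], merged[j]
        # Size-weighted average keeps the result equal to the mean of all originals.
        avg = [(a * cnt_i + b * cnt_j) / (cnt_i + cnt_j) for a, b in zip(vec_i, vec_j)]
        merged[i] = (avg, cnt_i + cnt_j)
        del merged[j]
    return merged

result = merge_tokens([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]], num_merges=1)
```

Here the two nearly parallel tokens collapse into one and the orthogonal token is left alone. A saliency-aware variant would need an additional per-token importance score, which this sketch does not attempt.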

Automatic programming via large language models with population self-evolution for dynamic job shop scheduling problem

Oct 30, 2024

Jin Huang, Xinyu Li, Liang Gao, Qihao Liu, Yue Teng

Abstract: Heuristic dispatching rules (HDRs) are widely regarded as effective methods for solving dynamic job shop scheduling problems (DJSSP) in real-world production environments. However, their performance is highly scenario-dependent, often requiring expert customization. To address this, genetic programming (GP) and gene expression programming (GEP) have been extensively used for automatic algorithm design. Nevertheless, these approaches often face challenges due to high randomness in the search process and limited generalization ability, hindering the application of trained dispatching rules to new scenarios or dynamic environments. Recently, the integration of large language models (LLMs) with evolutionary algorithms has opened new avenues for prompt engineering and automatic algorithm design. To enhance the capabilities of LLMs in automatic HDR design, this paper proposes a novel population self-evolutionary (SeEvo) method, a general search framework inspired by the self-reflective design strategies of human experts. The SeEvo method accelerates the search process and enhances exploration capabilities. Experimental results show that the proposed SeEvo method outperforms GP, GEP, end-to-end deep reinforcement learning methods, and more than 10 common HDRs from the literature, particularly in unseen and dynamic scenarios.
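
To make the notion of a heuristic dispatching rule concrete, here is a minimal sketch under simplifying assumptions: a single machine, all jobs available at time zero, and no dynamic arrivals. The `Job` type and `dispatch` function are hypothetical; shortest processing time (SPT) and earliest due date (EDD) are textbook rules chosen for illustration, not rules reported by the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Job:
    name: str
    processing_time: float
    due_date: float

def spt(job, now):
    """Shortest processing time first: lower value means higher priority."""
    return job.processing_time

def edd(job, now):
    """Earliest due date first: lower value means higher priority."""
    return job.due_date

def dispatch(jobs, rule):
    """Sequence jobs on one machine, always picking the job the rule ranks lowest.

    Returns the processing order and the total tardiness of the schedule.
    """
    queue = list(jobs)
    now = 0.0
    order = []
    total_tardiness = 0.0
    while queue:
        job = min(queue, key=lambda j: rule(j, now))
        queue.remove(job)
        now += job.processing_time
        total_tardiness += max(0.0, now - job.due_date)
        order.append(job.name)
    return order, total_tardiness

jobs = [Job("A", 4, 5), Job("B", 1, 10), Job("C", 2, 3)]
spt_order, spt_tardiness = dispatch(jobs, spt)
edd_order, edd_tardiness = dispatch(jobs, edd)
```

On this toy instance the two rules produce different sequences and different total tardiness, which illustrates the scenario dependence the abstract describes. An automatic design method searches over the body of the priority function instead of fixing it by hand.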

MambaEviScrib: Mamba and Evidence-Guided Consistency Make CNN Work Robustly for Scribble-Based Weakly Supervised Ultrasound Image Segmentation

Sep 28, 2024

Xiaoxiang Han, Xinyu Li, Jiang Shang, Yiman Liu, Keyan Chen, Qiaohong Liu, Qi Zhang

Abstract: Segmenting anatomical structures and lesions from ultrasound images contributes to disease assessment, diagnosis, and treatment. Weakly supervised learning (WSL) based on sparse annotation has achieved encouraging performance and demonstrated the potential to reduce annotation costs. However, ultrasound images often suffer from issues such as poor contrast, unclear edges, as well as varying sizes and locations of lesions. This makes it challenging for convolutional networks with local receptive fields to extract global morphological features from the sparse information provided by scribble annotations. Recently, the visual Mamba based on state space sequence models (SSMs) has significantly reduced computational complexity while ensuring long-range dependencies compared to Transformers. Consequently, for the first time, we apply scribble-based WSL to ultrasound image segmentation and propose a novel hybrid CNN-Mamba framework. Furthermore, due to the characteristics of ultrasound images and insufficient supervision signals, existing consistency regularization often filters out predictions near decision boundaries, leading to unstable predictions of edges. Hence, we introduce the Dempster-Shafer theory (DST) of evidence to devise an Evidence-Guided Consistency (EGC) strategy, which leverages high-evidence predictions more likely to occur near high-density regions to guide low-evidence predictions potentially present near decision boundaries for optimization. During training, the collaboration between the CNN branch and the Mamba branch in the proposed framework draws inspiration from each other based on the EGC strategy. Extensive experiments on four public ultrasound datasets for binary-class and multi-class segmentation demonstrate the competitiveness of the proposed method. The scribble-annotated dataset and code will be made available on https://github.com/GtLinyer/MambaEviScrib.
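
The abstract does not give the formulas behind its evidence scores, so the following is only a sketch of one common way evidence is turned into beliefs and an uncertainty mass: the subjective-logic formulation often used in evidential deep learning, where per-class evidence e_k defines Dirichlet parameters alpha_k = e_k + 1. Whether the paper uses exactly this mapping is an assumption, and `evidence_to_opinion` is a hypothetical name.

```python
def evidence_to_opinion(evidence):
    """Map non-negative per-class evidence to belief masses and an uncertainty mass.

    Assumed formulation: with K classes and S = sum(e_k + 1),
    belief b_k = e_k / S and uncertainty u = K / S, so sum(b) + u == 1.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes  # Dirichlet strength S
    beliefs = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return beliefs, uncertainty

# A pixel with strong evidence for class 0 versus a pixel with no evidence at all.
confident_beliefs, confident_u = evidence_to_opinion([8.0, 0.0])
unsure_beliefs, unsure_u = evidence_to_opinion([0.0, 0.0])
```

Under this mapping, a prediction with little total evidence carries a large uncertainty mass, which is the kind of signal a consistency strategy could use to let high-evidence predictions guide low-evidence ones.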

Benchmarking Sub-Genre Classification For Mainstage Dance Music

Sep 10, 2024

Hongzhi Shu, Xinglin Li, Hongyu Jiang, Minghao Fu, Xinyu Li

Abstract: Music classification, with a wide range of applications, is one of the most prominent tasks in music information retrieval. To address the absence of comprehensive datasets and high-performing methods in the classification of mainstage dance music, this work introduces a novel benchmark comprising a new dataset and a baseline. Our dataset extends the number of sub-genres to cover most recent mainstage live sets by top DJs worldwide in music festivals. A continuous soft labeling approach is employed to account for tracks that span multiple sub-genres, preserving the inherent sophistication. For the baseline, we developed deep learning models that outperform current state-of-the-art multimodal language models, which struggle to identify house music sub-genres, emphasizing the need for specialized models trained on fine-grained datasets. Our benchmark is applicable to scenarios such as music recommendation, DJ set curation, and interactive multimedia, for which we also provide video demos. Our code is on https://anonymous.4open.science/r/Mainstage-EDM-Benchmark/.

* Submitted to ICASSP 2025
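
Soft labels of the kind the abstract mentions are typically trained with a cross-entropy that accepts a probability distribution as the target instead of a single class index. The sketch below shows that loss in plain Python. The three sub-genre slots and the 0.7/0.3 split are invented for illustration, and the paper's actual labeling scheme and loss may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

# Hypothetical track that is 70% one sub-genre and 30% another.
soft_target = [0.7, 0.3, 0.0]
loss_uniform = soft_cross_entropy([0.0, 0.0, 0.0], soft_target)
loss_aligned = soft_cross_entropy([2.0, 1.0, -3.0], soft_target)
```

With a one-hot target this reduces to ordinary cross-entropy. With a soft target, the loss is lowest when the predicted distribution matches the label proportions, so a track spanning two sub-genres is not forced into one class.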

GANPrompt: Enhancing Robustness in LLM-Based Recommendations with GAN-Enhanced Diversity Prompts

Aug 19, 2024

Xinyu Li, Chuang Zhao, Hongke Zhao, Likang Wu, Ming HE

Abstract: In recent years, LLMs have demonstrated remarkable proficiency in comprehending and generating natural language, with a growing prevalence in the domain of recommender systems. However, LLMs continue to face a significant challenge: they are highly susceptible to the wording of prompts. This inconsistency in response to minor alterations in prompt input may compromise the accuracy and resilience of recommendation models. To address this issue, this paper proposes GANPrompt, a multi-dimensional large language model prompt diversity framework based on Generative Adversarial Networks (GANs). The framework enhances the model's adaptability and stability to diverse prompts by integrating GAN generation techniques with the deep semantic understanding capabilities of LLMs. GANPrompt first trains a generator capable of producing diverse prompts by analysing multidimensional user behavioural data. These diverse prompts are then used to train the LLM to improve its performance in the face of unseen prompts. Furthermore, to ensure a high degree of diversity and relevance of the prompts, this study introduces a mathematical theory-based diversity constraint mechanism that optimises the generated prompts to ensure that they are not only superficially distinct, but also semantically cover a wide range of user intentions. Through extensive experiments on multiple datasets, we demonstrate the effectiveness of the proposed framework, especially in improving the adaptability and robustness of recommender systems in complex and dynamic environments. The experimental results demonstrate that GANPrompt yields substantial enhancements in accuracy and robustness relative to existing state-of-the-art methodologies.

Text-Guided Video Masked Autoencoder

Aug 01, 2024

David Fan, Jue Wang, Shuai Liao, Zhikang Zhang, Vimal Bhat, Xinyu Li

Abstract: Recent video masked autoencoder (MAE) works have designed improved masking algorithms focused on saliency. These works leverage visual cues such as motion to mask the most salient regions. However, the robustness of such visual cues depends on how often input videos match underlying assumptions. On the other hand, natural language description is an information-dense representation of video that implicitly captures saliency without requiring modality-specific assumptions, and has not been explored yet for video MAE. To this end, we introduce a novel text-guided masking algorithm (TGM) that masks the video regions with highest correspondence to paired captions. Without leveraging any explicit visual cues for saliency, our TGM is competitive with state-of-the-art masking algorithms such as motion-guided masking. To further benefit from the semantics of natural language for masked reconstruction, we next introduce a unified framework for joint MAE and masked video-text contrastive learning. We show that across existing masking algorithms, unifying MAE and masked video-text contrastive learning improves downstream performance compared to pure MAE on a variety of video recognition tasks, especially for linear probe. Within this unified framework, our TGM achieves the best relative performance on five action recognition datasets and one egocentric dataset, highlighting the complementary nature of natural language for masked video modeling.

* Accepted to ECCV 2024
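
The idea of masking the regions with the highest correspondence to the caption can be sketched as a top-k selection over similarity scores. This is a simplified illustration, not the paper's TGM implementation: it assumes patch and caption embeddings already live in a shared space, uses cosine similarity as the correspondence measure, and the function name and toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def text_guided_mask(patch_embeddings, caption_embedding, mask_ratio):
    """Return the sorted indices of the patches most similar to the caption.

    The number of masked patches is floor(mask_ratio * number of patches).
    """
    num_masked = int(mask_ratio * len(patch_embeddings))
    scores = [cosine(p, caption_embedding) for p in patch_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_masked])

patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
caption = [1.0, 0.0]
masked = text_guided_mask(patches, caption, mask_ratio=0.5)
```

In this toy case the two patches pointing in the caption's direction are selected for masking, so the reconstruction target is concentrated on the content the text describes.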

Improving Multi-modal Recommender Systems by Denoising and Aligning Multi-modal Content and User Feedback

Jun 18, 2024

Guipeng Xv, Xinyu Li, Ruobing Xie, Chen Lin, Chong Liu, Feng Xia, Zhanhui Kang, Leyu Lin

Abstract: Multi-modal recommender systems (MRSs) are pivotal in diverse online web platforms and have garnered considerable attention in recent years. However, previous studies overlook the challenges of (1) noisy multi-modal content, (2) noisy user feedback, and (3) aligning multi-modal content with user feedback. To tackle these challenges, we propose Denoising and Aligning Multi-modal Recommender System (DA-MRS). To mitigate multi-modal noise, DA-MRS first constructs item-item graphs determined by consistent content similarity across modalities. To denoise user feedback, DA-MRS associates the probability of observed feedback with multi-modal content and devises a denoised BPR loss. Furthermore, DA-MRS implements Alignment guided by User preference to enhance task-specific item representation and Alignment guided by graded Item relations to provide finer-grained alignment. Extensive experiments verify that DA-MRS is a plug-and-play framework and achieves significant and consistent improvements across various datasets, backbone models, and noisy scenarios.
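
For readers unfamiliar with the loss that the abstract modifies, here is a sketch of the standard Bayesian Personalized Ranking (BPR) loss, which penalizes cases where an unobserved item scores higher than an observed one. The denoised variant proposed in the paper is not reproduced here, since the abstract does not specify it. The function name and sample scores are illustrative.

```python
import math

def bpr_loss(pos_scores, neg_scores):
    """Mean of -log(sigmoid(pos - neg)) over paired positive and negative item scores."""
    losses = []
    for pos, neg in zip(pos_scores, neg_scores):
        diff = pos - neg
        # -log(sigmoid(diff)) == log(1 + exp(-diff)); branch to avoid overflow.
        if diff >= 0:
            losses.append(math.log1p(math.exp(-diff)))
        else:
            losses.append(-diff + math.log1p(math.exp(diff)))
    return sum(losses) / len(losses)

well_ranked = bpr_loss([5.0, 4.0], [0.0, 1.0])
badly_ranked = bpr_loss([0.0, 1.0], [5.0, 4.0])
tied = bpr_loss([0.0], [0.0])
```

The loss shrinks toward zero as observed items are scored further above sampled negatives. A denoising scheme would additionally have to decide how much to trust each observed interaction, which is outside this sketch.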

Unraveling and Mitigating Retriever Inconsistencies in Retrieval-Augmented Large Language Models

Jun 04, 2024

Mingda Li, Xinyu Li, Yifan Chen, Wenfeng Xuan, Weinan Zhang

Abstract: Although Retrieval-Augmented Large Language Models (RALMs) demonstrate their superiority in terms of factuality, they do not consistently outperform the original retrieval-free Language Models (LMs). Our experiments reveal that this example-level performance inconsistency exists not only between retrieval-augmented and retrieval-free LMs but also among different retrievers. To understand this phenomenon, we investigate the degeneration behavior of RALMs and theoretically decompose it into four categories. Further analysis based on our decomposition reveals that the innate difference in knowledge sources and the unpredictable degeneration of the reader model contribute most to the inconsistency. Drawing from our analysis, we introduce Ensemble of Retrievers (EoR), a trainable framework that can adaptively retrieve from different knowledge sources and effectively decrease unpredictable reader errors. Our experiments on Open Domain Question Answering show that EoR substantially improves performance over the RALM with a single retriever by considerably reducing inconsistent behaviors.

* ACL 2024 (findings)
