Abstract:Large Language Models (LLMs) have been integrated into recommendation systems to enhance user behavior comprehension. The Retrieval Augmented Generation (RAG) technique is further incorporated into these systems to retrieve more relevant items and improve system performance. However, existing RAG methods rely primarily on textual semantics and often fail to incorporate the most relevant items, limiting the effectiveness of the systems. In this paper, we propose Representation learning for retrieval-Augmented Large Language model Recommendation (RALLRec). Specifically, we enhance textual semantics by prompting LLMs to generate more detailed item descriptions, followed by joint representation learning of textual and collaborative semantics, which are extracted by the LLM and recommendation models, respectively. Considering the potential time-varying characteristics of user interest, a simple yet effective reranking method is further introduced to capture the dynamics of user preference. We conducted extensive experiments on three real-world datasets, and the evaluation results validated the effectiveness of our method. Code is made public at https://github.com/JianXu95/RALLRec.




Abstract:In real-world recommender systems, different tasks are typically addressed using supervised learning on task-specific datasets with carefully designed model architectures. We demonstrate that large language models (LLMs) can function as universal recommendation learners, capable of handling multiple tasks within a unified input-output framework, eliminating the need for specialized model designs. To improve the recommendation performance of LLMs, we introduce a multimodal fusion module for item representation and a sequence-in-set-out approach for efficient candidate generation. When applied to industrial-scale data, our LLM achieves competitive results with expert models elaborately designed for different recommendation tasks. Furthermore, our analysis reveals that recommendation outcomes are highly sensitive to text input, highlighting the potential of prompt engineering in optimizing industrial-scale recommender systems.
Abstract:Large vision language models (LVLMs) have improved the document understanding capabilities remarkably, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks have been limited to handling only a small number of pages and fail to provide a comprehensive analysis of layout elements locating. In this paper, we first define three primary task categories: Long Document Understanding, numerical Reasoning, and cross-element Locating, and then propose a comprehensive benchmark, LongDocURL, integrating above three primary tasks and comprising 20 sub-tasks categorized based on different primary tasks and answer evidences. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs, covering more than 33,000 pages of documents, significantly outperforming existing benchmarks. Subsequently, we conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field.




Abstract:Decision-making in large-scale games is an essential research area in artificial intelligence (AI) with significant real-world impact. However, the limited access to realistic large-scale game environments has hindered research progress in this area. In this paper, we present \textbf{AuctionNet}, a benchmark for bid decision-making in large-scale ad auctions derived from a real-world online advertising platform. AuctionNet is composed of three parts: an ad auction environment, a pre-generated dataset based on the environment, and performance evaluations of several baseline bid decision-making algorithms. More specifically, the environment effectively replicates the integrity and complexity of real-world ad auctions through the interaction of several modules: the ad opportunity generation module employs deep generative models to bridge the gap between simulated and real-world data while mitigating the risk of sensitive data exposure; the bidding module implements diverse auto-bidding agents trained with different decision-making algorithms; and the auction module is anchored in the classic Generalized Second Price (GSP) auction but also allows for customization of auction mechanisms as needed. To facilitate research and provide insights into the game environment, we have also pre-generated a substantial dataset based on the environment. The dataset contains trajectories involving 48 diverse agents competing with each other, totaling over 500 million records and occupying 80GB of storage. Performance evaluations of baseline algorithms such as linear programming, reinforcement learning, and generative models for bid decision-making are also presented as part of AuctionNet. We note that AuctionNet is applicable not only to research on bid decision-making algorithms in ad auctions but also to the general area of decision-making in large-scale games.
Abstract:Effective query-item relevance modeling is pivotal for enhancing user experience and safeguarding user satisfaction in e-commerce search systems. Recently, benefiting from the vast inherent knowledge, Large Language Model (LLM) approach demonstrates strong performance and long-tail generalization ability compared with previous neural-based specialized relevance learning methods. Though promising, current LLM-based methods encounter the following inadequacies in practice: First, the massive parameters and computational demands make it difficult to be deployed online. Second, distilling LLM models to online models is a feasible direction, but the LLM relevance modeling is a black box, and its rich intrinsic knowledge is difficult to extract and apply online. To improve the interpretability of LLM and boost the performance of online relevance models via LLM, we propose an Explainable LLM-driven Multi-dimensional Distillation framework for e-commerce relevance learning, which comprises two core components: (1) An Explainable LLM for relevance modeling (ELLM-rele), which decomposes the relevance learning into intermediate steps and models relevance learning as a Chain-of-Thought (CoT) reasoning, thereby enhancing both interpretability and performance of LLM. (2) A Multi-dimensional Knowledge Distillation (MKD) architecture that transfers the knowledge of ELLM-rele to current deployable interaction-based and representation-based student models from both the relevance score distribution and CoT reasoning aspects. Through distilling the probabilistic and CoT reasoning knowledge, MKD improves both the semantic interaction and long-tail generalization abilities of student models. Extensive offline evaluations and online experiments on Taobao search ad scene demonstrate that our proposed framework significantly enhances e-commerce relevance learning performance and user experience.




Abstract:Time series forecasting remains a critical challenge across various domains, often complicated by high-dimensional data and long-term dependencies. This paper presents a novel transformer architecture for time series forecasting, incorporating two key innovations: parameter sharing (PS) and Spatial-Temporal Segment Attention (SegAtt). We also define the time series segment as the concatenation of sequence patches from the same positions across different variables. The proposed model, PSformer, reduces the number of training parameters through the parameter sharing mechanism, thereby improving model efficiency and scalability. The introduction of SegAtt could enhance the capability of capturing local spatio-temporal dependencies by computing attention over the segments, and improve global representation by integrating information across segments. The combination of parameter sharing and SegAtt significantly improves the forecasting performance. Extensive experiments on benchmark datasets demonstrate that PSformer outperforms popular baselines and other transformer-based approaches in terms of accuracy and scalability, establishing itself as an accurate and scalable tool for time series forecasting.




Abstract:Gross Primary Productivity (GPP), the amount of carbon plants fixed by photosynthesis, is pivotal for understanding the global carbon cycle and ecosystem functioning. Process-based models built on the knowledge of ecological processes are susceptible to biases stemming from their assumptions and approximations. These limitations potentially result in considerable uncertainties in global GPP estimation, which may pose significant challenges to our Net Zero goals. This study presents UFLUX v2.0, a process-informed model that integrates state-of-art ecological knowledge and advanced machine learning techniques to reduce uncertainties in GPP estimation by learning the biases between process-based models and eddy covariance (EC) measurements. In our findings, UFLUX v2.0 demonstrated a substantial improvement in model accuracy, achieving an R^2 of 0.79 with a reduced RMSE of 1.60 g C m^-2 d^-1, compared to the process-based model's R^2 of 0.51 and RMSE of 3.09 g C m^-2 d^-1. Our global GPP distribution analysis indicates that while UFLUX v2.0 and the process-based model achieved similar global total GPP (137.47 Pg C and 132.23 Pg C, respectively), they exhibited large differences in spatial distribution, particularly in latitudinal gradients. These differences are very likely due to systematic biases in the process-based model and differing sensitivities to climate and environmental conditions. This study offers improved adaptability for GPP modelling across diverse ecosystems, and further enhances our understanding of global carbon cycles and its responses to environmental changes.




Abstract:Infrared and visible image fusion aims to combine complementary information from both modalities to provide a more comprehensive scene understanding. However, due to the significant differences between the two modalities, preserving key features during the fusion process remains a challenge. To address this issue, we propose a dual-branch feature decomposition fusion network (DAF-Net) with domain adaptive, which introduces Multi-Kernel Maximum Mean Discrepancy (MK-MMD) into the base encoder and designs a hybrid kernel function suitable for infrared and visible image fusion. The base encoder built on the Restormer network captures global structural information while the detail encoder based on Invertible Neural Networks (INN) focuses on extracting detail texture information. By incorporating MK-MMD, the DAF-Net effectively aligns the latent feature spaces of visible and infrared images, thereby improving the quality of the fused images. Experimental results demonstrate that the proposed method outperforms existing techniques across multiple datasets, significantly enhancing both visual quality and fusion performance. The related Python code is available at https://github.com/xujian000/DAF-Net.




Abstract:Federated Learning (FL) offers a decentralized approach to model training, where data remains local and only model parameters are shared between the clients and the central server. Traditional methods, such as Federated Averaging (FedAvg), linearly aggregate these parameters which are usually trained on heterogeneous data distributions, potentially overlooking the complex, high-dimensional nature of the parameter space. This can result in degraded performance of the aggregated model. While personalized FL approaches can mitigate the heterogeneous data issue to some extent, the limitation of linear aggregation remains unresolved. To alleviate this issue, we investigate the generative approach of diffusion model and propose a novel generative parameter aggregation framework for personalized FL, \texttt{pFedGPA}. In this framework, we deploy a diffusion model on the server to integrate the diverse parameter distributions and propose a parameter inversion method to efficiently generate a set of personalized parameters for each client. This inversion method transforms the uploaded parameters into a latent code, which is then aggregated through denoising sampling to produce the final personalized parameters. By encoding the dependence of a client's model parameters on the specific data distribution using the high-capacity diffusion model, \texttt{pFedGPA} can effectively decouple the complexity of the overall distribution of all clients' model parameters from the complexity of each individual client's parameter distribution. Our experimental results consistently demonstrate the superior performance of the proposed method across multiple datasets, surpassing baseline approaches.




Abstract:Recently, multimodal large language models (MM-LLMs) have achieved great success in many multimodal tasks, but their high computational costs limit their further promotion and application. In the MM-LLMs framework, the main computational consumption step is the processing of concatenated text and visual tokens at the LLM layer. The length of the input token for LLM directly affects the overall training and inference efficiency. In response to this issue, we further studied the visual tokens of MM-LLMs. We found that the similarity between visual and CLS tokens in the visual encoder follows a long-tail distribution. In other words, only a few visual tokens are highly similar to CLS tokens. Therefore, we designed a dynamic pruning algorithm to address this issue. Firstly, for different input samples, we search for the inflection point of their visual CLS token similarity curve and use it as the corresponding segmentation point to trim the visual markers. This process mainly reduces the output of the visual encoder to accelerate the model. Then, in the LLM layer, the concatenated visual text tokens are pruned for the second time. During this process, due to the interaction between visual and textual features, visual and textual tokens with low text correlation are further filtered, achieving a balance between efficiency and performance. The results on multiple datasets show that our proposed method can achieve performance that competes with the original performance when using an average of 22% of the original token quantity. Our source code will be made publicly available following acceptance.