Alibaba Group




Abstract: The extensive data interaction demands of an immersive metaverse necessitate the adoption of emerging technologies to enable high-capacity communication. Vortex electromagnetic waves with different orbital angular momentum (OAM) modes are spatially orthogonal, providing a novel spatial multiplexing dimension to achieve high-capacity communication. However, the number of orthogonal OAM modes based on a discrete uniform circular array (UCA) is limited by the number of array elements in the UCA, and traditional discrete channel models are unable to accurately capture the physical properties of vortex electromagnetic wave propagation. The continuous-aperture array (CAPA) is composed of densely packed electromagnetic excitation elements, capable of flexibly and efficiently generating the desired surface currents to produce an arbitrary number of mutually orthogonal OAM modes. From the perspective of electromagnetic information theory (EIT), we propose a CAPA-based OAM orthogonal transmission scheme to realize high-capacity communication. We design the surface currents of the CAPA using Fourier basis functions, derive the electromagnetic channel for vortex electromagnetic waves, and investigate the upper bound of the spectral efficiency for CAPA-based OAM orthogonal transmission. This paper establishes a theoretical foundation for applying EIT to the orthogonal transmission of vortex electromagnetic waves, offering a novel solution for achieving efficient, high-capacity CAPA-based communication.
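
The link between the number of array elements and the number of orthogonal OAM modes can be checked numerically. The following is a minimal sketch, not the paper's channel model: it assumes ideal phase-only excitation exp(j*l*phi) on a ring, and the helper names (`ring_mode`, `mode_gram`) as well as the dense 4096-point ring used as a stand-in for a continuous aperture are illustrative assumptions.

```python
import numpy as np

def ring_mode(l, n_points):
    """Unit-norm Fourier excitation exp(j*l*phi) sampled at n_points on a ring."""
    phi = 2 * np.pi * np.arange(n_points) / n_points
    return np.exp(1j * l * phi) / np.sqrt(n_points)

def mode_gram(modes, n_points):
    """Magnitudes of the pairwise inner products between the given modes."""
    V = np.stack([ring_mode(l, n_points) for l in modes])
    return np.abs(V @ V.conj().T)

N = 8  # number of elements in the discrete ring (assumed for illustration)

# N distinct modes on N elements are mutually orthogonal.
G_full = mode_gram(range(-4, 4), N)

# Modes l and l + N produce identical samples on N elements (aliasing),
# so they cannot be separated by the discrete array.
G_alias = mode_gram([1, 1 + N], N)

# On a densely sampled ring, the same two modes are orthogonal again.
G_dense = mode_gram([1, 1 + N], 4096)
```

Here `G_full` and `G_dense` are identity matrices up to rounding, while `G_alias` has off-diagonal entries equal to one, which is the discrete-array limitation described in the abstract.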




Abstract: Graph Neural Networks (GNNs) have demonstrated their superiority in collaborative filtering, where the user-item (U-I) interaction bipartite graph serves as the fundamental data format. However, when graph-structured side information (e.g., multimodal similarity graphs or social networks) is integrated into the U-I bipartite graph, existing graph collaborative filtering methods fall short of achieving satisfactory performance. We quantitatively analyze this problem from a spectral perspective. Recall that a bipartite graph possesses a full spectrum within the range of [-1, 1], with the highest frequency exactly achievable at -1 and the lowest frequency at 1; however, we observe that as more side information is incorporated, the highest frequency of the augmented adjacency matrix progressively shifts rightward. This spectrum shift phenomenon causes previous approaches built for the full spectrum [-1, 1] to assign mismatched importance to different frequencies. To this end, we propose Spectrum Shift Correction (dubbed SSC), which incorporates shifting and scaling factors to enable spectral GNNs to adapt to the shifted spectrum. Unlike previous paradigms of leveraging side information, which necessitate tailored designs for diverse data types, SSC directly connects traditional graph collaborative filtering with any graph-structured side information. Experiments on social and multimodal recommendation demonstrate the effectiveness of SSC, achieving relative improvements of up to 23% without incurring any additional computational overhead.
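
The spectrum shift can be reproduced on a toy graph. The sketch below is an illustration under stated assumptions, not the SSC method: the interaction matrix `R`, the social edges `S`, and the plain symmetric normalization D^{-1/2} A D^{-1/2} are all hypothetical choices made for the example.

```python
import numpy as np

def normalized_adjacency(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}; assumes no isolated nodes."""
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Toy user-item interactions: 4 users x 3 items (connected bipartite graph).
R = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
n_users, n_items = R.shape

# Bipartite adjacency [[0, R], [R^T, 0]].
A = np.block([[np.zeros((n_users, n_users)), R],
              [R.T, np.zeros((n_items, n_items))]])

# Hypothetical side information: a triangle of user-user (social) edges.
S = np.zeros((n_users, n_users))
for u, v in [(0, 1), (1, 2), (0, 2)]:
    S[u, v] = S[v, u] = 1.0

A_aug = A.copy()
A_aug[:n_users, :n_users] += S

eig_bip = np.linalg.eigvalsh(normalized_adjacency(A))
eig_aug = np.linalg.eigvalsh(normalized_adjacency(A_aug))

print("bipartite  min/max eigenvalue:", eig_bip.min(), eig_bip.max())
print("augmented  min/max eigenvalue:", eig_aug.min(), eig_aug.max())
```

The bipartite graph attains the eigenvalue -1, whereas the augmented graph contains an odd cycle and is therefore no longer bipartite, so its smallest eigenvalue lies strictly to the right of -1 while the largest stays at 1.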




Abstract: Foundation models have revolutionized natural language processing and artificial intelligence, significantly enhancing how machines comprehend and generate human languages. Inspired by the success of these foundation models, researchers have developed foundation models for individual scientific domains, including small molecules, materials, proteins, DNA, and RNA. However, these models are typically trained in isolation, lacking the ability to integrate across different scientific domains. Recognizing that entities within these domains can all be represented as sequences, which together form the "language of nature", we introduce Nature Language Model (briefly, NatureLM), a sequence-based science foundation model designed for scientific discovery. Pre-trained with data from multiple scientific domains, NatureLM offers a unified, versatile model that enables various applications including: (i) generating and optimizing small molecules, proteins, RNA, and materials using text instructions; (ii) cross-domain generation/design, such as protein-to-molecule and protein-to-RNA generation; and (iii) achieving state-of-the-art performance in tasks like SMILES-to-IUPAC translation and retrosynthesis on USPTO-50k. NatureLM offers a promising generalist approach for various scientific tasks, including drug discovery (hit generation/optimization, ADMET optimization, synthesis), novel material design, and the development of therapeutic proteins or nucleotides. We have developed NatureLM models in different sizes (1 billion, 8 billion, and 46.7 billion parameters) and observed a clear improvement in performance as the model size increases.

Abstract: Recent advances in code understanding and generation demonstrate that code LLMs fine-tuned on a high-quality instruction dataset can gain powerful capabilities to address wide-ranging code-related tasks. However, most existing methods view each programming language in isolation and ignore the knowledge transfer among different programming languages. To bridge the gap among different programming languages, we introduce a novel multi-agent collaboration framework to enhance multilingual instruction tuning for code LLMs, where multiple language-specific intelligent agent components with generation memory work together to transfer knowledge from one language to another efficiently and effectively. Specifically, we first generate the language-specific instruction data from the code snippets and then provide the generated data as the seed data for language-specific agents. Multiple language-specific agents discuss and collaborate to formulate a new instruction and its corresponding solution (in a new or an existing programming language). To further encourage cross-lingual transfer, each agent stores its generation history as memory and then summarizes its merits and faults. Finally, the high-quality multilingual instruction data is used to encourage knowledge transfer among different programming languages to train Qwen2.5-xCoder. Experimental results on multilingual programming benchmarks demonstrate the superior performance of Qwen2.5-xCoder in sharing common knowledge, highlighting its potential to reduce the cross-lingual gap.

Abstract: Remote sensing pansharpening aims to reconstruct spatial-spectral properties during the fusion of panchromatic (PAN) images and low-resolution multi-spectral (LR-MS) images, finally generating high-resolution multi-spectral (HR-MS) images. The mainstream modeling strategies, i.e., CNN and Transformer, treat the input images as an equal-sized grid of pixels in Euclidean space, which limits them when facing remote sensing images with irregular ground objects. Graphs are a more flexible structure; however, there are two major challenges when modeling spatial-spectral properties with a graph: 1) constructing the customized graph structure for spatial-spectral relationship priors; 2) learning the unified spatial-spectral representation through the graph. To address these challenges, we propose the spatial-spectral heterogeneous graph learning network, named HetSSNet. Specifically, HetSSNet initially constructs the heterogeneous graph structure for pansharpening, which explicitly describes pansharpening-specific relationships. Subsequently, the basic relationship pattern generation module is designed to extract the multiple relationship patterns from the heterogeneous graph. Finally, a relationship pattern aggregation module is exploited to collaboratively learn a unified spatial-spectral representation across different relationships among nodes, with adaptive importance learning from local and global perspectives. Extensive experiments demonstrate the significant superiority and generalization of HetSSNet.




Abstract: In this paper, we introduce DobLIX, a dual-objective learned index specifically designed for Log-Structured Merge (LSM) tree-based key-value stores. Although traditional learned indexes focus exclusively on optimizing index lookups, they often overlook the impact of data access from storage, resulting in performance bottlenecks. DobLIX addresses this by incorporating a second objective, data access optimization, into the learned index training process. This dual-objective approach ensures that both index lookup efficiency and data access costs are minimized, leading to significant improvements in read performance while maintaining write efficiency in real-world LSM-tree systems. Additionally, DobLIX features a reinforcement learning agent that dynamically tunes the system parameters, allowing it to adapt to varying workloads in real time. Experimental results using real-world datasets demonstrate that DobLIX reduces indexing overhead and improves throughput by 1.19 to 2.21 times compared to state-of-the-art methods within RocksDB, a widely used LSM-tree-based storage engine.
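
For readers unfamiliar with learned indexes, the index-lookup objective mentioned above can be illustrated with a generic sketch. This is not DobLIX and not a RocksDB API: the class name `LinearLearnedIndex` is hypothetical, the model is a single least-squares line over sorted keys, and neither the data access objective nor the reinforcement learning tuner is modeled.

```python
import numpy as np

class LinearLearnedIndex:
    """Generic learned index: a linear model predicts a key's position,
    and a bounded local search corrects the prediction error."""

    def __init__(self, sorted_keys):
        self.keys = np.asarray(sorted_keys, dtype=np.float64)
        positions = np.arange(len(self.keys))
        # Least-squares fit of position as a linear function of the key.
        self.slope, self.intercept = np.polyfit(self.keys, positions, 1)
        # Worst-case prediction error over the indexed keys.
        errors = np.abs(self._predict(self.keys) - positions)
        self.max_err = int(np.ceil(errors.max()))

    def _predict(self, key):
        return self.slope * key + self.intercept

    def lookup(self, key):
        """Return the position of key, or None if it is not indexed."""
        n = len(self.keys)
        guess = int(round(float(self._predict(key))))
        # Only the window [guess - max_err, guess + max_err] is searched.
        lo = min(max(guess - self.max_err, 0), n)
        hi = max(min(guess + self.max_err + 1, n), lo)
        i = lo + int(np.searchsorted(self.keys[lo:hi], key))
        if i < n and self.keys[i] == key:
            return i
        return None

# Example with synthetic, strictly increasing integer keys.
keys = np.cumsum(np.random.default_rng(0).integers(1, 10, size=1000))
index = LinearLearnedIndex(keys)
print("search window half-width:", index.max_err)
```

The recorded `max_err` bounds how far the model can be off for any indexed key, so each lookup touches only a small window of positions rather than the full key range.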




Abstract: Text-driven video generation has advanced significantly due to developments in diffusion models. Beyond the training and sampling phases, recent studies have investigated noise priors of diffusion models, as improved noise priors yield better generation results. One recent approach employs the Fourier transform to manipulate noise, marking the initial exploration of frequency operations in this context. However, it often generates videos that lack motion dynamics and imaging details. In this work, we provide a comprehensive theoretical analysis of the variance decay issue present in existing methods, which contributes to the loss of details and motion dynamics. Recognizing the critical impact of noise distribution on generation quality, we introduce FreqPrior, a novel noise initialization strategy that refines noise in the frequency domain. Our method features a novel filtering technique designed to address different frequency signals while maintaining a noise prior distribution that closely approximates a standard Gaussian distribution. Additionally, we propose a partial sampling process that perturbs the latent at an intermediate timestep while finding the noise prior, significantly reducing inference time without compromising quality. Extensive experiments on VBench demonstrate that our method achieves the highest scores in both quality and semantic assessments, resulting in the best overall total score. These results highlight the superiority of our proposed noise prior.
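
A one-dimensional toy experiment shows how frequency-domain mixing can shrink the noise variance. The sketch rests on assumptions that are not taken from the abstract: two independent standard Gaussian signals are blended with a Gaussian low-pass filter and its complement (1 - filter), and the cutoff of 0.1 is arbitrary. It illustrates the variance decay effect only and does not implement FreqPrior's filtering.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1 << 16
z1 = rng.standard_normal(n)  # noise supplying the low frequencies
z2 = rng.standard_normal(n)  # independent noise supplying the high frequencies

# Assumed soft filters: Gaussian low-pass and its complement.
freqs = np.fft.fftfreq(n)
lowpass = np.exp(-0.5 * (freqs / 0.1) ** 2)
highpass = 1.0 - lowpass

# Blend the two noises in the frequency domain and transform back.
mixed = np.fft.ifft(np.fft.fft(z1) * lowpass + np.fft.fft(z2) * highpass).real

# For independent unit-variance inputs, each frequency bin keeps a
# fraction lowpass^2 + highpass^2 <= 1 of its power.
predicted_var = np.mean(lowpass**2 + highpass**2)
empirical_var = mixed.var()

print("predicted variance:", predicted_var)
print("empirical variance:", empirical_var)
```

Because the filter values lie strictly between 0 and 1 in the transition band, the mixed noise has variance below one, so it no longer matches the standard Gaussian prior that the diffusion model expects.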




Abstract: Mamba, with its advantages of global perception and linear complexity, has been widely applied to identify changes in target regions within remote sensing (RS) images captured under complex scenarios and varied conditions. However, existing remote sensing change detection (RSCD) approaches based on Mamba frequently struggle to effectively perceive the inherent locality of change regions, as they directly flatten and scan RS images (i.e., the features of the same change region are not distributed continuously within the sequence but are mixed with features from other regions throughout the sequence). In this paper, we propose a novel locally adaptive SSM-based approach, termed CD-Lamba, which effectively enhances the locality of change detection while maintaining global perception. Specifically, our CD-Lamba includes a Locally Adaptive State-Space Scan (LASS) strategy for locality enhancement, a Cross-Temporal State-Space Scan (CTSS) strategy for bi-temporal feature fusion, and a Window Shifting and Perception (WSP) mechanism to enhance interactions across segmented windows. These strategies are integrated into a multi-scale Cross-Temporal Locally Adaptive State-Space Scan (CT-LASS) module to effectively highlight changes and refine the feature representations of changes. CD-Lamba significantly enhances local-global spatio-temporal interactions in bi-temporal images, offering improved performance in RSCD tasks. Extensive experimental results show that CD-Lamba achieves state-of-the-art performance on four benchmark datasets with a satisfactory efficiency-accuracy trade-off. Our code is publicly available at https://github.com/xwmaxwma/rschange.




Abstract: Generative artificial intelligence, particularly through large language models (LLMs), is poised to transform energy optimization and demand side management (DSM) within microgrids. This paper explores the integration of LLMs into energy management, emphasizing their roles in automating the optimization of DSM strategies with electric vehicles. We investigate challenges and solutions associated with DSM and explore the new opportunities presented by leveraging LLMs. Then, we propose an innovative solution that enhances LLMs with retrieval-augmented generation for automatic problem formulation, code generation, and customized optimization. We present a case study to demonstrate the effectiveness of our proposed solution in charging scheduling and optimization for electric vehicles, highlighting our solution's significant advancements in energy efficiency and user adaptability. This work underscores the potential of LLMs for energy optimization and fosters a new era of intelligent DSM solutions.




Abstract: As the third generation of neural networks, Spiking Neural Networks (SNNs) have gained widespread attention due to their low energy consumption and biological interpretability. Recently, SNNs have made considerable advancements in computer vision. However, efficiently conducting feature extraction and fusion under the spiking characteristics of SNNs for object detection remains a pressing challenge. To address this problem, we propose SpikSSD, a novel Spiking Single Shot Multibox Detector. Specifically, we design a full-spiking backbone network, MDS-ResNet, which effectively adjusts the membrane synaptic input distribution at each layer, achieving better spiking feature extraction. Additionally, for spiking feature fusion, we introduce the Spiking Bi-direction Fusion Module (SBFM), which for the first time realizes bi-direction fusion of spiking features, enhancing the multi-scale detection capability of the model. Experimental results show that SpikSSD achieves 40.8% mAP on the GEN1 dataset, and 76.3% and 52.4% mAP@0.5 on the VOC 2007 and COCO 2017 datasets, respectively, with the lowest firing rate, outperforming existing SNN-based approaches at ultralow energy consumption. This work sets a new benchmark for future research in SNN-based object detection. Our code is publicly available at https://github.com/yimeng-fan/SpikSSD.