Abstract:One of the primary challenges in optimizing large language models (LLMs) for long-context inference lies in the high memory consumption of the Key-Value (KV) cache. Existing approaches, such as quantization, have demonstrated promising results in reducing memory usage. However, current quantization methods cannot take both effectiveness and efficiency into account. In this paper, we propose MoQAE, a novel mixed-precision quantization method via mixture of quantization-aware experts. First, we view different quantization bit-width configurations as experts and use the traditional mixture of experts (MoE) method to select the optimal configuration. To avoid the inefficiency caused by inputting tokens one by one into the router in the traditional MoE method, we input the tokens into the router chunk by chunk. Second, we design a lightweight router-only fine-tuning process to train MoQAE with a comprehensive loss to learn the trade-off between model accuracy and memory usage. Finally, we introduce a routing freezing (RF) and a routing sharing (RS) mechanism to further reduce the inference overhead. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms state-of-the-art KV cache quantization approaches in both efficiency and effectiveness.
Abstract:The Spiking Neural Network (SNN) has drawn increasing attention for its energy-efficient, event-driven processing and biological plausibility. To train SNNs via backpropagation, surrogate gradients are used to approximate the non-differentiable spike function, but they only maintain nonzero derivatives within a narrow range of membrane potentials near the firing threshold, referred to as the surrogate gradient support width gamma. We identify a major challenge, termed the dilemma of gamma: a relatively large gamma leads to overactivation, characterized by excessive neuron firing, which in turn increases energy consumption, whereas a small gamma causes vanishing gradients and weakens temporal dependencies. To address this, we propose a temporal Inhibitory Leaky Integrate-and-Fire (ILIF) neuron model, inspired by biological inhibitory mechanisms. This model incorporates interconnected inhibitory units for membrane potential and current, effectively mitigating overactivation while preserving gradient propagation. Theoretical analysis demonstrates ILIF effectiveness in overcoming the gamma dilemma, and extensive experiments on multiple datasets show that ILIF improves energy efficiency by reducing firing rates, stabilizes training, and enhances accuracy. The code is available at github.com/kaisun1/ILIF.
Abstract:Despite advances in spiking neural networks (SNNs) in numerous tasks, their architectures remain highly similar to traditional artificial neural networks (ANNs), restricting their ability to mimic natural connections between biological neurons. This paper develops a new modeling paradigm for SNN with random graph architecture (RGA), termed Cognition-aware SNN (CogniSNN). Furthermore, we improve the expandability and neuroplasticity of CogniSNN by introducing a modified spiking residual neural node (ResNode) to counteract network degradation in deeper graph pathways, as well as a critical path-based algorithm that enables CogniSNN to perform continual learning on new tasks leveraging the features of the data and the RGA learned in the old task. Experiments show that CogniSNN with re-designed ResNode performs outstandingly in neuromorphic datasets with fewer parameters, achieving 95.5% precision in the DVS-Gesture dataset with only 5 timesteps. The critical path-based approach decreases 3% to 5% forgetting while maintaining expected performance in learning new tasks that are similar to or distinct from the old ones. This study showcases the potential of RGA-based SNN and paves a new path for biologically inspired networks based on graph theory.
Abstract:Credit assignment has remained a fundamental challenge in multi-agent reinforcement learning (MARL). Previous studies have primarily addressed this issue through value decomposition methods under the centralized training with decentralized execution paradigm, where neural networks are utilized to approximate the nonlinear relationship between individual Q-values and the global Q-value. Although these approaches have achieved considerable success in various benchmark tasks, they still suffer from several limitations, including imprecise attribution of contributions, limited interpretability, and poor scalability in high-dimensional state spaces. To address these challenges, we propose a novel algorithm, \textbf{QLLM}, which facilitates the automatic construction of credit assignment functions using large language models (LLMs). Specifically, the concept of \textbf{TFCAF} is introduced, wherein the credit allocation process is represented as a direct and expressive nonlinear functional formulation. A custom-designed \textit{coder-evaluator} framework is further employed to guide the generation, verification, and refinement of executable code by LLMs, significantly mitigating issues such as hallucination and shallow reasoning during inference. Extensive experiments conducted on several standard MARL benchmarks demonstrate that the proposed method consistently outperforms existing state-of-the-art baselines. Moreover, QLLM exhibits strong generalization capability and maintains compatibility with a wide range of MARL algorithms that utilize mixing networks, positioning it as a promising and versatile solution for complex multi-agent scenarios.
Abstract:Recently, large language models (LLMs) have been able to handle longer and longer contexts. However, a context that is too long may cause intolerant inference latency and GPU memory usage. Existing methods propose mixed-precision quantization to the key-value (KV) cache in LLMs based on token granularity, which is time-consuming in the search process and hardware inefficient during computation. This paper introduces a novel approach called Cocktail, which employs chunk-adaptive mixed-precision quantization to optimize the KV cache. Cocktail consists of two modules: chunk-level quantization search and chunk-level KV cache computation. Chunk-level quantization search determines the optimal bitwidth configuration of the KV cache chunks quickly based on the similarity scores between the corresponding context chunks and the query, maintaining the model accuracy. Furthermore, chunk-level KV cache computation reorders the KV cache chunks before quantization, avoiding the hardware inefficiency caused by mixed-precision quantization in inference computation. Extensive experiments demonstrate that Cocktail outperforms state-of-the-art KV cache quantization methods on various models and datasets.
Abstract:As object detectors are increasingly deployed as black-box cloud services or pre-trained models with restricted access to the original training data, the challenge of zero-shot object-level out-of-distribution (OOD) detection arises. This task becomes crucial in ensuring the reliability of detectors in open-world settings. While existing methods have demonstrated success in image-level OOD detection using pre-trained vision-language models like CLIP, directly applying such models to object-level OOD detection presents challenges due to the loss of contextual information and reliance on image-level alignment. To tackle these challenges, we introduce a new method that leverages visual prompts and text-augmented in-distribution (ID) space construction to adapt CLIP for zero-shot object-level OOD detection. Our method preserves critical contextual information and improves the ability to differentiate between ID and OOD objects, achieving competitive performance across different benchmarks.
Abstract:Enabling object detectors to recognize out-of-distribution (OOD) objects is vital for building reliable systems. A primary obstacle stems from the fact that models frequently do not receive supervisory signals from unfamiliar data, leading to overly confident predictions regarding OOD objects. Despite previous progress that estimates OOD uncertainty based on the detection model and in-distribution (ID) samples, we explore using pre-trained vision-language representations for object-level OOD detection. We first discuss the limitations of applying image-level CLIP-based OOD detection methods to object-level scenarios. Building upon these insights, we propose RUNA, a novel framework that leverages a dual encoder architecture to capture rich contextual information and employs a regional uncertainty alignment mechanism to distinguish ID from OOD objects effectively. We introduce a few-shot fine-tuning approach that aligns region-level semantic representations to further improve the model's capability to discriminate between similar objects. Our experiments show that RUNA substantially surpasses state-of-the-art methods in object-level OOD detection, particularly in challenging scenarios with diverse and complex object instances.
Abstract:In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE(Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose \textbf{Thinker-Talker} architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni is comparable with the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni's performance in end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.
Abstract:The ability to automatically encircle boundaries with mobile robots is crucial for tasks such as border tracking and object enclosing. Previous research has primarily focused on regular boundaries, often assuming that their geometric equations are known in advance, which is not often the case in practice. In this paper, we investigate a more general case and propose an algorithm that addresses geometric irregularities of boundaries without requiring prior knowledge of their analytical expressions. To achieve this, we develop a Fourier-based curve fitting method for boundary approximation using sampled points, enabling parametric characterization of general 2-D boundaries. This approach allows star-shaped boundaries to be fitted into polar-angle-based parametric curves, while boundaries of other shapes are handled through decomposition. Then, we design a vector field (VF) to achieve the encirclement of the parameterized boundary, wherein a polar radius error is introduced to measure the robot's ``distance'' to the boundary. The controller is finally synthesized using a control barrier function and quadratic programming to mediate some potentially conflicting specifications: boundary encirclement, obstacle avoidance, and limited actuation. In this manner, the VF-guided reference control not only guides the boundary encircling action, but can also be minimally modified to satisfy obstacle avoidance and input saturation constraints. Simulations and experiments are presented to verify the performance of our new method, which can be applied to mobile robots to perform practical tasks such as cleaning chemical spills and environment monitoring.
Abstract:Ocean forecasting is crucial for both scientific research and societal benefits. Currently, the most accurate forecasting systems are global ocean forecasting systems (GOFSs), which represent the ocean state variables (OSVs) as discrete grids and solve partial differential equations (PDEs) governing the transitions of oceanic state variables using numerical methods. However, GOFSs processes are computationally expensive and prone to cumulative errors. Recently, large artificial intelligence (AI)-based models significantly boosted forecasting speed and accuracy. Unfortunately, building a large AI ocean forecasting system that can be considered cross-spatiotemporal and air-sea coupled forecasts remains a significant challenge. Here, we introduce LangYa, a cross-spatiotemporal and air-sea coupled ocean forecasting system. Results demonstrate that the time embedding module in LangYa enables a single model to make forecasts with lead times ranging from 1 to 7 days. The air-sea coupled module effectively simulates air-sea interactions. The ocean self-attention module improves network stability and accelerates convergence during training, and the adaptive thermocline loss function improves the accuracy of thermocline forecasting. Compared to existing numerical and AI-based ocean forecasting systems, LangYa uses 27 years of global ocean data from the Global Ocean Reanalysis and Simulation version 12 (GLORYS12) for training and achieves more reliable deterministic forecasting results for OSVs. LangYa forecasting system provides global ocean researchers with access to a powerful software tool for accurate ocean forecasting and opens a new paradigm for ocean science.