Abstract:Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, leaving key practical challenges insufficiently addressed -- particularly limited downward scalability in resource-constrained deployments and hallucinations under acoustically challenging conditions. To address these issues, we present NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. Grounded in a principled delineation of functional roles between the encoder and the LLM, we redesign the multi-stage training paradigm to align each module with its intended capability boundary. Specifically, we reformulate the pre-training architecture and objective to mitigate the modality gap and improve parameter efficiency; introduce an iterative asynchronous SFT stage to preserve acoustic fidelity and constrain representation drift; and design an ASR-specialized reinforcement learning stage to further enhance recognition quality and robustness. We additionally incorporate a suite of production-oriented optimizations, including robustness under noisy and silent conditions, real-time streaming inference, and hotword customization via retrieval-augmented generation (RAG). Experiments show that NIM4-ASR achieves state-of-the-art performance on multiple public benchmarks with merely 2.3B parameters, while substantially outperforming larger-scale competitors on internal benchmarks -- particularly in entity-intensive real-world scenarios. NIM4-ASR further supports million-scale hotword customization via RAG with sub-millisecond retrieval latency, enabling efficient adaptation to emerging entities and personalized user requirements.
Abstract:Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a dominant paradigm. Although recent LLM-based ASR models have shown promising performance on public benchmarks, it remains challenging to balance recognition quality with latency and overhead, while hallucinations further limit real-world deployment. In this study, we revisit LLM-based ASR from an entropy allocation perspective and introduce three metrics to characterize how training paradigms allocate entropy reduction between the speech encoder and the LLM. To remedy entropy-allocation inefficiencies in prevailing approaches, we propose a principled multi-stage training strategy grounded in capability-boundary awareness, optimizing parameter efficiency and hallucination robustness. Specifically, we redesign the pretraining strategy to alleviate the speech-text modality gap, and further introduce an iterative asynchronous SFT stage between alignment and joint SFT to preserve functional decoupling and constrain encoder representation drift. Experiments on Mandarin and English benchmarks show that our method achieves competitive performance with state-of-the-art models using only 2.3B parameters, while also effectively mitigating hallucinations through our decoupling-oriented design.




Abstract:Bid optimization for online advertising from single advertiser's perspective has been thoroughly investigated in both academic research and industrial practice. However, existing work typically assume competitors do not change their bids, i.e., the wining price is fixed, leading to poor performance of the derived solution. Although a few studies use multi-agent reinforcement learning to set up a cooperative game, they still suffer the following drawbacks: (1) They fail to avoid collusion solutions where all the advertisers involved in an auction collude to bid an extremely low price on purpose. (2) Previous works cannot well handle the underlying complex bidding environment, leading to poor model convergence. This problem could be amplified when handling multiple objectives of advertisers which are practical demands but not considered by previous work. In this paper, we propose a novel multi-objective cooperative bid optimization formulation called Multi-Agent Cooperative bidding Games (MACG). MACG sets up a carefully designed multi-objective optimization framework where different objectives of advertisers are incorporated. A global objective to maximize the overall profit of all advertisements is added in order to encourage better cooperation and also to protect self-bidding advertisers. To avoid collusion, we also introduce an extra platform revenue constraint. We analyze the optimal functional form of the bidding formula theoretically and design a policy network accordingly to generate auction-level bids. Then we design an efficient multi-agent evolutionary strategy for model optimization. Offline experiments and online A/B tests conducted on the Taobao platform indicate both single advertiser's objective and global profit have been significantly improved compared to state-of-art methods.




Abstract:Bidding optimization is one of the most critical problems in online advertising. Sponsored search (SS) auction, due to the randomness of user query behavior and platform nature, usually adopts keyword-level bidding strategies. In contrast, the display advertising (DA), as a relatively simpler scenario for auction, has taken advantage of real-time bidding (RTB) to boost the performance for advertisers. In this paper, we consider the RTB problem in sponsored search auction, named SS-RTB. SS-RTB has a much more complex dynamic environment, due to stochastic user query behavior and more complex bidding policies based on multiple keywords of an ad. Most previous methods for DA cannot be applied. We propose a reinforcement learning (RL) solution for handling the complex dynamic environment. Although some RL methods have been proposed for online advertising, they all fail to address the "environment changing" problem: the state transition probabilities vary between two days. Motivated by the observation that auction sequences of two days share similar transition patterns at a proper aggregation level, we formulate a robust MDP model at hour-aggregation level of the auction data and propose a control-by-model framework for SS-RTB. Rather than generating bid prices directly, we decide a bidding model for impressions of each hour and perform real-time bidding accordingly. We also extend the method to handle the multi-agent problem. We deployed the SS-RTB system in the e-commerce search auction platform of Alibaba. Empirical experiments of offline evaluation and online A/B test demonstrate the effectiveness of our method.