Abstract:Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation yet forgoes the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is the wrong shape for the job. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto whichever axis the score is most sensitive to. We turn instead to the General Preference Model (GPM), which embeds responses into $k$ skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. Building on this, we propose General Preference Reinforcement Learning (GPRL), which carries the $k$-way structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. The same structure powers a closed-loop drift monitor that detects single-axis exploitation and corrects it on the fly by reweighting dimensions and tightening the trust region. Starting from $\texttt{Llama-3-8B-Instruct}$, GPRL reaches a length-controlled win rate of $56.51\%$ on AlpacaEval~2.0 while also outperforming SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking across extended training runs.
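The per-dimension advantage step described above can be illustrated with a minimal sketch. It assumes each response in a sampled group already has one score per preference dimension; how GPRL derives those scores from the GPM embeddings is not specified in the abstract. The names `group_relative_advantages`, `scores`, and `weights` are hypothetical, and the fixed `weights` stand in for the context-dependent eigenvalues.

```python
import math

def group_relative_advantages(scores, weights, eps=1e-8):
    """Aggregate per-dimension, group-normalized advantages.

    scores[i][d]: score of response i on dimension d (assumed given).
    weights[d]:   positive aggregation weight for dimension d; a stand-in
                  for the context-dependent eigenvalues (assumption).
    Returns one scalar advantage per response in the group.
    """
    n = len(scores)
    total = sum(weights)
    advantages = [0.0] * n
    for d, w in enumerate(weights):
        column = [row[d] for row in scores]
        mean = sum(column) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
        for i, x in enumerate(column):
            # z-score within the group: every dimension ends up on unit
            # scale before weighting, so raw magnitude cannot dominate
            advantages[i] += (w / total) * (x - mean) / (std + eps)
    return advantages

# Dimension 0 has a raw scale 1000x larger than dimension 1.
group = [[1000.0, 0.0], [0.0, 1.0]]
equal = group_relative_advantages(group, [1.0, 1.0])
skewed = group_relative_advantages(group, [3.0, 1.0])
```

With equal weights the two responses tie, because each wins one dimension and both dimensions carry the same weight after normalization. Only the explicit `weights` can tilt the result, not the raw scale of a dimension.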
Abstract:Resource allocation in the multiple-input multiple-output (MIMO) multiple access channel (MAC) is a fundamental problem in multiuser communications, yet it is increasingly treated as non-convex and computationally intractable. This has motivated a large body of heuristic machine learning and successive-approximation methods. Results here show that the MIMO MAC admits canonical convex formulations and present four solvers that together characterize its capacity region. maxRMAC performs weighted sum-rate maximization under per-user energy constraints, minPMAC finds the minimum weighted energy required to support target rates, maxRESMAC performs weighted sum-rate maximization under a total energy constraint, and admMAC tests rate-region feasibility. The solvers exploit the polymatroid structure of the MAC rate region and the separability of the dual Lagrangian across frequency tones, which reduces the problem to parallel per-tone covariance optimizations solved via limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) over Cholesky-like covariance factors. Experiments on spatially correlated MIMO orthogonal frequency-division multiplexing (OFDM) channels show that the proposed solvers match a commercial convex solver in solution quality while running up to two orders of magnitude faster and scaling to regimes where the commercial solver times out. Through broadcast channel (BC) to MAC duality, the same solvers also enable optimal precoder design for the MIMO BC. All solvers are open-sourced and available at https://github.com/muhd-umer/canonical-mac.
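The polymatroid structure mentioned above has a well-known consequence that a toy example can show. It assumes a scalar Gaussian MAC with fixed received signal-to-noise ratios; the solvers in the abstract also optimize covariances across tones, which is not attempted here. Under that assumption, a weighted sum rate is maximized at a corner point found greedily: the user with the largest weight is decoded last and therefore sees no interference. The function name and arguments are hypothetical.

```python
import math

def greedy_weighted_sum_rate(weights, snrs):
    """Best corner point of a scalar Gaussian MAC rate region.

    weights[u]: non-negative rate weight of user u.
    snrs[u]:    received power of user u divided by noise power
                (held fixed in this sketch, not optimized).
    Returns the per-user rates in bits per channel use.
    """
    # Highest weight first: that user is decoded last, interference-free.
    order = sorted(range(len(weights)), key=lambda u: weights[u], reverse=True)
    rates = [0.0] * len(weights)
    interference = 0.0
    for u in order:
        # User u is decoded before every user placed ahead of it in
        # `order`, so those users' signals remain as interference.
        rates[u] = math.log2(1.0 + snrs[u] / (1.0 + interference))
        interference += snrs[u]
    return rates

w = [1.0, 2.0, 0.5]
s = [3.0, 1.0, 2.0]
r = greedy_weighted_sum_rate(w, s)
```

Every corner point has the same unweighted sum rate, `log2(1 + sum(snrs))`. The greedy ordering only decides how that total is split among the users.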
Abstract:Large language model reasoning is often treated as a monolithic capability, relying on binary preference supervision that fails to capture partial progress or fine-grained reasoning quality. We introduce Continuous Utility Direct Preference Optimization (CU-DPO), a framework that aligns models to a portfolio of prompt-based cognitive strategies by replacing binary labels with continuous scores that capture fine-grained reasoning quality. We prove that learning with $K$ strategies yields a $\Theta(K \log K)$ improvement in sample complexity over binary preferences, and that DPO converges to the entropy-regularized utility-maximizing policy. To exploit this signal, we propose a two-stage training pipeline: (i) strategy selection, which optimizes the model to choose the best strategy for a given problem via best-vs-all comparisons, and (ii) execution refinement, which trains the model to correctly execute the selected strategy using margin-stratified pairs. On mathematical reasoning benchmarks, CU-DPO improves strategy selection accuracy from 35-46 percent to 68-78 percent across seven base models, yielding consistent downstream reasoning gains of up to 6.6 points on in-distribution datasets with effective transfer to out-of-distribution tasks.




Abstract:Large Language Models (LLMs) have benefited enormously from scaling, yet these gains are bounded by five fundamental limitations: (1) hallucination, (2) context compression, (3) reasoning degradation, (4) retrieval fragility, and (5) multimodal misalignment. While existing surveys describe these phenomena empirically, they lack a rigorous theoretical synthesis connecting them to the foundational limits of computation, information, and learning. This work closes that gap by presenting a unified, proof-informed framework that formalizes the innate theoretical ceilings of LLM scaling. First, computability and uncomputability imply an irreducible residue of error: for any computably enumerable model family, diagonalization guarantees inputs on which some model must fail, and undecidable queries (e.g., halting-style tasks) induce infinite failure sets for all computable predictors. Second, information-theoretic and statistical constraints bound attainable accuracy even on decidable tasks: finite description length enforces compression error, and long-tail factual knowledge requires prohibitive sample complexity. Third, geometric and computational effects compress long contexts far below their nominal size due to positional under-training, encoding attenuation, and softmax crowding. We further show how likelihood-based training favors pattern completion over inference, how retrieval under token limits suffers from semantic drift and coupling noise, and how multimodal scaling inherits shallow cross-modal alignment. Across sections, we pair theorems and empirical evidence to outline where scaling helps, where it saturates, and where it cannot progress, providing both theoretical foundations and practical mitigation paths such as bounded-oracle retrieval, positional curricula, and sparse or hierarchical attention.
Abstract:Future wireless networks aim to deliver high data rates and lower power consumption while ensuring seamless connectivity, necessitating robust optimization. Large language models (LLMs) have been deployed for generalized optimization scenarios. To take advantage of generative AI (GAI) models, we propose retrieval augmented generation (RAG) for multi-sensor wireless environment perception. Utilizing domain-specific prompt engineering, we apply RAG to efficiently harness multimodal data inputs from sensors in a wireless environment. Key pre-processing pipelines including image-to-text conversion, object detection, and distance calculations for multimodal RAG input from multi-sensor data are proposed to obtain a unified vector database crucial for optimizing LLMs in global wireless tasks. Our evaluation, conducted with OpenAI's GPT and Google's Gemini models, demonstrates an 8%, 8%, 10%, 7%, and 12% improvement in relevancy, faithfulness, completeness, similarity, and accuracy, respectively, compared to conventional LLM-based designs. Furthermore, our RAG-based LLM framework with vectorized databases is computationally efficient, providing real-time convergence under latency constraints.
Abstract:This paper proposes a novel joint channel-estimation and source-detection algorithm using successive interference cancellation (SIC)-aided generative score-based diffusion models. Prior work in this area focuses on massive MIMO scenarios, which are typically characterized by full-rank channels, and fails in low-rank channel scenarios. The proposed algorithm outperforms existing methods in joint source-channel estimation, especially in low-rank scenarios where the number of users exceeds the number of antennas at the access point (AP). The proposed score-based iterative diffusion process estimates the gradient of the prior distribution on partial channels, and recursively updates the estimated channel parts as well as the source. Extensive simulation results show that the proposed method outperforms the baseline methods in terms of normalized mean squared error (NMSE) and symbol error rate (SER) at various signal-to-noise ratios (SNR) in both full-rank and low-rank channel scenarios, with larger gains in the latter.
Abstract:Upcoming Augmented Reality (AR) and Virtual Reality (VR) systems require high data rates ($\geq$ 500 Mbps) and low power consumption for a seamless experience. With an increasing number of subscribing users, the total number of antennas across all transmitting users far exceeds the number of antennas at the access point (AP). This results in a low-rank wireless channel, presenting a bottleneck for uplink communication systems. Current uplink systems that use orthogonal multiple access (OMA), and the proposed non-orthogonal multiple access (NOMA) systems, fail to achieve the required data rates and power consumption under predominantly low-rank channel scenarios. This paper introduces an optimal power-subcarrier allocation algorithm for multi-carrier NOMA, named minPMAC, and an associated time-sharing algorithm that adaptively changes successive interference cancellation decoding orders to maximize sum data rates in these low-rank channels. This Lagrangian-based optimization technique, although globally optimal, is prohibitive in terms of runtime, proving inefficient for real-time scenarios. Hence, we propose a novel near-optimal deep reinforcement learning-based energy sum optimization (DRL-minPMAC) which achieves real-time efficiency. Extensive experimental evaluations show that minPMAC achieves 28\% and 39\% higher data rates than NOMA and OMA baselines. Furthermore, the proposed DRL-minPMAC runs approximately 5 times faster than minPMAC and achieves 83\% of the global optimum data rates in real time.



Abstract:Currently used resource allocation methods for uplink multicarrier non-orthogonal multiple access (MC-NOMA) systems have multiple shortcomings. Current approaches either allocate the same power across all subcarriers to a user, or use heuristic-based near-far, strong channel-weak channel user grouping to assign the decoding order for successive interference cancellation (SIC). This paper proposes a novel method for uplink MC-NOMA that jointly determines the optimal power-subcarrier allocation and the optimal SIC decoding order. Furthermore, the proposed method includes a time-sharing algorithm that dynamically alters the decoding orders of the participating users to achieve the required data rates, even in cases where any single decoding order fails to do so. Extensive experimental evaluations show that the new method achieves higher sum data rates and lower power consumption compared to current NOMA methods.
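Why time-sharing between decoding orders can meet targets that no single order meets is easiest to see in a toy setting. The sketch below assumes a two-user scalar Gaussian MAC with fixed received signal-to-noise ratios and a single carrier. It is not the paper's power-subcarrier allocation, and the function names and the `fraction_a_first` parameter are hypothetical.

```python
import math

def corner_rates(snr_a, snr_b, decode_first):
    """Rates (R_a, R_b) in bits per channel use under SIC.

    The user decoded first treats the other user's signal as noise;
    the user decoded second is interference-free after cancellation.
    """
    if decode_first == "a":
        return (math.log2(1 + snr_a / (1 + snr_b)), math.log2(1 + snr_b))
    return (math.log2(1 + snr_a), math.log2(1 + snr_b / (1 + snr_a)))

def time_share(snr_a, snr_b, fraction_a_first):
    """Average rates when a fraction of the time decodes user a first."""
    ra1, rb1 = corner_rates(snr_a, snr_b, "a")
    ra2, rb2 = corner_rates(snr_a, snr_b, "b")
    t = fraction_a_first
    return (t * ra1 + (1 - t) * ra2, t * rb1 + (1 - t) * rb2)

target = (1.4, 1.4)                # required rates for users a and b
a_first = corner_rates(3.0, 3.0, "a")
b_first = corner_rates(3.0, 3.0, "b")
shared = time_share(3.0, 3.0, 0.5)
```

With both SNRs equal to 3, each fixed decoding order leaves one user below the target, while an even split between the two orders gives both users about 1.40 bits per channel use.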
Abstract:Efficient spectrum allocation has become crucial as the surge in wireless-connected devices demands seamless support for more users and applications, a trend expected to grow with 6G. Innovations in satellite technologies such as SpaceX's Starlink have enabled non-terrestrial networks (NTNs) to work alongside terrestrial networks (TNs) and allocate spectrum based on regional demands. Existing spectrum sharing approaches in TNs use machine learning for interference minimization through power allocation and spectrum sensing, but the unique characteristics of NTNs like varying orbital dynamics and coverage patterns require more sophisticated coordination mechanisms. The proposed work uses a hierarchical deep reinforcement learning (HDRL) approach for efficient spectrum allocation across TN-NTN networks. DRL agents at each level of the TN-NTN hierarchy dynamically learn and allocate spectrum based on regional trends. This framework is 50x faster than the exhaustive search algorithm while achieving 95\% of optimum spectral efficiency. Moreover, it is 3.75x faster than multi-agent DRL, which is commonly used for spectrum sharing, and has a 12\% higher overall average throughput.
Abstract:This paper introduces a novel power allocation and subcarrier optimization algorithm tailored for fixed wireless access (FWA) networks operating under low-rank channel conditions, where the number of subscriber antennas far exceeds the number at the base station (BS). As FWA networks grow to support more users, traditional approaches like orthogonal multiple access (OMA) and non-orthogonal multiple access (NOMA) struggle to maintain high data rates and energy efficiency due to the limited degrees of freedom in low-rank scenarios. Our proposed solution addresses this by combining optimal power-subcarrier allocation with an adaptive time-sharing algorithm that dynamically adjusts decoding orders to optimize performance across multiple users. The algorithm leverages a generalized decision feedback equalizer (GDFE) approach to effectively manage inter-symbol interference and crosstalk, leading to superior data rates and energy savings. Simulation results demonstrate that our approach significantly outperforms existing OMA and NOMA baselines, particularly in low-rank conditions, with substantial gains in both data rate and energy efficiency. The findings highlight the potential of this method to meet the growing demand for scalable, high-performance FWA networks.