Abstract:The Chain-of-Thought (CoT) technique has proven effective in improving the performance of large language models (LLMs) on complex reasoning tasks. However, the performance gains are inconsistent across tasks, and the underlying mechanism remains a long-standing research question. In this work, we make a preliminary observation that the monotonicity of token probability distributions may be correlated with the gains achieved through CoT reasoning. Leveraging this insight, we propose two indicators based on the token probability distribution to assess CoT effectiveness across different tasks. By combining instance-level indicators with a logistic regression model, we introduce Dynamic CoT, a method that dynamically selects between CoT and direct answering. Furthermore, we extend Dynamic CoT to closed-source models by transferring decision strategies learned from open-source models. Our indicators for assessing CoT effectiveness achieve an accuracy of 89.2\%, and Dynamic CoT reduces token consumption by more than 35\% while maintaining high accuracy. Overall, our work offers a novel perspective on the underlying mechanisms of CoT reasoning and provides a framework for its more efficient deployment.
Abstract:The emergence of agentic recommender systems powered by Large Language Models (LLMs) represents a paradigm shift in personalized recommendations, leveraging LLMs' advanced reasoning and role-playing capabilities to enable autonomous, adaptive decision-making. Unlike traditional recommendation approaches, agentic recommender systems can dynamically gather and interpret user-item interactions from complex environments, generating robust recommendation strategies that generalize across diverse scenarios. However, the field currently lacks standardized evaluation protocols to systematically assess these methods. To address this critical gap, we propose: (1) an interactive textual recommendation simulator incorporating rich user and item metadata and three typical evaluation scenarios (classic, evolving-interest, and cold-start recommendation tasks); (2) a unified modular framework for developing and studying agentic recommender systems; and (3) the first comprehensive benchmark comparing 10 classical and agentic recommendation methods. Our findings demonstrate the superiority of agentic systems and establish actionable design guidelines for their core components. The benchmark environment has been rigorously validated through an open challenge and remains publicly available with a continuously maintained leaderboard\footnote[2]{https://tsinghua-fib-lab.github.io/AgentSocietyChallenge/pages/overview.html}, fostering ongoing community engagement and reproducible research. The benchmark is available at: \href{https://huggingface.co/datasets/SGJQovo/AgentRecBench}{https://huggingface.co/datasets/SGJQovo/AgentRecBench}.
Abstract:The new mid-band (6-24 GHz), a spectrum with continuous bandwidth that combines the coverage benefits of low frequencies with the capacity advantages of high frequencies, has attracted significant attention from both academia and industry. Since outdoor environments represent the primary application scenario for mobile communications, this paper presents the first comprehensive review and summary of multi-scenario and multi-frequency channel characteristics based on extensive outdoor new mid-band channel measurement data, covering the UMa, UMi, and O2I scenarios. Specifically, we survey progress on channel characteristics such as path loss, delay spread, angular spread, channel sparsity, capacity, and near-field spatial non-stationarity. Then, considering that satellite communication will be an important component of future communication systems, we examine the impact of clutter loss in air-ground communications. Our analysis of the frequency dependence of mid-band clutter loss suggests that its impact is not significant. Additionally, given that penetration loss is frequency-dependent, we summarize its variation within the FR3 band. Based on experimental results, comparisons with the standard model reveal that while the 3GPP TR 38.901 model remains a useful reference for penetration loss in wood and glass, it shows significant deviations for concrete and glass, indicating the need for further refinement. In summary, the findings of this survey provide both empirical data and theoretical support for the deployment of the mid-band in future communication systems, as well as guidance for optimizing mid-band base station deployment in outdoor environments. This survey offers a reference for improving standard models and advancing channel modeling.