Renmin University of China
Abstract:Large reasoning models, often post-trained on long chain-of-thought (long CoT) data with reinforcement learning, achieve state-of-the-art performance on mathematical, coding, and domain-specific reasoning benchmarks. However, their logical reasoning capabilities - fundamental to human cognition and independent of domain knowledge - remain understudied. To address this gap, we introduce LogiEval, a holistic benchmark for evaluating logical reasoning in large reasoning models. LogiEval spans diverse reasoning types (deductive, inductive, analogical, and abductive) and task formats (e.g., logical sequence, argument analysis), sourced from high-quality human examinations (e.g., LSAT, GMAT). Our experiments demonstrate that modern reasoning models excel at 4-choice argument analysis problems and analogical reasoning, surpassing human performance, yet exhibit uneven capabilities across reasoning types and formats, highlighting limitations in their generalization. Our analysis reveals that human performance does not mirror model failure distributions. To foster further research, we curate LogiEval-Hard, a challenging subset identified through a novel screening paradigm where small-model failures (Qwen3-30B-A3B) reliably predict difficulties for larger models. Modern models show striking, consistent failures on LogiEval-Hard. This demonstrates that fundamental reasoning bottlenecks persist across model scales, and establishes LogiEval-Hard as both a diagnostic tool and a rigorous testbed for advancing logical reasoning in LLMs.
Abstract:Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning (RL) with simple rule-based rewards. However, existing zero-RL approaches are inherently ``on-policy'', limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. We introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments zero-RL with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Notably, we propose policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Remarkably, LUFFY achieves an over +7.0 average gain across six math benchmarks and an advantage of over +6.2 points in out-of-distribution tasks. It also substantially surpasses imitation-based supervised fine-tuning (SFT), particularly in generalization. Analysis shows LUFFY not only imitates effectively but also explores beyond demonstrations, offering a scalable path to train generalizable reasoning models with off-policy guidance.
Abstract:Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by directly optimizing human preferences without an explicit reward model. We find that during DPO training, the reference model plays the role of a data weight adjuster. However, the common practice of initializing the policy and reference models identically in DPO can lead to inefficient data utilization and impose a performance ceiling. Meanwhile, the lack of a reference model in Simple Preference Optimization (SimPO) reduces training robustness and necessitates stricter conditions to prevent catastrophic forgetting. In this work, we propose Pre-DPO, a simple yet effective DPO-based training paradigm that enhances preference optimization performance by leveraging a guiding reference model. This reference model provides foresight into the optimal policy state achievable through the training preference data, serving as a guiding mechanism that adaptively assigns higher weights to samples more suitable for the model and lower weights to those less suitable. Extensive experiments on AlpacaEval 2.0 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently improves the performance of both DPO and SimPO, without relying on external models or additional data.
Abstract:Thermal infrared target tracking is crucial in applications such as surveillance, autonomous driving, and military operations. In this paper, we propose a novel tracker, SMTT, which effectively addresses common challenges in thermal infrared imagery, such as noise, occlusion, and rapid target motion, by leveraging multi-task learning, joint sparse representation, and adaptive graph regularization. By reformulating the tracking task as a multi-task learning problem, the SMTT tracker independently optimizes the representation of each particle while dynamically capturing spatial and feature-level similarities using a weighted mixed-norm regularization strategy. To ensure real-time performance, we incorporate the Accelerated Proximal Gradient method for efficient optimization. Extensive experiments on benchmark datasets - including VOT-TIR, PTB-TIR, and LSOTB-TIR - demonstrate that SMTT achieves superior accuracy, robustness, and computational efficiency. These results highlight SMTT as a reliable and high-performance solution for thermal infrared target tracking in complex environments.
Abstract:Thermal infrared (TIR) target tracking methods often adopt the correlation filter (CF) framework due to its computational efficiency. However, the low resolution of TIR images, along with tracking interference, significantly limits the perfor-mance of TIR trackers. To address these challenges, we introduce STARS, a novel sparse learning-based CF tracker that incorporates spatio-temporal regulari-zation and super-resolution reconstruction. First, we apply adaptive sparse filter-ing and temporal domain filtering to extract key features of the target while reduc-ing interference from background clutter and noise. Next, we introduce an edge-preserving sparse regularization method to stabilize target features and prevent excessive blurring. This regularization integrates multiple terms and employs the alternating direction method of multipliers to optimize the solution. Finally, we propose a gradient-enhanced super-resolution method to extract fine-grained TIR target features and improve the resolution of TIR images, addressing performance degradation in tracking caused by low-resolution sequences. To the best of our knowledge, STARS is the first to integrate super-resolution methods within a sparse learning-based CF framework. Extensive experiments on the LSOTB-TIR, PTB-TIR, VOT-TIR2015, and VOT-TIR2017 benchmarks demonstrate that STARS outperforms state-of-the-art trackers in terms of robustness.
Abstract:Correlation filter (CF)-based trackers have gained significant attention for their computational efficiency in thermal infrared (TIR) target tracking. However, ex-isting methods struggle with challenges such as low-resolution imagery, occlu-sion, background clutter, and target deformation, which severely impact tracking performance. To overcome these limitations, we propose RAMCT, a region-adaptive sparse correlation filter tracker that integrates multi-channel feature opti-mization with an adaptive regularization strategy. Firstly, we refine the CF learn-ing process by introducing a spatially adaptive binary mask, which enforces spar-sity in the target region while dynamically suppressing background interference. Secondly, we introduce generalized singular value decomposition (GSVD) and propose a novel GSVD-based region-adaptive iterative Tikhonov regularization method. This enables flexible and robust optimization across multiple feature channels, improving resilience to occlusion and background variations. Thirdly, we propose an online optimization strategy with dynamic discrepancy-based pa-rameter adjustment. This mechanism facilitates real time adaptation to target and background variations, thereby improving tracking accuracy and robustness. Ex-tensive experiments on LSOTB-TIR, PTB-TIR, VOT-TIR2015, and VOT-TIR2017 benchmarks demonstrate that RAMCT outperforms other state-of-the-art trackers in terms of accuracy and robustness.
Abstract:Thermal infrared (TIR) images typically lack detailed features and have low contrast, making it challenging for conventional feature extraction models to capture discriminative target characteristics. As a result, trackers are often affected by interference from visually similar objects and are susceptible to tracking drift. To address these challenges, we propose a novel saliency-guided Siamese network tracker based on key fine-grained feature infor-mation. First, we introduce a fine-grained feature parallel learning convolu-tional block with a dual-stream architecture and convolutional kernels of varying sizes. This design captures essential global features from shallow layers, enhances feature diversity, and minimizes the loss of fine-grained in-formation typically encountered in residual connections. In addition, we propose a multi-layer fine-grained feature fusion module that uses bilinear matrix multiplication to effectively integrate features across both deep and shallow layers. Next, we introduce a Siamese residual refinement block that corrects saliency map prediction errors using residual learning. Combined with deep supervision, this mechanism progressively refines predictions, ap-plying supervision at each recursive step to ensure consistent improvements in accuracy. Finally, we present a saliency loss function to constrain the sali-ency predictions, directing the network to focus on highly discriminative fi-ne-grained features. Extensive experiment results demonstrate that the pro-posed tracker achieves the highest precision and success rates on the PTB-TIR and LSOTB-TIR benchmarks. It also achieves a top accuracy of 0.78 on the VOT-TIR 2015 benchmark and 0.75 on the VOT-TIR 2017 benchmark.
Abstract:To address the challenge of capturing highly discriminative features in ther-mal infrared (TIR) tracking, we propose a novel Siamese tracker based on cross-channel fine-grained feature learning and progressive fusion. First, we introduce a cross-channel fine-grained feature learning network that employs masks and suppression coefficients to suppress dominant target features, en-abling the tracker to capture more detailed and subtle information. The net-work employs a channel rearrangement mechanism to enhance efficient in-formation flow, coupled with channel equalization to reduce parameter count. Additionally, we incorporate layer-by-layer combination units for ef-fective feature extraction and fusion, thereby minimizing parameter redun-dancy and computational complexity. The network further employs feature redirection and channel shuffling strategies to better integrate fine-grained details. Second, we propose a specialized cross-channel fine-grained loss function designed to guide feature groups toward distinct discriminative re-gions of the target, thus improving overall target representation. This loss function includes an inter-channel loss term that promotes orthogonality be-tween channels, maximizing feature diversity and facilitating finer detail capture. Extensive experiments demonstrate that our proposed tracker achieves the highest accuracy, scoring 0.81 on the VOT-TIR 2015 and 0.78 on the VOT-TIR 2017 benchmark, while also outperforming other methods across all evaluation metrics on the LSOTB-TIR and PTB-TIR benchmarks.
Abstract:As large language models are increasingly utilized in real-world applications, guarantees of task-specific metrics are essential for their reliable deployment. Previous studies have introduced various criteria of conformal uncertainty grounded in split conformal prediction, which offer user-specified correctness coverage. However, existing frameworks often fail to identify uncertainty data outliers that violate the exchangeability assumption, leading to unbounded miscoverage rates and unactionable prediction sets. In this paper, we propose a novel approach termed Selective Conformal Uncertainty (SConU), which, for the first time, implements significance tests, by developing two conformal p-values that are instrumental in determining whether a given sample deviates from the uncertainty distribution of the calibration set at a specific manageable risk level. Our approach not only facilitates rigorous management of miscoverage rates across both single-domain and interdisciplinary contexts, but also enhances the efficiency of predictions. Furthermore, we comprehensively analyze the components of the conformal procedures, aiming to approximate conditional coverage, particularly in high-stakes question-answering tasks.
Abstract:Quantum machine learning (QML) is a discipline that seeks to transfer the advantages of quantum computing to data-driven tasks. However, many studies rely on toy datasets or heavy feature reduction, raising concerns about their scalability. Progress is further hindered by hardware limitations and the significant costs of encoding dense vector representations on quantum devices. To address these challenges, we propose an efficient approach called Hamiltonian classifier that circumvents the costs associated with data encoding by mapping inputs to a finite set of Pauli strings and computing predictions as their expectation values. In addition, we introduce two classifier variants with different scaling in terms of parameters and sample complexity. We evaluate our approach on text and image classification tasks, against well-established classical and quantum models. The Hamiltonian classifier delivers performance comparable to or better than these methods. Notably, our method achieves logarithmic complexity in both qubits and quantum gates, making it well-suited for large-scale, real-world applications. We make our implementation available on GitHub.