Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, China, School of Computing, University of Portsmouth, Portsmouth, United Kingdom
Abstract:Robust GNSS positioning in urban environments is still plagued by multipath effects, particularly due to the complex signal propagation induced by ubiquitous surfaces with varied radio frequency reflectivities. Current 3D Mapping Aided (3DMA) GNSS techniques show great potential in mitigating multipath but face a critical trade-off between computational efficiency and modeling accuracy. Most approaches rely on offline 3D maps that are outdated or oversimplified. Real-time LiDAR-based reconstruction offers high accuracy but is problematic in low laser reflectivity conditions. Camera-based 3DMA is a good candidate to balance accuracy and efficiency, but current methods suffer from extremely low reconstruction speed, far short of what real-time multipath-mitigated navigation requires. This paper proposes an accelerated framework incorporating camera multi-view stereo (MVS) reconstruction and ray tracing. By hypothesizing on surface textures, an orthogonal visual feature fusion framework is proposed, which robustly addresses both texture-rich and texture-poor surfaces, alleviating the reflectivity challenges in visual reconstruction. A polygonal surface modeling scheme is further integrated to accurately delineate complex building boundaries, enhancing the reconstruction granularity. To avoid excessively accurate reconstruction, reprojected point cloud multi-plane fitting and two complexity control strategies are proposed, thus improving multipath estimation speed. Experiments were conducted in Lujiazui, Shanghai, a typical multipath-prone district. The results show that the method achieves an average reconstruction accuracy of 2.4 meters in dense urban environments featuring glass curtain wall structures, a traditionally tough case for reconstruction, and achieves a ray-tracing-based multipath correction rate of 30 image frames per second, 10 times faster than the contemporary benchmarks.
Abstract:Process-based models (PBMs) and deep learning (DL) are two key approaches in agricultural modelling, each offering distinct advantages and limitations. PBMs provide mechanistic insights based on physical and biological principles, ensuring interpretability and scientific rigour. However, they often struggle with scalability, parameterisation, and adaptation to heterogeneous environments. In contrast, DL models excel at capturing complex, nonlinear patterns from large datasets but may suffer from limited interpretability, high computational demands, and overfitting in data-scarce scenarios. This study presents a systematic review of PBMs, DL models, and hybrid PBM-DL frameworks, highlighting their applications in agricultural and environmental modelling. We classify hybrid PBM-DL approaches into DL-informed PBMs, where neural networks refine process-based models, and PBM-informed DL, where physical constraints guide deep learning predictions. Additionally, we conduct a case study on crop dry biomass prediction, comparing hybrid models against standalone PBMs and DL models under varying data quality, sample sizes, and spatial conditions. The results demonstrate that hybrid models consistently outperform traditional PBMs and DL models, offering greater robustness to noisy data and improved generalisation across unseen locations. Finally, we discuss key challenges, including model interpretability, scalability, and data requirements, alongside actionable recommendations for advancing hybrid modelling in agriculture. By integrating domain knowledge with AI-driven approaches, this study contributes to the development of scalable, interpretable, and reproducible agricultural models that support data-driven decision-making for sustainable agriculture.
Abstract:Ethical concerns surrounding copyright protection and inappropriate content generation pose challenges for the practical implementation of diffusion models. One effective solution involves watermarking the generated images. Existing methods primarily focus on ensuring that watermark embedding does not degrade the model performance. However, they often overlook critical challenges in real-world deployment scenarios, such as the complexity of watermark key management, user-defined generation parameters, and the difficulty of verification by arbitrary third parties. To address this issue, we propose Gaussian Shading++, a diffusion model watermarking method tailored for real-world deployment. We propose a double-channel design that leverages pseudorandom error-correcting codes to encode the random seed required for watermark pseudorandomization, achieving performance-lossless watermarking under a fixed watermark key and overcoming key management challenges. Additionally, we model the distortions introduced during generation and inversion as an additive white Gaussian noise channel and employ a novel soft decision decoding strategy during extraction, ensuring strong robustness even when generation parameters vary. To enable third-party verification, we incorporate public key signatures, which provide a certain level of resistance against forgery attacks even when model inversion capabilities are fully disclosed. Extensive experiments demonstrate that Gaussian Shading++ not only maintains performance losslessness but also outperforms existing methods in terms of robustness, making it a more practical solution for real-world deployment.
Abstract:While chain-of-thought (CoT) reasoning improves the performance of large language models (LLMs) in complex tasks, it still faces two main challenges: the low reliability of relying solely on LLMs to generate reasoning chains and the interference of natural language reasoning chains on the inference logic of LLMs. To address these issues, we propose CoT-RAG, a novel reasoning framework with three key designs: (i) Knowledge Graph-driven CoT Generation, featuring knowledge graphs to modulate reasoning chain generation of LLMs, thereby enhancing reasoning credibility; (ii) Learnable Knowledge Case-aware RAG, which incorporates retrieval-augmented generation (RAG) into knowledge graphs to retrieve relevant sub-cases and sub-descriptions, providing LLMs with learnable information; (iii) Pseudo-Program Prompting Execution, which encourages LLMs to execute reasoning tasks in pseudo-programs with greater logical rigor. We conduct a comprehensive evaluation on nine public datasets, covering three reasoning problems. Compared with state-of-the-art methods, CoT-RAG exhibits a significant accuracy improvement, ranging from 4.0% to 23.0%. Furthermore, when tested on four domain-specific datasets, CoT-RAG shows remarkable accuracy and efficient execution, highlighting its strong practical applicability and scalability.
Abstract:Instruction tuning has enabled large language models (LLMs) to achieve remarkable performance, but its success heavily depends on the availability of large-scale, high-quality instruction-response pairs. However, current methods for scaling up data generation often overlook a crucial aspect: the alignment between instructions and responses. We hypothesize that high-quality instruction-response pairs are not defined by the individual quality of each component, but by the extent of their alignment with each other. To address this, we propose a Mutual Alignment Framework (MAIN) that ensures coherence between the instruction and response through mutual constraints. Experiments demonstrate that models such as LLaMA and Mistral, fine-tuned within this framework, outperform traditional methods across multiple benchmarks. This approach underscores the critical role of instruction-response alignment in enabling scalable and high-quality instruction tuning for LLMs.
Abstract:Achieving reliable and safe autonomous driving in off-road environments requires accurate and efficient terrain traversability analysis. However, this task faces several challenges, including the scarcity of large-scale datasets tailored for off-road scenarios, the high cost and potential errors of manual annotation, the stringent real-time requirements of motion planning, and the limited computational power of onboard units. To address these challenges, this paper proposes a novel traversability learning method that leverages self-supervised learning, eliminating the need for manual annotation. For the first time, a Bird's-Eye View (BEV) representation is used as input, reducing computational burden and improving adaptability to downstream motion planning. During vehicle operation, the proposed method conducts online analysis of traversed regions and dynamically updates prototypes to adaptively assess the traversability of the current environment, effectively handling dynamic scene changes. We evaluate our approach against state-of-the-art benchmarks on both public datasets and our own dataset, covering diverse seasons and geographical locations. Experimental results demonstrate that our method significantly outperforms recent approaches. Additionally, real-world vehicle experiments show that our method operates at 10 Hz, meeting real-time requirements, while a 5.5 km autonomous driving experiment further validates the compatibility of the generated traversability cost maps with downstream motion planning.
Abstract:Multi-objective preference alignment in language models often encounters a challenging trade-off: optimizing for one human preference (e.g., helpfulness) frequently compromises others (e.g., harmlessness) due to the inherent conflicts between competing objectives. While prior work mainly focuses on algorithmic solutions, we explore a novel data-driven approach to uncover the types of data that can effectively mitigate these conflicts. Specifically, we propose the concept of Reward Consistency (RC), which identifies samples that align with multiple preference objectives, thereby reducing conflicts during training. Through gradient-based analysis, we demonstrate that RC-compliant samples inherently constrain performance degradation during multi-objective optimization. Building on these insights, we further develop Reward Consistency Sampling, a framework that automatically constructs preference datasets that effectively mitigate conflicts during multi-objective alignment. Our generated data achieves an average improvement of 13.37% in both the harmless rate and helpfulness win rate when optimizing harmlessness and helpfulness, and can consistently resolve conflicts in varying multi-objective scenarios.
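The abstract above describes Reward Consistency as selecting samples that align with multiple preference objectives. A minimal sketch of one plausible reading follows: keep a preference pair only if every objective's reward scores the chosen response above the rejected one. The abstract does not give the exact definition, so the types, function names, and toy reward functions here are illustrative assumptions, not the paper's method.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a preference pair is (prompt, chosen, rejected);
# a reward function maps (prompt, response) to a scalar score.
Pair = Tuple[str, str, str]
RewardFn = Callable[[str, str], float]


def is_reward_consistent(pair: Pair, rewards: Dict[str, RewardFn]) -> bool:
    """True if every objective scores the chosen response above the rejected one."""
    prompt, chosen, rejected = pair
    return all(r(prompt, chosen) > r(prompt, rejected) for r in rewards.values())


def filter_consistent(pairs: List[Pair], rewards: Dict[str, RewardFn]) -> List[Pair]:
    """Keep only the pairs on which all objectives agree."""
    return [p for p in pairs if is_reward_consistent(p, rewards)]


# Toy stand-ins for learned reward models (assumptions for illustration only).
rewards: Dict[str, RewardFn] = {
    "helpfulness": lambda prompt, response: float(len(response)),
    "harmlessness": lambda prompt, response: -float(response.count("bad")),
}

pairs: List[Pair] = [
    ("q1", "a long safe answer", "bad"),    # both objectives prefer the chosen response
    ("q2", "a long but bad answer", "ok"),  # helpfulness and harmlessness disagree
]

kept = filter_consistent(pairs, rewards)
```

In this toy run only the first pair survives the filter, because the second pair's chosen response scores lower on the harmlessness stand-in than its rejected response.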
Abstract:This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. A robust participation saw 244 registered entrants, with 43 teams submitting valid entries. This report meticulously analyzes these methods and results, emphasizing groundbreaking advancements in state-of-the-art single-image ESR techniques. The analysis highlights innovative approaches and establishes benchmarks for future research in the field.
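The PSNR thresholds quoted above follow the standard definition, PSNR = 10 log10(MAX^2 / MSE). The sketch below computes it for flat sequences of pixel values. It assumes 8-bit images (MAX = 255) and does not reproduce the challenge's actual evaluation protocol (color channel, border cropping), which the abstract does not state; the function name is illustrative.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], estimate: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def meets_threshold(reference: Sequence[float], estimate: Sequence[float],
                    threshold_db: float = 26.90) -> bool:
    """Check a reconstruction against a PSNR floor such as the 26.90 dB quoted above."""
    return psnr(reference, estimate) >= threshold_db
```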
Abstract:Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is equally good at retrieving relevant images given a piece of text. We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models across the broadest spectrum to date. MIEB spans 38 languages across 130 individual tasks, which we group into 8 high-level categories. We evaluate 50 models on our benchmark, finding that no single method dominates across all task categories. We reveal hidden capabilities in advanced vision models, such as their accurate visual representation of texts, as well as their still limited capabilities in interleaved encodings and in matching images and texts in the presence of confounders. We also show that the performance of vision encoders on MIEB correlates highly with their performance when used in multimodal large language models. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.
Abstract:Orthogonal time frequency space (OTFS) modulation is widely acknowledged as a prospective waveform for future wireless communication networks. To provide insights for practical system design, this paper analyzes the outage probability of OTFS modulation with finite blocklength. To begin with, we present the system model and formulate the analysis of outage probability for OTFS with finite blocklength as an equivalent problem of calculating the outage probability with finite blocklength over parallel additive white Gaussian noise (AWGN) channels. Subsequently, we apply the equivalent noise approach to derive a lower bound on the outage probability of OTFS with finite blocklength under both average power allocation and water-filling power allocation strategies. Finally, the lower bounds of the outage probability are determined using the Monte-Carlo method for the two power allocation strategies. The impact of the number of resolvable paths and coding rates on the outage probability is analyzed, and the simulation results are compared with the theoretical lower bounds.
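The two power allocation strategies named above can be illustrated on a set of parallel AWGN subchannels. The sketch below is the textbook water-filling rule, p_i = max(0, mu - 1/g_i) with the water level mu chosen so that the powers sum to the budget, alongside equal (average) allocation. It does not reproduce the paper's finite-blocklength outage analysis or its equivalent noise approach; the function names and the example gains are assumptions for illustration.

```python
import math
from typing import List


def equal_power(gains: List[float], total_power: float) -> List[float]:
    """Average allocation: split the budget evenly across subchannels."""
    if not gains:
        raise ValueError("gains must be non-empty")
    return [total_power / len(gains)] * len(gains)


def water_filling(gains: List[float], total_power: float) -> List[float]:
    """Water-filling over parallel AWGN subchannels.

    gains[i] is the channel-gain-to-noise ratio of subchannel i (must be > 0).
    """
    if not gains or any(g <= 0 for g in gains):
        raise ValueError("gains must be non-empty and strictly positive")
    # Sort subchannels from strongest to weakest; 1/g is the "floor" height.
    order = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
    floors = [1.0 / gains[i] for i in order]
    # Shrink the active set until the water level sits above its highest floor.
    k = len(floors)
    mu = 0.0
    while k > 0:
        mu = (total_power + sum(floors[:k])) / k
        if mu > floors[k - 1]:
            break
        k -= 1
    powers = [0.0] * len(gains)
    for rank, i in enumerate(order[:k]):
        powers[i] = mu - floors[rank]
    return powers


def sum_rate(gains: List[float], powers: List[float]) -> float:
    """Sum of Shannon rates (bits per channel use) across subchannels."""
    return sum(math.log2(1.0 + g * p) for g, p in zip(gains, powers))


example_gains = [2.0, 1.0, 0.1]  # hypothetical subchannel gains
wf = water_filling(example_gains, 1.0)
eq = equal_power(example_gains, 1.0)
```

With these example gains, water-filling leaves the weakest subchannel unused and achieves a higher sum rate than equal allocation, which is the asymptotic (infinite blocklength) intuition behind comparing the two strategies.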