Abstract:In this paper, we address the channel estimation (CE) problem in stacked intelligent metasurface (SIM)-based multi-user (MU) millimeter-wave (mmWave) near-field communication systems. To counter the severe path loss and blockage in mmWave communication systems, many meta-atoms are typically integrated into each layer of the SIM. As a result, the base station (BS) has fewer radio frequency (RF) chains than meta-atoms per layer, which makes the estimation problem underdetermined. Additionally, increasing the number of meta-atoms in each layer expands the SIM's near-field region, so that most user equipment (UEs) lie within this region and the channel must be modeled precisely under the spherical wavefront assumption. To address these issues, we introduce a compressed sensing (CS)-based CE protocol to tackle the underdetermined problem. In contrast to the traditional CS-based estimation framework, we investigate a polar-domain channel representation to mitigate the severe energy spread effect that the classical angular-domain channel representation exhibits in near-field communication systems. Specifically, we design a novel polar-domain transform matrix for uniform planar arrays (UPAs), thereby transforming the CE problem into a sparse recovery task over the paths' support set and complex gains. To overcome the limitations of the sparse Bayesian learning (SBL) framework in handling high-dimensional dictionaries, we propose a low-complexity polar-domain SBL (LCPD-SBL) algorithm, which significantly reduces computational complexity without compromising estimation accuracy.
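To illustrate why the spherical wavefront assumption matters in the near field, the following minimal sketch compares a planar-wavefront (far-field) steering vector with a spherical-wavefront (near-field) one. It is not the paper's UPA or SIM model: it assumes a simple uniform linear array, and the function names, the 64-element array, and the 1 cm wavelength are all hypothetical choices made for illustration.

```python
import cmath
import math


def far_field_steering(n_elem, spacing, wavelength, theta):
    """Planar-wavefront steering vector of a uniform linear array (ULA)."""
    k = 2 * math.pi / wavelength
    center = (n_elem - 1) / 2
    return [
        cmath.exp(1j * k * (n - center) * spacing * math.sin(theta)) / math.sqrt(n_elem)
        for n in range(n_elem)
    ]


def near_field_steering(n_elem, spacing, wavelength, theta, r):
    """Spherical-wavefront steering vector for a source at distance r, angle theta."""
    k = 2 * math.pi / wavelength
    center = (n_elem - 1) / 2
    vec = []
    for n in range(n_elem):
        x = (n - center) * spacing  # element position along the array
        # Exact element-to-source distance (law of cosines).
        r_n = math.sqrt(r * r + x * x - 2 * r * x * math.sin(theta))
        vec.append(cmath.exp(-1j * k * (r_n - r)) / math.sqrt(n_elem))
    return vec


def correlation(a, b):
    """Magnitude of the inner product of two unit-norm vectors."""
    return abs(sum(u.conjugate() * v for u, v in zip(a, b)))


# Hypothetical parameters: 64 elements, 1 cm wavelength, half-wavelength spacing.
n_elem, wavelength = 64, 0.01
spacing = wavelength / 2
aperture = (n_elem - 1) * spacing
rayleigh = 2 * aperture ** 2 / wavelength  # classical near-field boundary

a_far = far_field_steering(n_elem, spacing, wavelength, 0.0)
corr_far = correlation(a_far, near_field_steering(n_elem, spacing, wavelength, 0.0, 100 * rayleigh))
corr_near = correlation(a_far, near_field_steering(n_elem, spacing, wavelength, 0.0, 0.05 * rayleigh))
print(f"Rayleigh distance: {rayleigh:.3f} m")
print(f"match far beyond it: {corr_far:.4f}, match well inside it: {corr_near:.4f}")
```

Far beyond the Rayleigh distance the two vectors are nearly identical, while well inside it the planar-wavefront vector matches the true response poorly. This mismatch is what spreads a single near-field path over many angular-domain atoms.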
Abstract:Large language models perform well on static medical examinations, yet clinical diagnosis often requires iterative evidence gathering under uncertainty. Building on prior interactive evaluation efforts, we introduce an OSCE-inspired standardized patient simulator and a controlled, reproducible benchmark for active diagnostic inquiry. Across 468 cases and 15 models in our protocol, we observe that multi-turn evidence seeking reduces diagnostic accuracy by 12.75% and lowers supporting-evidence quality by 24.36% relative to full-context evaluation; error analyses associate these drops with premature diagnostic closure and inefficient questioning. Together, these results suggest that static full-context benchmarks may overestimate performance in interactive evidence-seeking settings, motivating complementary interactive assessment for safer clinical decision support.
Abstract:Trajectory world models play a crucial role in robotic dynamics learning, planning, and control. While recent works have explored trajectory world models for diverse robotic systems, they struggle to scale to a large number of distinct system dynamics and overlook domain knowledge of physical structures. To address these limitations, we introduce WestWorld, a knoWledge-Encoded Scalable Trajectory World model for diverse robotic systems. To tackle the scalability challenge, we propose a novel system-aware Mixture-of-Experts (Sys-MoE) that dynamically combines and routes specialized experts for different robotic systems via a learnable system embedding. To further enhance zero-shot generalization, we incorporate domain knowledge of robot physical structures by introducing a structural embedding that aligns trajectory representations with morphological information. After pretraining on 89 complex environments spanning diverse morphologies across both simulation and real-world settings, WestWorld achieves significant improvements over competitive baselines in zero- and few-shot trajectory prediction. Additionally, it shows strong scalability across a wide range of robotic environments and significantly improves performance on downstream model-based control for different robots. Finally, we deploy our model on a real-world Unitree Go1, where it demonstrates stable locomotion performance (see our demo on the website: https://westworldrobot.github.io/). The code will be available upon publication.
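As a rough illustration of routing experts through a system embedding, the sketch below implements a generic dense mixture-of-experts layer whose gate is driven only by a per-system embedding vector. It is not WestWorld's Sys-MoE: the softmax gate, the linear experts, and every name and number here are assumptions made for illustration.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def system_gated_moe(x, sys_emb, gate_W, experts):
    """Mix linear experts with weights computed from the system embedding only."""
    gates = softmax(matvec(gate_W, sys_emb))   # one weight per expert
    outs = [matvec(E, x) for E in experts]     # each expert's prediction
    mixed = [sum(g * o[i] for g, o in zip(gates, outs)) for i in range(len(outs[0]))]
    return mixed, gates


# Hypothetical toy setup: two 2x2 linear experts, a 3-dimensional system embedding.
experts = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity dynamics
    [[0.0, 1.0], [1.0, 0.0]],  # expert 1: swaps the two state entries
]
gate_W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
sys_emb = [5.0, 0.0, 0.0]      # embedding that strongly prefers expert 0
state = [1.0, 0.0]

mixed, gates = system_gated_moe(state, sys_emb, gate_W, experts)
print("gates:", gates)
print("mixed output:", mixed)
```

In a trained model the embedding and gate weights would be learned jointly with the experts, so that systems with similar dynamics share experts.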
Abstract:Stacked intelligent metasurfaces (SIMs) have recently emerged as a key enabler for realizing electromagnetic wave-domain signal processing in next-generation wireless networks. However, practical SIM implementations often suffer from noticeable mismatches between theoretical models and measured responses due to fabrication and assembly imperfections. This article systematically investigates the problem of interlayer error calibration in SIMs. We first classify representative modeling and hardware-induced imperfections. Then, we outline the major challenges in SIM calibration and develop a general framework that integrates a calibration protocol with the relevant solution strategies. Moreover, we investigate the effectiveness of a multi-stage calibration approach in mitigating geometric deviations and improving the alignment between the calibrated and practical propagation coefficients. Finally, we elaborate on key research opportunities and practical challenges toward realizing physically consistent and hardware-compliant SIM implementations.
Abstract:Trajectory prediction for traffic agents is critical for safe autonomous driving. However, achieving effective zero-shot generalization in previously unseen domains remains a significant challenge. Motivated by the consistent nature of kinematics across diverse domains, we aim to incorporate domain-invariant knowledge to enhance zero-shot trajectory prediction capabilities. The key challenges include: 1) effectively extracting domain-invariant scene representations, and 2) integrating invariant features with kinematic models to enable generalized predictions. To address these challenges, we propose a novel generalizable Physics-guided Causal Model (PCM), which comprises two core components: a Disentangled Scene Encoder, which adopts intervention-based disentanglement to extract domain-invariant features from scenes, and a CausalODE Decoder, which employs a causal attention mechanism to effectively integrate kinematic models with meaningful contextual information. Extensive experiments on real-world autonomous driving datasets demonstrate our method's superior zero-shot generalization performance in unseen cities, significantly outperforming competitive baselines. The source code is released at https://github.com/ZY-Zong/Physics-guided-Causal-Model.
Abstract:Two-channel modulo analog-to-digital converters (ADCs) enable high-dynamic-range signal sensing at the Nyquist rate per channel, but existing designs quantise both channel outputs independently, incurring redundant bitrate costs. This paper proposes a bit-efficient quantisation scheme that exploits the integer-valued structure of inter-channel differences, transmitting one quantised channel output together with a compact difference index. We prove that this approach requires an overhead of only 1-2 bits per signal sample relative to conventional ADCs, despite operating with a much smaller per-channel dynamic range. Simulations confirm the theoretical error bounds and bitrate analysis, and hardware experiments demonstrate substantial bitrate savings compared with existing modulo sampling schemes while maintaining comparable reconstruction accuracy. These results highlight a practical path towards high-resolution, bandwidth-efficient modulo ADCs for bitrate-constrained systems.
Abstract:Magnetic Resonance Imaging (MRI) provides detailed tissue information, but its clinical application is limited by long acquisition time, high cost, and restricted resolution. Image translation has recently gained attention as a strategy to address these limitations. Although Pix2Pix has been widely applied in medical image translation, its potential has not been fully explored. In this study, we propose an enhanced Pix2Pix framework that integrates Squeeze-and-Excitation Residual Networks (SEResNet) and U-Net++ to improve image generation quality and structural fidelity. SEResNet strengthens critical feature representation through channel attention, while U-Net++ enhances multi-scale feature fusion. A simplified PatchGAN discriminator further stabilizes training and refines local anatomical realism. Experimental results demonstrate that under few-shot conditions with fewer than 500 images, the proposed method achieves consistent structural fidelity and superior image quality across multiple intra-modality MRI translation tasks, showing strong generalization ability. These results suggest an effective extension of Pix2Pix for medical image translation.
Abstract:Research in Machine Learning (ML) and AI evolves rapidly. Information Extraction (IE) from scientific publications makes it possible to identify research concepts and resources on a large scale and is therefore a pathway to improving the understanding and reproducibility of ML-related research. To extract and connect fine-grained information in ML-related research, e.g. method training and data usage, we introduce GSAP-ERE. It is a manually curated fine-grained dataset with 10 entity types and 18 semantically categorized relation types, containing mentions of 63K entities and 35K relations from the full text of 100 ML publications. We show that our dataset enables fine-tuned models to automatically extract information relevant for downstream tasks ranging from knowledge graph (KG) construction to monitoring the computational reproducibility of AI research at scale. Additionally, we use our dataset as a test suite to explore prompting strategies for IE using Large Language Models (LLMs). We observe that state-of-the-art LLM prompting methods fall well short of our best fine-tuned baseline model (NER: 80.6%, RE: 54.0% for the fine-tuned model vs. NER: 44.4%, RE: 10.1% for the LLM). This disparity in performance between supervised models and unsupervised usage of LLMs suggests that datasets like GSAP-ERE are needed to advance research in the domain of scholarly information extraction.
Abstract:Conventional analog-to-digital converters (ADCs) clip when signals exceed their input range. Modulo (unlimited) sampling overcomes this limitation by folding the signal before digitization, but existing recovery methods are either computationally intensive or constrained by loose oversampling bounds that demand high sampling rates. In addition, none account for sampling jitter, which is unavoidable in practice. This paper revisits difference-based recovery and establishes new theoretical and practical guarantees. In the noiseless setting, we prove that arbitrarily high difference order reduces the sufficient oversampling factor from $2\pi e$ to $\pi$, substantially tightening classical bounds. For fixed order $N$, we derive a noise-aware sampling condition that guarantees stable recovery. For second-order difference-based recovery ($N=2$), we further extend the analysis to non-uniform sampling, proving robustness under bounded jitter. An FPGA-based hardware prototype demonstrates reliable reconstruction with amplitude expansion up to $\rho = 108$, confirming the feasibility of high-performance unlimited sensing with a simple and robust recovery pipeline.
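The folding and difference-based unfolding idea can be sketched in its simplest first-order form ($N=1$). This is not the paper's algorithm, which covers higher difference orders, noise, and jitter. The sketch assumes a noiseless signal sampled densely enough that consecutive samples differ by less than the threshold, and a first sample that lies inside the ADC range. All names and parameter values are hypothetical.

```python
import math


def fold(x, lam):
    """Centered modulo: map x into the ADC range [-lam, lam)."""
    return (x + lam) % (2 * lam) - lam


def unfold_first_order(y, lam):
    """Recover x from folded samples y.

    Assumes |x[n+1] - x[n]| < lam and |x[0]| < lam, so that folding the
    first-order difference of y returns the true difference of x.
    """
    x = [y[0]]
    for n in range(1, len(y)):
        dx = fold(y[n] - y[n - 1], lam)  # true increment of the unfolded signal
        x.append(x[-1] + dx)             # integrate the increments
    return x


lam = 1.0    # ADC threshold (hypothetical)
amp = 8.0    # signal amplitude, eight times the ADC range
rate = 200   # samples per period, i.e. heavily oversampled

x_true = [amp * math.sin(2 * math.pi * n / rate) for n in range(2 * rate)]
y = [fold(v, lam) for v in x_true]
x_hat = unfold_first_order(y, lam)

max_err = max(abs(a - b) for a, b in zip(x_true, x_hat))
print(f"largest folded sample magnitude: {max(abs(v) for v in y):.3f}")
print(f"max reconstruction error: {max_err:.2e}")
```

The oversampling requirement is visible in the assumption on consecutive differences: the larger the amplitude relative to the threshold, the denser the sampling must be for this first-order scheme to succeed. This is the kind of condition that higher-order differences aim to relax.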
Abstract:Semantic communication (SemCom) powered by generative artificial intelligence enables highly efficient and reliable information transmission. However, it still necessitates the transmission of substantial amounts of data when dealing with complex scene information. In contrast, the stacked intelligent metasurface (SIM), leveraging wave-domain computing, provides a cost-effective solution for directly imaging complex scenes. Building on this concept, we propose an innovative SIM-aided multi-modal SemCom system. Specifically, an SIM is positioned in front of the transmit antenna for transmitting visual semantic information of complex scenes via imaging on the uniform planar array at the receiver. Furthermore, the simple scene description that contains textual semantic information is transmitted via amplitude-phase modulation over electromagnetic waves. To simultaneously transmit multi-modal information, we optimize the amplitude and phase of meta-atoms in the SIM using a customized gradient descent algorithm. The optimization aims to gradually minimize the mean squared error between the normalized energy distribution on the receiver array and the desired pattern corresponding to the visual semantic information. By combining the textual and visual semantic information, a conditional generative adversarial network is used to recover the complex scene accurately. Extensive numerical results verify the effectiveness of the proposed multi-modal SemCom system in reducing bandwidth overhead as well as the capability of the SIM for imaging the complex scene.