Abstract: Understanding the physical world, which is governed by laws of motion, spatial relations, and causality, poses a fundamental challenge for multimodal large language models (MLLMs). While recent models such as OpenAI o3 and GPT-4o demonstrate impressive perceptual and reasoning capabilities, our investigation reveals that they struggle profoundly with visual physical reasoning, failing to grasp basic physical laws, spatial interactions, and causal effects in complex scenes. More importantly, they often fail to follow coherent reasoning chains grounded in visual evidence, especially when multiple steps are needed to arrive at the correct answer. To rigorously evaluate this capability, we introduce MVPBench, a curated benchmark that assesses visual physical reasoning through the lens of visual chain-of-thought (CoT). Each example features interleaved multi-image inputs and demands not only the correct final answer but also a coherent, step-by-step reasoning path grounded in evolving visual cues. This setup mirrors how humans reason through real-world physical processes over time. To ensure fine-grained evaluation, we introduce a graph-based CoT consistency metric that verifies whether the model's reasoning path adheres to valid physical logic. Additionally, we minimize shortcut exploitation from text priors, encouraging models to rely on visual understanding. Experimental results reveal a concerning trend: even cutting-edge MLLMs exhibit poor visual reasoning accuracy and weak image-text alignment in physical domains. Surprisingly, RL-based post-training alignment, commonly believed to improve visual reasoning performance, often harms spatial reasoning, suggesting a need to rethink current fine-tuning practices.
Abstract: Reconfigurable intelligent surfaces (RISs) enhance wireless systems by reshaping propagation environments. However, dynamic metasurfaces (MSs) with numerous phase-shift elements incur undesired control and hardware costs. In contrast, static MSs (SMSs), configured with static phase shifts pre-designed for specific communication demands, offer a cost-effective alternative by eliminating element-wise tuning. Nevertheless, SMSs typically support a single beam pattern with limited flexibility. In this paper, we propose a novel Movable Intelligent Surface (MIS) technology that enables dynamic beamforming while maintaining static phase shifts. Specifically, we design a MIS architecture comprising two closely stacked transmissive MSs: a larger fixed-position MS 1 and a smaller movable MS 2. By differentially shifting MS 2's position relative to MS 1, the MIS synthesizes distinct beam patterns. We then model the interaction between MS 2 and MS 1 using binary selection matrices and padding vectors, and formulate a new optimization problem that jointly designs the MIS phase shifts and selects shifting positions to maximize the worst-case signal-to-noise ratio. This position selection, which is equivalent to beam pattern scheduling, offers a new degree of freedom for RIS-aided systems. To solve this intractable problem, we develop an efficient algorithm that handles the unit-modulus and binary constraints and employs manifold optimization methods. Finally, extensive validation results are provided. We implement a MIS prototype and perform proof-of-concept experiments, demonstrating the MIS's ability to synthesize desired beam patterns that achieve one-dimensional beam steering. Numerical results show that by introducing an MS 2 with only a few elements, the MIS offers beamforming flexibility and significantly improved performance. We also draw insights into the optimal MIS configuration and element allocation strategy.
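The stacking idea in this abstract can be illustrated with a minimal one-dimensional sketch. Everything below is an assumption made for illustration and is not the paper's model: the function names `composite_phases` and `beam_gain`, the half-wavelength spacing, the zero-padding of elements not covered by MS 2, and the quadratic phase profiles. With these quadratic profiles, the overlapped region acquires a linear phase whose slope grows with the shift, which is one simple way a lateral shift can steer a beam.

```python
import cmath
import math


def composite_phases(theta1, theta2, shift):
    """Phase seen through the stack when MS 2 sits `shift` elements along MS 1."""
    n1, n2 = len(theta1), len(theta2)
    if not 0 <= shift <= n1 - n2:
        raise ValueError("MS 2 must stay within the aperture of MS 1")
    # Assumption: zero-pad MS 2 to the size of MS 1, so uncovered
    # elements of MS 1 receive no additional phase.
    padded = [0.0] * shift + list(theta2) + [0.0] * (n1 - n2 - shift)
    return [p1 + p2 for p1, p2 in zip(theta1, padded)]


def beam_gain(phases, angle_rad, spacing=0.5):
    """Normalized far-field power of a uniform linear aperture.

    `spacing` is the element spacing in wavelengths (assumed 0.5).
    """
    array_factor = sum(
        cmath.exp(1j * (phase + 2 * math.pi * spacing * n * math.sin(angle_rad)))
        for n, phase in enumerate(phases)
    )
    return abs(array_factor) ** 2 / len(phases)


# Hypothetical static phase profiles: opposite-sign quadratics.
a = 0.05
theta1 = [a * n**2 for n in range(16)]   # larger, fixed MS 1
theta2 = [-a * m**2 for m in range(8)]   # smaller, movable MS 2

# One composite phase profile (hence one beam pattern) per shift position.
patterns = {s: composite_phases(theta1, theta2, s) for s in range(9)}
```

Inside the overlap, the composite phase is `a*n**2 - a*(n - s)**2 = 2*a*s*n - a*s**2`, so adjacent elements differ by the constant `2*a*s`. Choosing among the entries of `patterns` then plays the role of the position selection described in the abstract.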
Abstract: Movable antennas (MAs), which can be swiftly repositioned within a defined region, offer a promising solution to the limitations of fixed-position antennas (FPAs) in adapting to spatial variations in wireless channels, thereby improving channel conditions and communication between transceivers. However, frequent MA position adjustments based on instantaneous channel state information (CSI) incur high operational complexity, making real-time CSI acquisition impractical, especially in fast-fading channels. To address these challenges, we propose a two-timescale transmission framework for MA-enabled multiuser multiple-input multiple-output (MU-MIMO) systems. On the large timescale, statistical CSI is exploited to optimize MA positions for long-term ergodic performance, whereas on the small timescale, beamforming vectors are designed using instantaneous CSI to handle short-term channel fluctuations. Within this new framework, we analyze the ergodic sum rate and develop efficient MA position optimization algorithms for both maximum-ratio-transmission (MRT) and zero-forcing (ZF) beamforming schemes. These algorithms employ alternating optimization (AO), successive convex approximation (SCA), and majorization-minimization (MM) techniques, iteratively optimizing antenna positions and refining surrogate functions that approximate the ergodic sum rate. Numerical results show significant ergodic sum rate gains with the proposed two-timescale MA design over conventional FPA systems, particularly under moderate to strong line-of-sight (LoS) conditions. Notably, MA with ZF beamforming consistently outperforms MA with MRT, highlighting the synergy between beamforming and MAs for superior interference management in environments with moderate Rician factors and high user density, while MA with MRT can offer a simplified alternative to complex beamforming designs in strong LoS conditions.
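The two-timescale split described in this abstract can be written compactly. The notation below is illustrative and not taken from the paper: $\mathbf{x}$ collects the MA positions within a feasible region $\mathcal{C}$, $\mathbf{H}(\mathbf{x})$ is the instantaneous channel for those positions, $\mathbf{W}$ is the beamforming matrix (MRT or ZF) computed from that channel, and $\gamma_k$ is the SINR of user $k$ among $K$ users. The transmit power constraint is omitted for brevity.

```latex
\max_{\mathbf{x} \in \mathcal{C}} \;
\mathbb{E}_{\mathbf{H}} \left[
  \sum_{k=1}^{K} \log_2 \Bigl( 1 + \gamma_k \bigl( \mathbf{x},\, \mathbf{W}(\mathbf{H}(\mathbf{x})) \bigr) \Bigr)
\right]
```

Under this reading, the outer maximization over $\mathbf{x}$ depends only on the channel statistics through the expectation (large timescale), while $\mathbf{W}$ is recomputed for every channel realization (small timescale).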