Abstract:Lightweight vision networks have witnessed remarkable progress in recent years, yet achieving a satisfactory balance among parameter scale, computational overhead, and task performance remains difficult. Although many existing lightweight models manage to reduce computation considerably, they often do so at the expense of a substantial increase in parameter count (e.g., LSNet, MobileMamba), which still poses obstacles for deployment on resource-limited devices. In parallel, some studies attempt to draw inspiration from human visual perception, but their modeling tends to oversimplify the visual process, making it hard to reflect how perception truly operates. Revisiting the cooperative mechanism of the human visual system, we propose GPM (Global-to-Parallel Multi-scale Encoding). GPM first employs a Global Insight Generator (GIG) to extract holistic cues, and subsequently processes features of different scales through parallel branches: LSAE emphasizes mid-/large-scale semantic relations, while IRB (Inverted Residual Block) preserves fine-grained texture information, jointly enabling coherent representation of global and local features. As such, GPM conforms to two characteristic behaviors of human vision perceiving the whole before focusing on details, and maintaining broad contextual awareness even during local attention. Built upon GPM, we further develop the lightweight H-GPE network. Experiments on image classification, object detection, and semantic segmentation show that H-GPE achieves strong performance while maintaining a balanced footprint in both FLOPs and parameters, delivering a more favorable accuracy-efficiency trade-off compared with recent state-of-the-art lightweight models.
Abstract:Large language models (LLMs) now support contexts of up to 1M tokens, but their effectiveness on complex long-context tasks remains unclear. In this paper, we study multi-document legal case summarization, where a single case often spans many documents totaling 100K-500K tokens. We introduce Gavel-Ref, a reference-based evaluation framework with multi-value checklist evaluation over 26 items, as well as residual fact and writing-style evaluations. Using Gavel-Ref, we go beyond the single aggregate scores reported in prior work and systematically evaluate 12 frontier LLMs on 100 legal cases ranging from 32K to 512K tokens, primarily from 2025. Our results show that even the strongest model, Gemini 2.5 Pro, achieves only around 50 of $S_{\text{Gavel-Ref}}$, highlighting the difficulty of the task. Models perform well on simple checklist items (e.g., filing date) but struggle on multi-value or rare ones such as settlements and monitor reports. As LLMs continue to improve and may surpass human-written summaries -- making human references less reliable -- we develop Gavel-Agent, an efficient and autonomous agent scaffold that equips LLMs with six tools to navigate and extract checklists directly from case documents. With Qwen3, Gavel-Agent reduces token usage by 36% while resulting in only a 7% drop in $S_{\text{checklist}}$ compared to end-to-end extraction with GPT-4.1.
Abstract:Low-altitude wireless networks (LAWNs) are expected to play a central role in future 6G infrastructures, yet uplink transmissions of uncrewed aerial vehicles (UAVs) remain vulnerable to eavesdropping due to their limited transmit power, constrained antenna resources, and highly exposed air-ground propagation conditions. To address this fundamental bottleneck, we propose a flexible-duplex cell-free (CF) architecture in which each distributed access point (AP) can dynamically operate either as a receive AP for UAV uplink collection or as a transmit AP that generates cooperative artificial noise (AN) for secrecy enhancement. Such AP-level duplex flexibility introduces an additional spatial degree of freedom that enables distributed and adaptive protection against wiretapping in LAWNs. Building upon this architecture, we formulate a max-min secrecy-rate problem that jointly optimizes AP mode selection, receive combining, and AN covariance design. This tightly coupled and nonconvex optimization is tackled by first deriving the optimal receive combiners in closed form, followed by developing a penalty dual decomposition (PDD) algorithm with guaranteed convergence to a stationary solution. To further reduce computational burden, we propose a low-complexity sequential scheme that determines AP modes via a heuristic metric and then updates the AN covariance matrices through closed-form iterations embedded in the PDD framework. Simulation results show that the proposed flexible-duplex architecture yields substantial secrecy-rate gains over CF systems with fixed AP roles. The joint optimization method attains the highest secrecy performance, while the low-complexity approach achieves over 90% of the optimal performance with an order-of-magnitude lower computational complexity, offering a practical solution for secure uplink communications in LAWNs.
Abstract:To overcome inherent limitations of analog signals in over-the-air computation (AirComp), this letter proposes a two's complement-based coding scheme for the AirComp implementation with compatible digital modulations. Specifically, quantized discrete values are encoded into binary sequences using the two's complement and transmitted over multiple subcarriers. At the receiver, we design a decoder that constructs a functional mapping between the superimposed digital modulation signals and the target of computational results, theoretically ensuring asymptotic error free computation with the minimal codeword length. To further mitigate the adverse effects of channel fading, we adopt a truncated inversion strategy for pre-processing. Benefiting from the unified symbol distribution after the proposed encoding, we derive the optimal linear minimum mean squared error (LMMSE) detector in closed form and propose a low complexity algorithm seeking for the optimal truncation selection. Furthermore, the inherent importance differences among the coded outputs motivate an uneven power allocation strategy across subcarriers to improve computational accuracy. Numerical results validate the superiority of the proposed scheme over existing digital AirComp approaches, especially at low signal to-noise ratio (SNR) regimes.
Abstract:Vision-language-action (VLA) models have enabled language-conditioned, long-horizon robot manipulation, but most existing systems are limited to grippers. Scaling VLA policies to bimanual robots with high degree-of-freedom (DoF) dexterous hands remains challenging due to the expanded action space, frequent hand-object occlusions, and the cost of collecting real-robot data. We present GR-Dexter, a holistic hardware-model-data framework for VLA-based generalist manipulation on a bimanual dexterous-hand robot. Our approach combines the design of a compact 21-DoF robotic hand, an intuitive bimanual teleoperation system for real-robot data collection, and a training recipe that leverages teleoperated robot trajectories together with large-scale vision-language and carefully curated cross-embodiment datasets. Across real-world evaluations spanning long-horizon everyday manipulation and generalizable pick-and-place, GR-Dexter achieves strong in-domain performance and improved robustness to unseen objects and unseen instructions. We hope GR-Dexter serves as a practical step toward generalist dexterous-hand robotic manipulation.
Abstract:Operating power amplifiers (PAs) at lower input back-off (IBO) levels is an effective way to improve PA efficiency, but often introduces severe nonlinear distortion that degrades transmission performance. Amplitude-phase-time block modulation (APTBM) has recently emerged as an effective solution to this problem. By leveraging the intrinsic amplitude and phase constraints of each APTBM block, PA-induced nonlinear distortion can be mitigated through constraint-guided signal reconstruction. However, existing reconstruction methods apply these constraints only heuristically and statistically, limiting the achievable IBO reduction and PA efficiency improvement. This paper addresses this limitation by decomposing the nonlinear distortion into dominant and residual components, and accordingly develops a novel two-stage signal reconstruction algorithm consisting of coarse and fine reconstruction stages. The coarse reconstruction stage eliminates the dominant distortion by jointly exploiting the APTBM block structure and PA nonlinear characteristics. The fine reconstruction stage minimizes the residual distortion by formulating a nonconvex optimization problem that explicitly enforces the APTBM constraints. To handle this problem efficiently, a low-complexity iterative variable substitution method is introduced, which relaxes the problem into a sequence of trust-region subproblems, each solvable in closed form. The proposed algorithm is validated through comprehensive numerical simulations and testbed experiments. Results show that it achieves up to 4 dB IBO reduction in simulations and up to 2 dB IBO reduction in experiments while maintaining transmission performance, corresponding to PA efficiency improvements of 59.1\% and 33.9\%, respectively, over existing methods.




Abstract:The bisimulation metric (BSM) is a powerful tool for analyzing state similarities within a Markov decision process (MDP), revealing that states closer in BSM have more similar optimal value functions. While BSM has been successfully utilized in reinforcement learning (RL) for tasks like state representation learning and policy exploration, its application to state similarity between multiple MDPs remains challenging. Prior work has attempted to extend BSM to pairs of MDPs, but a lack of well-established mathematical properties has limited further theoretical analysis between MDPs. In this work, we formally establish a generalized bisimulation metric (GBSM) for measuring state similarity between arbitrary pairs of MDPs, which is rigorously proven with three fundamental metric properties, i.e., GBSM symmetry, inter-MDP triangle inequality, and a distance bound on identical spaces. Leveraging these properties, we theoretically analyze policy transfer, state aggregation, and sampling-based estimation across MDPs, obtaining explicit bounds that are strictly tighter than existing ones derived from the standard BSM. Additionally, GBSM provides a closed-form sample complexity for estimation, improving upon existing asymptotic results based on BSM. Numerical results validate our theoretical findings and demonstrate the effectiveness of GBSM in multi-MDP scenarios.




Abstract:This paper investigates a project with stochastic activity durations and cash flows under discrete scenarios, where activities must satisfy precedence constraints generating cash inflows and outflows. The objective is to maximize expected net present value (NPV) by accelerating inflows and deferring outflows. We formulate the problem as a discrete-time Markov Decision Process (MDP) and propose a Double Deep Q-Network (DDQN) approach. Comparative experiments demonstrate that DDQN outperforms traditional rigid and dynamic strategies, particularly in large-scale or highly uncertain environments, exhibiting superior computational capability, policy reliability, and adaptability. Ablation studies further reveal that the dual-network architecture mitigates overestimation of action values, while the target network substantially improves training convergence and robustness. These results indicate that DDQN not only achieves higher expected NPV in complex project optimization but also provides a reliable framework for stable and effective policy implementation.
Abstract:Underwater exploration offers critical insights into our planet and attracts increasing attention for its broader applications in resource exploration, national security, etc. We study the underwater scene understanding methods, which aim to achieve automated underwater exploration. The underwater scene understanding task demands multi-task perceptions from multiple granularities. However, the absence of large-scale underwater multi-task instruction-tuning datasets hinders the progress of this research. To bridge this gap, we construct NautData, a dataset containing 1.45 M image-text pairs supporting eight underwater scene understanding tasks. It enables the development and thorough evaluation of the underwater scene understanding models. Underwater image degradation is a widely recognized challenge that interferes with underwater tasks. To improve the robustness of underwater scene understanding, we introduce physical priors derived from underwater imaging models and propose a plug-and-play vision feature enhancement (VFE) module, which explicitly restores clear underwater information. We integrate this module into renowned baselines LLaVA-1.5 and Qwen2.5-VL and build our underwater LMM, NAUTILUS. Experiments conducted on the NautData and public underwater datasets demonstrate the effectiveness of the VFE module, consistently improving the performance of both baselines on the majority of supported tasks, thus ensuring the superiority of NAUTILUS in the underwater scene understanding area. Data and models are available at https://github.com/H-EmbodVis/NAUTILUS.
Abstract:Networked integrated sensing and communication (ISAC) has gained significant attention as a promising technology for enabling next-generation wireless systems. To further enhance networked ISAC, delegating the reception of sensing signals to dedicated target monitoring terminals (TMTs) instead of base stations (BSs) offers significant advantages in terms of sensing capability and deployment flexibility. Despite its potential, the coordinated beamforming design for networked integrated communication and time-of-arrival (ToA)-based multi-TMT localization remains largely unexplored. In this paper, we present a comprehensive study to fill this gap. Specifically, we first establish signal models for both communication and localization, and, for the first time, derive a closed-form Cram\'er-Rao lower bound (CRLB) to characterize the localization performance. Subsequently, we exploit this CRLB to formulate two optimization problems, focusing on sensing-centric and communication-centric criteria, respectively. For the sensing-centric problem, we develop a globally optimal algorithm based on semidefinite relaxation (SDR) when each BS is equipped with more antennas than the total number of communication users. While for the communication-centric problem, we design a globally optimal algorithm for the single-BS case using bisection search. For the general case of both problems, we propose a unified successive convex approximation (SCA)-based algorithm, which is suboptimal yet efficient, and further extend it from single-target scenarios to more practical multi-target scenarios. Finally, simulation results demonstrate the effectiveness of our proposed algorithms, reveal the intrinsic performance trade-offs between communication and localization, and further show that deploying more TMTs is always preferable to deploying more BSs in networked ISAC systems.