Henry
Abstract:We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.
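The geometry behind the coarse-to-fine strategy can be sketched as follows. This is a minimal illustration, not the MinerU2.5 implementation: the names `Box`, `downsample_size`, and `to_native_crop` and the `max_side` parameter are hypothetical, and the layout model that would produce the boxes in the first stage is not shown. The sketch only shows how a box found on a downsampled page could be mapped back to native-resolution pixel bounds for the second stage.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in thumbnail pixel coordinates (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float


def downsample_size(width, height, max_side):
    """Return a thumbnail size whose longer side is at most max_side."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def to_native_crop(box, thumb_size, native_size):
    """Map a thumbnail box to integer crop bounds in the original image.

    Bounds are rounded outward and clamped so the crop never loses
    pixels that belong to the detected element.
    """
    sx = native_size[0] / thumb_size[0]
    sy = native_size[1] / thumb_size[1]
    left = max(0, math.floor(box.left * sx))
    top = max(0, math.floor(box.top * sy))
    right = min(native_size[0], math.ceil(box.right * sx))
    bottom = min(native_size[1], math.ceil(box.bottom * sy))
    return left, top, right, bottom


# Stage 1 would run layout analysis on the thumbnail; stage 2 would run
# recognition on each native-resolution crop.
native = (4000, 3000)
thumb = downsample_size(native[0], native[1], max_side=1000)
crop = to_native_crop(Box(100, 50, 300, 150), thumb, native)
print(thumb, crop)
```

The point of the split is that the expensive high-resolution pixels are only touched inside the crops, while the full page is seen only at thumbnail size.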
Abstract:In this paper, we present a new wireless sensing system utilizing a movable antenna (MA) that continuously moves and receives sensing signals to enhance sensing performance over the conventional fixed-position antenna (FPA) sensing. We show that the angle estimation performance is fundamentally determined by the MA trajectory, and derive the Cramer-Rao bound (CRB) of the mean square error (MSE) for angle-of-arrival (AoA) estimation as a function of the trajectory for both one-dimensional (1D) and two-dimensional (2D) antenna movement. For the 1D case, a globally optimal trajectory that minimizes the CRB is derived in closed form. Notably, the resulting CRB decreases cubically with sensing time in the time-constrained regime, whereas it decreases linearly with sensing time and quadratically with the movement line segment's length in the space-constrained regime. For the 2D case, we aim to achieve the minimum of maximum (min-max) CRBs of estimation MSE for the two AoAs with respect to the horizontal and vertical axes. To this end, we design an efficient alternating optimization algorithm that iteratively updates the MA's horizontal or vertical coordinates with the other being fixed, yielding a locally optimal trajectory. Numerical results show that the proposed 1D/2D MA-based sensing schemes significantly reduce both the CRB and actual AoA estimation MSE compared to conventional FPA-based sensing with uniform linear/planar arrays (ULAs/UPAs) as well as various benchmark MA trajectories. Moreover, it is revealed that the steering vectors of our designed 1D/2D MA trajectories have low correlation in the angular domain, thereby effectively increasing the angular resolution for achieving higher AoA estimation accuracy.
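The scaling laws stated for the 1D case can be checked numerically with a small sketch. It rests on an assumption that is not taken from the abstract: for a single source with unknown complex gain, the CRB of the spatial frequency is taken to be proportional to 1 / (N · var(x)), where x holds the N antenna positions at the sampling instants. The two trajectories below are illustrative idealizations (constant-speed motion, and dwelling at the two endpoints of the segment with travel time ignored), not the closed-form optimum derived in the paper.

```python
def position_variance(positions):
    """Population variance of the sampled antenna positions."""
    n = len(positions)
    mean = sum(positions) / n
    return sum((p - mean) ** 2 for p in positions) / n


def crb_proxy(positions):
    """Quantity proportional to the CRB under the assumed model."""
    return 1.0 / (len(positions) * position_variance(positions))


def constant_speed_trajectory(num_samples, speed, dt):
    """Time-constrained idealization: move in one direction at fixed speed."""
    return [speed * dt * k for k in range(num_samples)]


def endpoint_trajectory(num_samples, length):
    """Space-constrained idealization: half the samples at each end."""
    half = num_samples // 2
    return [0.0] * half + [float(length)] * (num_samples - half)


# Doubling the sensing time at constant speed: proxy drops by about 8x (cubic).
time_ratio = (crb_proxy(constant_speed_trajectory(1000, 1.0, 1.0))
              / crb_proxy(constant_speed_trajectory(2000, 1.0, 1.0)))

# On a fixed segment: doubling time halves the proxy, doubling length quarters it.
base = crb_proxy(endpoint_trajectory(100, 2.0))
longer_time_ratio = base / crb_proxy(endpoint_trajectory(200, 2.0))
longer_segment_ratio = base / crb_proxy(endpoint_trajectory(100, 4.0))
print(time_ratio, longer_time_ratio, longer_segment_ratio)
```

Under this assumed model the position variance grows quadratically with time while the antenna is still moving, which together with the linear growth in the number of samples gives the cubic decay; once the aperture is capped, only the sample count keeps growing.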
Abstract:Glaucoma is one of the leading causes of irreversible blindness worldwide. Glaucoma prognosis is essential for identifying at-risk patients and enabling timely intervention to prevent blindness. Many existing approaches rely on historical sequential data but are constrained by fixed-length inputs, limiting their flexibility. Additionally, traditional glaucoma prognosis methods often employ end-to-end models, which struggle with the limited size of glaucoma datasets. To address these challenges, we propose a Two-Stage Decoupling Framework (TSDF) for variable-length glaucoma prognosis. In the first stage, we employ a feature representation module that leverages self-supervised learning to aggregate multiple glaucoma datasets for training, disregarding differences in their supervisory information. This approach enables datasets of varying sizes to learn better feature representations. In the second stage, we introduce a temporal aggregation module that incorporates an attention-based mechanism to process sequential inputs of varying lengths, ensuring flexible and efficient utilization of all available data. This design significantly enhances model performance while maintaining a compact parameter size. Extensive experiments on two benchmark glaucoma datasets, the Ocular Hypertension Treatment Study (OHTS) and the Glaucoma Real-world Appraisal Progression Ensemble (GRAPE), which differ significantly in scale and clinical settings, demonstrate the effectiveness and robustness of our approach.
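One way an attention-based module can accept sequences of any length is softmax-weighted pooling, sketched below. This is a generic illustration under stated assumptions, not the TSDF temporal aggregation module: the function name `attention_pool`, the dot-product scoring, and the fixed `query` vector (which would be learned in practice) are all hypothetical.

```python
import math


def attention_pool(visit_features, query):
    """Collapse a variable-length list of feature vectors into one vector.

    Each visit is scored by its dot product with the query; the scores
    are normalized with a softmax and used as averaging weights, so the
    output size depends only on the feature dimension.
    """
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in visit_features]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query)
    return [sum(w * feat[d] for w, feat in zip(weights, visit_features))
            for d in range(dim)]


# Patients with two and four visits yield vectors of the same size.
query = [0.5, -0.25, 1.0]
short_history = [[0.1, 0.2, 0.3], [0.4, 0.1, 0.0]]
long_history = short_history + [[0.3, 0.3, 0.3], [0.9, 0.0, 0.2]]
print(attention_pool(short_history, query))
print(attention_pool(long_history, query))
```

Because the weights always sum to one, no padding or truncation to a fixed length is needed.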
Abstract:ML models are susceptible to risks to security, privacy, and fairness. Several defenses are designed to protect against their intended risks, but can inadvertently affect susceptibility to other unrelated risks, known as unintended interactions. Several jurisdictions are preparing ML regulatory frameworks that require ML practitioners to assess the susceptibility of ML models to different risks. This calls for a library for evaluating unintended interactions that can be used by (a) practitioners to evaluate unintended interactions at scale prior to model deployment and (b) researchers to design defenses which do not suffer from an unintended increase in unrelated risks. Ideally, such a library should be i) comprehensive by including representative attacks, defenses and metrics for different risks, ii) extensible to new modules due to its modular design, iii) consistent with a user-friendly API template for inputs and outputs, iv) applicable to evaluate previously unexplored unintended interactions. We present AMULET, a Python library that covers risks to security, privacy, and fairness, and satisfies all these requirements. AMULET can be used to evaluate unexplored unintended interactions, compare effectiveness between defenses or attacks, and include new attacks and defenses.
Abstract:This paper proposes a new architecture for the low-earth orbit (LEO) satellite ground station aided by a movable antenna (MA) array. Unlike a conventional fixed-position antenna (FPA) array, the MA array can flexibly adjust antenna positions to reconfigure the array geometry, mitigating interference more effectively and improving communication performance in ultra-dense LEO satellite networks. To reduce movement overhead, we configure antenna positions at the antenna initialization stage, which remain unchanged during the whole communication period of the ground station. To this end, an optimization problem is formulated to maximize the average achievable rate of the ground station by jointly optimizing its antenna position vector (APV) and time-varying beamforming weights, i.e., antenna weight vectors (AWVs). To solve the resulting non-convex optimization problem, we adopt the Lagrangian dual transformation and quadratic transformation to reformulate the objective function into a more tractable form. Then, we develop an efficient block coordinate descent-based iterative algorithm that alternately optimizes the APV and AWVs until convergence is reached. Simulation results demonstrate that our proposed MA scheme significantly outperforms traditional FPA by increasing the achievable rate at ground stations under various system setups, thus providing an efficient solution for interference mitigation in future ultra-dense LEO satellite communication networks.
Abstract:This paper studies the intelligent reflecting surface (IRS) deployment optimization problem for IRS-enabled integrated sensing and communications (ISAC) systems, in which multiple IRSs are strategically deployed at candidate locations to assist a base station (BS) to enhance the coverage of both sensing and communications. We present an environment-aware IRS deployment design via exploiting the channel knowledge map (CKM), which provides the channel state information (CSI) between each candidate IRS location and BS or targeted sensing/communication points. Based on the obtained CSI from CKM, we optimize the deployment of IRSs, jointly with the BS's transmit beamforming and IRSs' reflective beamforming during operation, with the objective of minimizing the system cost, while guaranteeing the minimum illumination power requirements at sensing areas and the minimum signal-to-noise ratio (SNR) requirements at communication areas. In particular, we consider two cases when the IRSs' reflective beamforming optimization can be implemented dynamically in real time and quasi-stationarily over the whole operation period, respectively. For both cases, the joint IRS deployment and transmit/reflective beamforming designs are formulated as mixed-integer non-convex optimization problems, which are solved via the successive convex approximation (SCA)-based relax-and-bound method. Specifically, we first relax the binary IRS deployment indicators into continuous variables, then find converged solutions via SCA, and finally round relaxed indicators back to binary values. Numerical results demonstrate the effectiveness of our proposed algorithms in reducing the system cost while meeting the sensing and communication requirements.
Abstract:Global human motion reconstruction from in-the-wild monocular videos is increasingly demanded across VR, graphics, and robotics applications, yet requires accurate mapping of human poses from camera to world coordinates, a task challenged by depth ambiguity, motion ambiguity, and the entanglement between camera and human movements. While human-motion-centric approaches excel in preserving motion details and physical plausibility, they suffer from two critical limitations: insufficient exploitation of camera orientation information and ineffective integration of camera translation cues. We present WATCH (World-aware Allied Trajectory and pose reconstruction for Camera and Human), a unified framework addressing both challenges. Our approach introduces an analytical heading angle decomposition technique that offers superior efficiency and extensibility compared to existing geometric methods. Additionally, we design a camera trajectory integration mechanism inspired by world models, providing an effective pathway for leveraging camera translation information beyond naive hard-decoding approaches. Through experiments on in-the-wild benchmarks, WATCH achieves state-of-the-art performance in end-to-end trajectory reconstruction. Our work demonstrates the effectiveness of jointly modeling camera-human motion relationships and offers new insights for addressing the long-standing challenge of camera translation integration in global human motion reconstruction. The code will be made publicly available.
Abstract:Vision-Language Models (VLMs) are increasingly deployed in real-world applications, but their high inference cost makes them vulnerable to resource consumption attacks. Prior attacks attempt to extend VLM output sequences by optimizing adversarial images, thereby increasing inference costs. However, these extended outputs often introduce irrelevant abnormal content, compromising attack stealthiness. This trade-off between effectiveness and stealthiness poses a major limitation for existing attacks. To address this challenge, we propose Hidden Tail, a stealthy resource consumption attack that crafts prompt-agnostic adversarial images, inducing VLMs to generate maximum-length outputs by appending special tokens invisible to users. Our method employs a composite loss function that balances semantic preservation, repetitive special token induction, and suppression of the end-of-sequence (EOS) token, optimized via a dynamic weighting strategy. Extensive experiments show that Hidden Tail outperforms existing attacks, increasing output length by up to 19.2× and reaching the maximum token limit, while preserving attack stealthiness. These results highlight the urgent need to improve the robustness of VLMs against efficiency-oriented adversarial threats. Our code is available at https://github.com/zhangrui4041/Hidden_Tail.
Abstract:Recent progress in large language models (LLMs) has leveraged their in-context learning (ICL) abilities to enable quick adaptation to unseen biomedical NLP tasks. By incorporating only a few input-output examples into prompts, LLMs can rapidly perform these new tasks. While the impact of these demonstrations on LLM performance has been extensively studied, most existing approaches prioritize representativeness over diversity when selecting examples from large corpora. To address this gap, we propose Dual-Div, a diversity-enhanced data-efficient framework for demonstration selection in biomedical ICL. Dual-Div employs a two-stage retrieval and ranking process. First, it identifies a limited set of candidate examples from a corpus by optimizing both representativeness and diversity (with optional annotation for unlabeled data). Second, it ranks these candidates against test queries to select the most relevant and non-redundant demonstrations. Evaluated on three biomedical NLP tasks (named entity recognition (NER), relation extraction (RE), and text classification (TC)) using LLaMA 3.1 and Qwen 2.5 for inference, along with three retrievers (BGE-Large, BMRetriever, MedCPT), Dual-Div consistently outperforms baselines, achieving up to 5% higher macro-F1 scores, while demonstrating robustness to prompt permutations and class imbalance. Our findings establish that diversity in initial retrieval is more critical than ranking-stage optimization, and limiting demonstrations to 3-5 examples maximizes performance efficiency.
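The idea of selecting demonstrations that are both relevant and non-redundant can be illustrated with a greedy maximal-marginal-relevance (MMR) style selector. MMR is used here as a stand-in technique; the abstract does not state that Dual-Div uses it. The function names, the `trade_off` parameter, and the toy two-dimensional embeddings are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_demonstrations(query, candidates, k, trade_off=0.5):
    """Greedily pick k candidate indices, trading relevance to the query
    against similarity to demonstrations that were already picked."""
    selected = []
    remaining = list(range(len(candidates)))

    def score(i):
        relevance = cosine(query, candidates[i])
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return trade_off * relevance - (1.0 - trade_off) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 points elsewhere.
query = [1.0, 0.5]
candidates = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
picked = select_demonstrations(query, candidates, k=2)
print(picked)
```

In this toy case the most relevant candidate is chosen first, and its near-duplicate is then passed over in favor of the candidate that adds new information.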
Abstract:Multi-sequence Magnetic Resonance Imaging (MRI) offers remarkable versatility, enabling the distinct visualization of different tissue types. Nevertheless, the inherent heterogeneity among MRI sequences poses significant challenges to the generalization capability of deep learning models. These challenges undermine model performance when faced with varying acquisition parameters, thereby severely restricting their clinical utility. In this study, we present PRISM, a foundation model PRe-trained with large-scale multI-Sequence MRI. We collected a total of 64 datasets from both public and private sources, encompassing a wide range of whole-body anatomical structures, with scans spanning diverse MRI sequences. Among them, 336,476 volumetric MRI scans from 34 datasets (8 public and 26 private) were curated to construct the largest multi-organ multi-sequence MRI pretraining corpus to date. We propose a novel pretraining paradigm that disentangles anatomically invariant features from sequence-specific variations in MRI, while preserving high-level semantic representations. We established a benchmark comprising 44 downstream tasks, including disease diagnosis, image segmentation, registration, progression prediction, and report generation. These tasks were evaluated on 32 public datasets and 5 private cohorts. PRISM consistently outperformed both non-pretrained models and existing foundation models, achieving first-rank results in 39 out of 44 downstream benchmarks with statistically significant improvements. These results underscore its ability to learn robust and generalizable representations across unseen data acquired under diverse MRI protocols. PRISM provides a scalable framework for multi-sequence MRI analysis, thereby enhancing the translational potential of AI in radiology. It delivers consistent performance across diverse imaging protocols, reinforcing its clinical applicability.