Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai, China




Abstract:Graphical User Interface (GUI) agents have made substantial strides in understanding and executing user instructions across diverse platforms. Yet, grounding these instructions to precise interface elements remains challenging, especially in complex, high-resolution, professional environments. Traditional supervised finetuning (SFT) methods often require large volumes of diverse data and exhibit weak generalization. To overcome these limitations, we introduce a reinforcement learning (RL) based framework that incorporates three core strategies: (1) seed data curation to ensure high quality training samples, (2) a dense policy gradient that provides continuous feedback based on prediction accuracy, and (3) a self evolutionary reinforcement finetuning mechanism that iteratively refines the model using attention maps. With only 3k training samples, our 7B-parameter model achieves state-of-the-art results among similarly sized models on three grounding benchmarks. Notably, it attains 47.3\% accuracy on the ScreenSpot-Pro dataset, outperforming much larger models, such as UI-TARS-72B, by a margin of 24.2\%. These findings underscore the effectiveness of RL-based approaches in enhancing GUI agent performance, particularly in high-resolution, complex environments.
Abstract:Multiple Instance Learning is the predominant method for Whole Slide Image classification in digital pathology, enabling the use of slide-level labels to supervise model training. Although MIL eliminates the tedious fine-grained annotation process for supervised learning, whether it can learn accurate bag- and instance-level classifiers remains a question. To address the issue, instance-level classifiers and instance masks were incorporated to ground the prediction on supporting patches. These methods, while practically improving the performance of MIL methods, may potentially introduce noisy labels. We propose to bridge the gap between commonly used MIL and fully supervised learning by augmenting both the bag- and instance-level learning processes with pseudo-label correction capabilities elicited from weak to strong generalization techniques. The proposed algorithm improves the performance of dual-level MIL algorithms on both bag- and instance-level predictions. Experiments on public pathology datasets showcase the advantage of the proposed methods.




Abstract:The efficient rendering and explicit nature of 3DGS promote the advancement of 3D scene manipulation. However, existing methods typically encounter challenges in controlling the manipulation region and are unable to furnish the user with interactive feedback, which inevitably leads to unexpected results. Intuitively, incorporating interactive 3D segmentation tools can compensate for this deficiency. Nevertheless, existing segmentation frameworks impose a pre-processing step of scene-specific parameter training, which limits the efficiency and flexibility of scene manipulation. To deliver a 3D region control module that is well-suited for scene manipulation with reliable efficiency, we propose interactive Segment-and-Manipulate 3D Gaussians (iSegMan), an interactive segmentation and manipulation framework that only requires simple 2D user interactions in any view. To propagate user interactions to other views, we propose Epipolar-guided Interaction Propagation (EIP), which innovatively exploits epipolar constraint for efficient and robust interaction matching. To avoid scene-specific training to maintain efficiency, we further propose the novel Visibility-based Gaussian Voting (VGV), which obtains 2D segmentations from SAM and models the region extraction as a voting game between 2D Pixels and 3D Gaussians based on Gaussian visibility. Taking advantage of the efficient and precise region control of EIP and VGV, we put forth a Manipulation Toolbox to implement various functions on selected regions, enhancing the controllability, flexibility and practicality of scene manipulation. Extensive results on 3D scene manipulation and segmentation tasks fully demonstrate the significant advantages of iSegMan. Project page is available at https://zhao-yian.github.io/iSegMan.




Abstract:Recent advances in vision language models (VLMs) have enabled broad progress in the general medical field. However, pathology still remains a more challenging subdomain, with current pathology specific VLMs exhibiting limitations in both diagnostic accuracy and reasoning plausibility. Such shortcomings are largely attributable to the nature of current pathology datasets, which are primarily composed of image description pairs that lack the depth and structured diagnostic paradigms employed by real world pathologists. In this study, we leverage pathology textbooks and real world pathology experts to construct high-quality, reasoning-oriented datasets. Building on this, we introduce Patho-R1, a multimodal RL-based pathology Reasoner, trained through a three-stage pipeline: (1) continued pretraining on 3.5 million image-text pairs for knowledge infusion; (2) supervised fine-tuning on 500k high-quality Chain-of-Thought samples for reasoning incentivizing; (3) reinforcement learning using Group Relative Policy Optimization and Decoupled Clip and Dynamic sAmpling Policy Optimization strategies for multimodal reasoning quality refinement. To further assess the alignment quality of our dataset, we propose PathoCLIP, trained on the same figure-caption corpus used for continued pretraining. Comprehensive experimental results demonstrate that both PathoCLIP and Patho-R1 achieve robust performance across a wide range of pathology-related tasks, including zero-shot classification, cross-modal retrieval, Visual Question Answering, and Multiple Choice Question. Our project is available at the Patho-R1 repository: https://github.com/Wenchuan-Zhang/Patho-R1.




Abstract:Given usefulness of protein language models (LMs) in structure and functional inference, RNA LMs have received increased attentions in the last few years. However, these RNA models are often not compared against the same standard. Here, we divided RNA LMs into three classes (pretrained on multiple RNA types (especially noncoding RNAs), specific-purpose RNAs, and LMs that unify RNA with DNA or proteins or both) and compared 13 RNA LMs along with 3 DNA and 1 protein LMs as controls in zero-shot prediction of RNA secondary structure and functional classification. Results shows that the models doing well on secondary structure prediction often perform worse in function classification or vice versa, suggesting that more balanced unsupervised training is needed.




Abstract:Tiny object detection plays a vital role in drone surveillance, remote sensing, and autonomous systems, enabling the identification of small targets across vast landscapes. However, existing methods suffer from inefficient feature leverage and high computational costs due to redundant feature processing and rigid query allocation. To address these challenges, we propose Dome-DETR, a novel framework with Density-Oriented Feature-Query Manipulation for Efficient Tiny Object Detection. To reduce feature redundancies, we introduce a lightweight Density-Focal Extractor (DeFE) to produce clustered compact foreground masks. Leveraging these masks, we incorporate Masked Window Attention Sparsification (MWAS) to focus computational resources on the most informative regions via sparse attention. Besides, we propose Progressive Adaptive Query Initialization (PAQI), which adaptively modulates query density across spatial areas for better query allocation. Extensive experiments demonstrate that Dome-DETR achieves state-of-the-art performance (+3.3 AP on AI-TOD-V2 and +2.5 AP on VisDrone) while maintaining low computational complexity and a compact model size. Code will be released upon acceptance.




Abstract:Bio-inspired Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D spatio-temporal features. However, existing 3D SNNs have struggled with long-range dependencies until the recent emergence of Mamba, which offers superior computational efficiency and sequence modeling capability. In this work, we propose Spiking Point Mamba (SPM), the first Mamba-based SNN in the 3D domain. Due to the poor performance of simply transferring Mamba to 3D SNNs, SPM is designed to utilize both the sequence modeling capabilities of Mamba and the temporal feature extraction of SNNs. Specifically, we first introduce Hierarchical Dynamic Encoding (HDE), an improved direct encoding method that effectively introduces dynamic temporal mechanism, thereby facilitating temporal interactions. Then, we propose a Spiking Mamba Block (SMB), which builds upon Mamba while learning inter-time-step features and minimizing information loss caused by spikes. Finally, to further enhance model performance, we adopt an asymmetric SNN-ANN architecture for spike-based pre-training and finetune. Compared with the previous state-of-the-art SNN models, SPM improves OA by +6.2%, +6.1%, and +7.4% on three variants of ScanObjectNN, and boosts instance mIOU by +1.9% on ShapeNetPart. Meanwhile, its energy consumption is at least 3.5x lower than that of its ANN counterpart. The code will be made publicly available.




Abstract:Image decomposition offers deep insights into the imaging factors of visual data and significantly enhances various advanced computer vision tasks. In this work, we introduce a novel approach to low-light image enhancement based on decomposed physics-informed priors. Existing methods that directly map low-light to normal-light images in the sRGB color space suffer from inconsistent color predictions and high sensitivity to spectral power distribution (SPD) variations, resulting in unstable performance under diverse lighting conditions. To address these challenges, we introduce a Physics-informed Color-aware Transform (PiCat), a learning-based framework that converts low-light images from the sRGB color space into deep illumination-invariant descriptors via our proposed Color-aware Transform (CAT). This transformation enables robust handling of complex lighting and SPD variations. Complementing this, we propose the Content-Noise Decomposition Network (CNDN), which refines the descriptor distributions to better align with well-lit conditions by mitigating noise and other distortions, thereby effectively restoring content representations to low-light images. The CAT and the CNDN collectively act as a physical prior, guiding the transformation process from low-light to normal-light domains. Our proposed PiCat framework demonstrates superior performance compared to state-of-the-art methods across five benchmark datasets.




Abstract:Integrated heterogeneous service provisioning (IHSP) is a promising paradigm that is designed to concurrently support a variety of heterogeneous services, extending beyond sensing and communication to meet the diverse needs of emerging applications. However, a primary challenge of IHSP is addressing the conflicts between multiple competing service demands under constrained resources. In this paper, we overcome this challenge by the joint use of two novel elastic design strategies: compromised service value assessment and flexible multi-dimensional resource multiplexing. Consequently, we propose a value-prioritized elastic multi-dimensional multiple access (MDMA) mechanism for IHSP systems. First, we modify the Value-of-Service (VoS) metric by incorporating elastic parameters to characterize user-specific tolerance and compromise in response to various performance degradations under constrained resources. This VoS metric serves as the foundation for prioritizing services and enabling effective fairness service scheduling among concurrent competing demands. Next, we adapt the MDMA to elastically multiplex services using appropriate multiple access schemes across different resource domains. This protocol leverages user-specific interference tolerances and cancellation capabilities across different domains to reduce resource-demanding conflicts and co-channel interference within the same domain. Then, we maximize the system's VoS by jointly optimizing MDMA design and power allocation. Since this problem is non-convex, we propose a monotonic optimization-assisted dynamic programming (MODP) algorithm to obtain its optimal solution. Additionally, we develop the VoS-prioritized successive convex approximation (SCA) algorithm to efficiently find its suboptimal solution. Finally, simulations are presented to validate the effectiveness of the proposed designs.
Abstract:Coordinated beamforming across distributed base stations (BSs) in cell-free architectures can efficiently support integrated sensing and communication (ISAC) users by improving resource sharing and reducing conflicts in the spatial domain. However, coordinating numerous BSs within the ISAC network poses risks of generating substantial interference for other networks sharing the spectrum, while also increasing operational costs from power consumption and signaling overhead. Therefore, in this paper, we propose an interference-suppressed and cost-optimized cell-free ISAC network by opportunistically and cooperatively orchestrating distributed radio resources to address competing sensing and communication (S\&C) demands. Specifically, we conceive a radiation footprint control mechanism that autonomously suppresses interference across the entire signal propagation space to safeguard other networks without exchanging signaling. Then, we propose joint BS activation and beamforming coordination to dynamically activate appropriate BSs and orchestrate their spatial beams for service provisioning. Building upon this framework, we formulate a cost-efficient utility maximization problem that considers individual S\&C demands and location-dependent radiation footprint constraints. Since this results in a non-convex optimization problem, we develop a monotonic optimization embedded branch-and-bound (MO-BRB) algorithm to find the optimal solution. Additionally, we apply a low-complexity iterative method to obtain near-optimal solutions. Finally, simulation results validate the effectiveness of the proposed algorithms.