Abstract:All-in-One image restoration aims to develop a unified restoration framework for handling diverse degradation types. Existing end-to-end methods usually regard the restoration process as a black-box mapping, lacking an explicit optimization interpretation. Although deep unfolding provides an interpretable iterative modeling paradigm for image restoration, existing methods mostly rely on fixed degradation assumptions or predefined degradation information, making them difficult to adapt to unified restoration requirements under complex degradations and locally damaged content. This limitation restricts their performance in degradation suppression and structural detail recovery. To address these issues, this paper proposes DVANet, a deep unfolding network inspired by the half-quadratic splitting optimization algorithm, which formulates unified image restoration under complex degradations as a collaborative unfolding process between degradation-aware observation consistency and visual-prior-guided reconstruction. Specifically, in the degradation-aware observation consistency branch, a degradation representation module is employed to extract global degradation attributes and local degradation cues, and degradation-conditioned mapping is used to enhance the model's adaptability to different degradation types. In the visual-prior-guided reconstruction branch, DINOv3 is introduced to provide structural and semantic information as hierarchical visual priors, thereby complementing the missing structural information in damaged regions and improving detail recovery. Extensive experiments demonstrate that DVANet achieves superior or competitive performance on multi-scenario degradation and cross-domain image restoration tasks, showing favorable degradation adaptability and generalization ability.
Abstract:Computer-Aided Design is pivotal in modern manufacturing, yet existing automated methods predominantly rely on open-loop, one-shot generation, creating a mismatch with iterative real-world practices. In this paper, we present IterCAD, a unified multimodal agent framework for closed-loop, interactive CAD generation and editing. We formulate the task as a multi-turn interaction between a multimodal agent and an executable CAD sandbox, covering three tasks: Drawing-to-Code, Text-to-Code, and Interactive Editing. To support this, we develop a data synthesis pipeline incorporating advanced industrial manufacturing features to generate standard-compliant multi-view engineering drawings, complex code-editing tasks, and high-fidelity interaction trajectories. We optimize the agent via progressive SFT followed by geometry-aware reinforcement learning with viable-prefix masking to enhance code executability and geometric fidelity. Finally, we introduce the IterCAD-Bench evaluation suite and propose the Chamfer Distance Tolerance-Recall (CD-TR) curve alongside its AUC-TR metric, establishing a survivor-bias-free standard that unifies code validity and geometric precision. Extensive experiments demonstrate that IterCAD achieves highly competitive performance across multiple benchmarks, significantly outperforming existing approaches in both code executability and geometric precision, while exhibiting superior capabilities in closed-loop iterative refinement.
Abstract:Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation, while API-basedapproaches struggle with heterogeneous protocols and inaccessible commercial interfaces. In this work,we identify the Component Object Model (COM) as a unified executable abstraction, proposing COM-as-Action: a new paradigm that reframes professional software interaction as deterministic program synthesisrather than sequential visual control. To validate this paradigm in the most demanding environments, weintroduce ComCADBench, the first benchmark for agents operating real industrial CAD software. Ourexperiments reveal a substantial paradigm gap: frontier proprietary models achieve near-zero successunder GUI-based interaction, whereas COM-based execution yields substantial immediate gains. Tobridge the remaining gap between syntactic correctness and geometric accuracy, we develop ComActor, aself-correcting agent trained through a progressive three-stage framework, alongside ComForge, a scalableplatform for large-scale training in Windows containers. Extensive experiments show that ComActorachieves state-of-the-art performance on ComCADBench, with strong resilience in long-horizon taskswhere baselines collapse, and generalizes to external CAD benchmark.
Abstract:Most existing multi-exposure HDR methods follow a fixed feed-forward reconstruction paradigm, making them prone to ghosting artifacts in complex dynamic scenes. To address this issue, we propose HDRAgent, the first agent-driven framework for HDR imaging, which adaptively selects reconstruction strategies according to the current scene conditions. Specifically, to provide scene-specific prior knowledge, we introduce a fine-grained contextual knowledge matching (FCM) module. This module leverages multimodal large language model (MLLM)-derived scene perception to retrieve relevant historical cases and tool knowledge, organizing them into structured evidence for MLLM-based adaptive tool scheduling. In addition, we propose a perception--distortion feedback mechanism that transforms post-execution quality assessment and artifact diagnosis into structured feedback, which is accumulated in historical memory to help subsequent contextual knowledge refinement and strategy selection. Furthermore, considering that extreme motion can invalidate alignment methods, we design an agent-guided generative alignment strategy that uses MLLM-based dynamic-region parsing to reconstruct unreliable contents in non-reference frames under reference-frame guidance. Experiments demonstrate that HDRAgent effectively reduces ghosting and local artifacts while achieving competitive or superior objective performance and visual quality.
Abstract:3D vision has rapidly evolved, driven by increasingly diverse data representations, learning paradigms, and modeling strategies. Yet the field remains fragmented across representations and benchmarks, making it difficult to develop unified perspectives on efficiency, fidelity, and scalability. This work provides a data-centric taxonomy of 3D vision that connects geometric representations, datasets, learning frameworks, and applications within a single conceptual map. We begin by analysing the principal structural representations of 3D data--point clouds, meshes, voxels, and 3D Gaussians--along with their acquisition pipelines. We then examine how dataset design, benchmark construction, and supervision regimes shape recent advances, spanning 2D-supervised 3D learning, implicit neural representations, and 4D world modeling. Through this integrative lens, we clarify the relationships among representations, learning paradigms, and downstream tasks in reconstruction, generation, and video modeling, offering a consolidated view of emerging trends toward balancing efficiency and fidelity and toward multimodal geometric grounding.
Abstract:While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively shallow and are dominated by a text-centric paradigm, limiting their applicability to complex visual challenges. In contrast, human-like thought typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). To bridge this gap, we introduce InterSketch, an interleaved reasoning model to enhance the VT-CoT capability via self-correcting and stepwise reward mechanisms. InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning over long-horizon visual understanding tasks. Specifically, in the first cold-start stage, we propose a synthesized high-quality interleaved VT-CoT dataset and include a reflection mechanism to enable the model's capability in multi-turn interleaved reasoning and self-correction. In the subsequent reinforcement learning (RL) stage, we design a stepwise reward mechanism to mitigate the sparsity of reward signals inherent in end-only supervision over long-horizon reasoning. Extensive experiments on visual reasoning benchmarks demonstrate the effectiveness of InterSketch, even outperforming proprietary models such as Gemini-3-Pro.
Abstract:Non-contrast computed tomography calcium scoring (CTCS) is a cost-effective imaging modality widely used to detect coronary artery calcifications. This study aimed to develop an advanced machine learning framework that utilizes quantitative analyses of coronary calcium and epicardial fat from CTCS images to predict obstructive coronary artery disease (CAD). The study population consisted of 1,324 patients from the SCOT-HEART clinical trial who underwent both CTCS and coronary CT angiography. We extracted and analyzed a broad range of features, including 24 clinical variables, 189 calcium-omics, and 211 epicardial fat-omics features from the CTCS images. Feature selection was conducted using the CatBoost algorithm combined with SHapley Additive exPlanation (SHAP) values. Predictive modeling utilized the CatBoost gradient boosting method, focusing on the most informative features. From an initial set of 424 candidate features, 14 were identified as most predictive through the CatBoost-SHAP method. The top two predictive features originated from fat-omics, with the remaining 12 features derived from calcium-omics. The optimized model achieved robust predictive capabilities, demonstrating a sensitivity of 83.1+/-4.6%, specificity of 93.8+/-1.7%, accuracy of 85.3+/-2.0%, and an F1 score of 73.9+/-3.3%. Inclusion of calcium-omics and fat-omics data significantly improved predictive performance. Notably, the model also showed reliable predictive accuracy in patients with diverse coronary calcium scores, including cases with obstructive CAD despite a zero-calcium score. This innovative approach holds promise for improving clinical decision-making and potentially reducing dependence on contrast-enhanced or invasive diagnostic procedures, particularly within low-to intermediate-risk patient groups.
Abstract:Non-contrast computed tomography calcium scoring (CTCS) is widely recognized as an effective tool for cardiovascular risk stratification. This study aimed to develop a novel machine learning framework for predicting myocardial ischemia from routine non-contrast CTCS scans using quantitative coronary calcium assessment. This study analyzed 1,375 patients who underwent both non-contrast CTCS and regadenoson stress cardiac positron emission tomography myocardial perfusion imaging within one year at University Hospitals Cleveland Medical Center. A total of 74 variables, including clinical variables, Agatston score, and calcium-omics features, were evaluated. Relevant features were identified using XGBoost with Shapley Additive exPlanations (SHAP). Predictive models were trained and evaluated using 5-fold cross-validation. Among 987 patients, 89 (9%) were positive for myocardial ischemia. The final model incorporated the Agatston score, eight calcium-omics features, and age. The proposed model achieved a precision of 98.9+/-3.0%, sensitivity of 79.2+/-8.4, and F1 score of 87.7+/-5.3%. The addition of calcium-omics features significantly improved predictive performance compared with models using clinical variables alone or clinical variables with the Agatston score (p<0.05). Interestingly, the number of calcified arteries, despite being the lowest-ranked feature based on SHAP analysis, showed the strongest association with myocardial ischemia in logistic regression analysis (odds ratio: 3.63, 95% confidence interval: 2.80-4.77, p<0.00001). We developed a machine learning approach for predicting myocardial ischemia using routinely acquired non-contrast CTCS scans. Calcium-omics features provided incremental predictive value beyond conventional risk factors and Agatston scoring and may support more accessible cardiovascular risk stratification.
Abstract:Gradient-based attacks are important methods for evaluating model robustness. However, since the proposal of APGD, it has been difficult for such methods to achieve significant breakthroughs. To achieve such an effect, we first analyze the issue of "high-loss non-adversarial examples" that degrades attack performance in previous methods, and prove that this issue arises from inappropriate objectives for adversarial example generation. Subsequently, we reconstruct the objective as "maximizing the difference between the non-ground-truth label probability upper bound and the ground-truth label probability", and proposes a novel and powerful gradient-based attack method named Sequential Difference Maximization (SDM). SDM establishes a three-layer optimization framework of "cycle-stage-step". It adopts the negative probability loss function and the Directional Probability Difference Ratio (DPDR) loss function in the initial and subsequent optimization stages, respectively, and approaches the ideal objective of adversarial example generation via stage-wise sequential optimization. Experiments demonstrate that compared with previous state-of-the-art methods, SDM not only achieves stronger attack performance but also exhibits superior cost-effectiveness. The code is available at https://github.com/X-L-Liu/ICML-SDM.
Abstract:Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, yet real-world deployment often requires continual capability expansion across sequential tasks. In such scenarios, Multimodal Continual Instruction Tuning (MCIT) aims to acquire new capabilities while limiting catastrophic forgetting. Existing methods mainly follow a module-composition paradigm: they maintain task-level prompts or LoRA experts and dynamically route or aggregate a subset of them at inference. However, samples within the same task can still differ substantially in visual scenes, question intents, and reasoning demands. This motivates instance-level adaptation to individual query-image pairs rather than only selecting or combining task-level modules. To this end, we propose DRAPE (Dynamic Cross-Modal Prompt Generation), a prompt-learning framework that synthesizes continuous instance-specific soft prompts for MCIT. Instead of selecting prompts from a fixed pool, DRAPE derives prompt queries from the textual instruction and cross-attends to visual patch features, producing query-image conditioned prompts that are prepended to the frozen LLM. To mitigate forgetting during sequential updates, DRAPE applies null-space gradient projection to the shared projector and uses CLIP-based prototype routing for task-label-free generator selection at inference. Extensive experiments on MCIT benchmarks show that DRAPE achieves state-of-the-art performance among representative prompt-based and LoRA-based continual-learning baselines.