Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences
Abstract:Recently, under the assumption of a noise-free input, the augmented complex least lncosh (ACLlncosh) method was introduced for power system frequency estimation and showed robust performance when impulsive noise contaminated the output signal. In practice, however, noise often contaminates the input signal as well, which drastically degrades the performance of the ACLlncosh method. To enhance robustness against noisy input and output while maintaining resilience to impulsive noise in the output signal, this paper proposes an online censoring-based widely linear total least lncosh (OC-WL-TLlnC) method. The method improves estimation performance under both balanced and unbalanced conditions, and it discards less informative data via online censoring, thereby reducing the computational burden. Furthermore, a variable-parameter approach is incorporated to accelerate convergence and improve steady-state accuracy, ensuring adaptability to dynamic power system conditions. By addressing the limitations of current techniques, the proposed method significantly improves frequency estimation performance and offers a computationally efficient, noise-resilient solution for real-time power system monitoring.
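As a rough illustration of why an lncosh-type cost resists impulsive noise, the sketch below evaluates the real-valued cost (1/λ) ln cosh(λe) and its derivative tanh(λe). It is not the OC-WL-TLlnC method itself: the widely linear, total least squares, and censoring parts are omitted, and the function names and the parameter `lam` (standing in for λ) are assumptions chosen for this example.

```python
import math

def lncosh_loss(e, lam=1.0):
    """(1/lam) * ln(cosh(lam * e)), computed without overflow for large |e|."""
    x = abs(lam * e)
    # Identity used for stability: ln(cosh(x)) = x + ln(1 + exp(-2x)) - ln(2)
    return (x + math.log1p(math.exp(-2.0 * x)) - math.log(2.0)) / lam

def lncosh_grad(e, lam=1.0):
    """Derivative of the loss with respect to e: tanh(lam * e), bounded in [-1, 1]."""
    return math.tanh(lam * e)

small, large = 0.01, 1000.0
print(lncosh_loss(small), 0.5 * small ** 2)       # close to quadratic for small errors
print(lncosh_loss(large), large - math.log(2.0))  # close to linear for large errors
print(lncosh_grad(large))                         # saturates, limiting the influence of outliers
```

Because the derivative saturates, a single impulsive sample can move a gradient-based update by at most a bounded amount, whereas a squared-error cost would scale the update with the size of the outlier.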
Abstract:Conventional Kalman filtering (KF) approaches exhibit significant limitations in nonlinear state estimation problems contaminated by non-Gaussian noise. To overcome these challenges, this work proposes a robust iterative square-root unscented Kalman filter based on the generalized correntropy-induced (GCI) framework (SR-GCI-IUKF). While sharing the ability of the maximum correntropy criterion (MCC) to characterize higher-order noise statistics, the proposed GCI framework is intrinsically insensitive to the kernel bandwidth, a critical advantage that enables robust adaptation to diverse complex noise environments through its generalized kernel structure. For nonlinear state estimation, the algorithm constructs a nonlinear error generalization model that dynamically corrects measurement-induced errors during the state update phase, thereby significantly enhancing estimation accuracy in strongly nonlinear regimes. Furthermore, the square-root decomposition ensures numerical robustness by preserving the positive definiteness of the covariance matrix throughout the recursion. Theoretical stability guarantees are established through rigorous error dynamics analysis, demonstrating bounded estimation variance under non-Gaussian disturbances. Finally, experiments on nonlinear systems, land vehicle navigation systems, and power system FASE compare the proposed algorithm with other robust algorithms and show that it is more robust.
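To make the idea of a generalized kernel concrete, the sketch below compares the Gaussian kernel used by the MCC with a generalized Gaussian kernel of the form exp(-|e/β|^α), a form commonly used for generalized correntropy. This is an assumption for illustration only: the exact GCI cost of the SR-GCI-IUKF is not given in the abstract, and the function names and the parameters `alpha` and `beta` are hypothetical.

```python
import math

def generalized_kernel(e, alpha=2.0, beta=1.0):
    """Unnormalized generalized Gaussian kernel exp(-|e / beta|**alpha)."""
    return math.exp(-abs(e / beta) ** alpha)

def gaussian_kernel(e, sigma=1.0):
    """Unnormalized Gaussian kernel, as used by the maximum correntropy criterion."""
    return math.exp(-e * e / (2.0 * sigma * sigma))

residuals = [0.1, 0.5, 1.0, 8.0]  # the last value plays the role of an outlier
for e in residuals:
    # With alpha = 2 and beta = sqrt(2) * sigma the two kernels coincide.
    print(e, generalized_kernel(e, alpha=2.0, beta=math.sqrt(2.0)), gaussian_kernel(e))
```

When such a kernel value is used as a weight on a measurement residual, large residuals receive weights near zero, and the shape parameter `alpha` gives an extra degree of freedom beyond the bandwidth.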
Abstract:Conventional centralized state estimators exhibit limited robustness in large-scale grids and face practical deployment hurdles. To overcome these challenges, this paper proposes a decentralized maximum generalized Student's t-kernel correntropy variational Bayesian unscented Kalman filter (D-MGST-VBUKF). The algorithm improves estimation performance at three levels to meet the needs of regionalized state estimation. First, to address non-Gaussian measurement noise in practical systems, the cost function is built on MGST, which retains the robustness of the Student's t kernel while improving adaptability to complex noise by expanding the degree-of-freedom parameter. Second, a VB inference framework is constructed to model the unknown noise distribution online, and the noise statistics and the state estimate are jointly optimized by constructing a conjugate prior distribution. Third, a regional state fusion mechanism is established based on the topological correlation of the power grid, and the local estimates are corrected for global consistency through state coordination equations for the boundary nodes. Simulation experiments on the IEEE 14-bus and IEEE 39-bus systems show that the method is more robust than traditional algorithms in non-Gaussian and unknown noise environments.
Abstract:Adaptive filtering in complex scenarios demands algorithms that combine fast convergence, low complexity, and robust performance under diverse noise conditions. To address this challenge, we propose an online censoring robust total generalized adaptive filter using an improved data-reuse method (RTGA-IDROC). The proposed RTGA variant combines the advantages of the total least squares (TLS) strategy and the robust generalized adaptive (RGA) function. The algorithm not only handles input noise effectively under the errors-in-variables (EIV) model but also achieves excellent performance across diverse noise environments. Furthermore, to meet the high demand for convergence speed in practical applications, an improved data reuse (IDR) method is introduced, enabling faster convergence in the early stages of iteration without compromising steady-state performance. The additional computational complexity introduced by the IDR method is mitigated by the online censoring (OC) strategy. We also modify the OC threshold for real-valued algorithms, as the original threshold was defined for the complex domain. Beyond these algorithmic enhancements, a local stability analysis of the proposed algorithm is provided, and the theoretical steady-state mean-square deviation (MSD) is derived. Finally, simulation experiments in system identification and acoustic echo cancellation (AEC) scenarios validate the superior performance of the proposed algorithm.
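The online censoring idea can be sketched with a plain real-valued LMS filter that skips the weight update whenever the error magnitude falls below a threshold. This is a minimal sketch under stated assumptions, not the RTGA-IDROC algorithm: the TLS cost, the RGA function, and the IDR step are omitted, and the fixed `threshold`, the step size `mu`, and the function name `censored_lms` are hypothetical choices for this example.

```python
import random

def censored_lms(samples, num_taps, mu=0.05, threshold=0.03):
    """LMS that skips the weight update when |error| <= threshold."""
    w = [0.0] * num_taps
    updates = 0
    for x, d in samples:
        e = d - sum(wi * xi for wi, xi in zip(w, x))
        if abs(e) <= threshold:
            continue  # censored: this sample carries little new information
        w = [wi + mu * e * xi for wi, xi in zip(w, x)]
        updates += 1
    return w, updates

# Synthetic system identification problem with a known two-tap system.
rng = random.Random(0)
w_true = [0.5, -0.3]
samples = []
for _ in range(2000):
    x = [rng.gauss(0.0, 1.0) for _ in w_true]
    d = sum(wi * xi for wi, xi in zip(w_true, x)) + rng.gauss(0.0, 0.01)
    samples.append((x, d))

w_hat, updates = censored_lms(samples, num_taps=len(w_true))
print(w_hat, updates, len(samples))
```

Once the filter is close to the true system, most errors fall below the threshold and are censored, so the number of weight updates is smaller than the number of samples. The trade-off is that the threshold also limits how closely the weights can approach the true values.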
Abstract:The recent surge in popularity of Nano-Banana and Seedream 4.0 underscores the community's strong interest in multi-image composition tasks. Compared with single-image editing, multi-image composition presents significantly greater challenges in consistency and quality, yet existing models have not disclosed specific methodological details for achieving high-quality fusion. Through statistical analysis, we identify Human-Object Interaction (HOI) as the category most sought after by the community. We therefore systematically analyze and implement a state-of-the-art solution for multi-image composition with a primary focus on HOI-centric tasks. We present Skywork UniPic 3.0, a unified multimodal framework that integrates single-image editing and multi-image composition. Our model supports an arbitrary number (1 to 6) and resolution of input images, as well as arbitrary output resolutions (within a total pixel budget of 1024x1024). To address the challenges of multi-image composition, we design a comprehensive data collection, filtering, and synthesis pipeline, achieving strong performance with only 700K high-quality training samples. Furthermore, we introduce a novel training paradigm that formulates multi-image composition as a sequence-modeling problem, transforming conditional generation into unified sequence synthesis. To accelerate inference, we integrate trajectory mapping and distribution matching into the post-training stage, enabling the model to produce high-fidelity samples in just 8 steps and achieve a 12.5x speedup over standard synthesis sampling. Skywork UniPic 3.0 achieves state-of-the-art performance on single-image editing benchmarks and surpasses both Nano-Banana and Seedream 4.0 on multi-image composition benchmarks, thereby validating the effectiveness of our data pipeline and training paradigm. Code, models, and the dataset are publicly available.
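One way to read the output constraint above (arbitrary resolution within a total pixel budget of 1024x1024) is sketched below: given a requested aspect ratio, pick a width and height whose product does not exceed the budget. The helper name `fit_to_pixel_budget` and the rounding of each side down to a multiple of 16 are assumptions for illustration; the abstract does not state how the model selects resolutions.

```python
import math

def fit_to_pixel_budget(aspect_w, aspect_h, budget=1024 * 1024, multiple=16):
    """Return (width, height) close to the requested aspect ratio with width * height <= budget.

    Rounding each side down to `multiple` is an assumption made for this sketch.
    """
    width = math.sqrt(budget * aspect_w / aspect_h)
    height = math.sqrt(budget * aspect_h / aspect_w)
    width = int(width // multiple) * multiple
    height = int(height // multiple) * multiple
    return width, height

for ratio in [(1, 1), (16, 9), (3, 4)]:
    w, h = fit_to_pixel_budget(*ratio)
    print(ratio, (w, h), w * h <= 1024 * 1024)
```

Rounding both sides down guarantees that the product stays within the budget, at the cost of a small deviation from the exact aspect ratio.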
Abstract:Building upon the mean p-power error (MPE) criterion, the normalized subband p-norm (NSPN) algorithm demonstrates superior robustness in $α$-stable noise environments ($1 < α \leq 2$) by effectively exploiting the low-order moments hidden in robust loss functions. Nevertheless, its performance degrades significantly when the input is noisy or when the additive noise is an $α$-stable process with $0 < α \leq 1$. To overcome these limitations, we propose a novel fractional-order NSPN (FoNSPN) algorithm that incorporates the fractional-order stochastic gradient descent (FoSGD) method into the MPE framework. In addition, this paper analyzes the convergence range of the step size and the theoretical range of the fractional order $β$, and establishes a theoretical steady-state mean square deviation (MSD) model. Simulations conducted in diverse impulsive noise environments confirm the superiority of the proposed FoNSPN algorithm over existing state-of-the-art algorithms.




Abstract:Recent advances in multimodal models have demonstrated impressive capabilities in unified image generation and editing. However, many prominent open-source models prioritize scaling model parameters over optimizing training strategies, limiting their efficiency and performance. In this work, we present UniPic2-SD3.5M-Kontext, a 2B-parameter DiT model based on SD3.5-Medium, which achieves state-of-the-art image generation and editing while extending seamlessly into a unified multimodal framework. Our approach begins with architectural modifications to SD3.5-Medium and large-scale pre-training on high-quality data, enabling joint text-to-image generation and editing capabilities. To enhance instruction following and editing consistency, we propose a novel Progressive Dual-Task Reinforcement (PDTR) strategy, which effectively strengthens both tasks in a staged manner. We empirically validate that the reinforcement phases for different tasks are mutually beneficial and do not induce negative interference. After pre-training and reinforcement, UniPic2-SD3.5M-Kontext demonstrates stronger image generation and editing capabilities than models with significantly more generation parameters, including BAGEL (7B) and Flux-Kontext (12B). Furthermore, following MetaQuery, we connect UniPic2-SD3.5M-Kontext and Qwen2.5-VL-7B via a connector and perform joint training to obtain a unified multimodal model, UniPic2-Metaquery. UniPic2-Metaquery integrates understanding, generation, and editing, achieving top-tier performance across diverse tasks with a simple and scalable training paradigm. These results consistently validate the effectiveness and generalizability of the proposed training paradigm, which we formalize as Skywork UniPic 2.0.




Abstract:We introduce Skywork-R1V3, an advanced, open-source vision-language model (VLM) that pioneers a new approach to visual reasoning. Its key innovation lies in effectively transferring reasoning skills from text-only Large Language Models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 primarily stems from our elaborate post-training RL framework, which effectively activates and enhances the model's reasoning ability without the need for additional continued pre-training. Through this framework, we further uncover the fundamental role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, we introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which has proven highly effective for checkpoint selection during RL training. Skywork-R1V3 achieves state-of-the-art results on MMMU, improving significantly from 64.3% to 76.0%. This performance matches entry-level human capabilities. Remarkably, our RL-powered post-training approach enables even the 38B-parameter model to rival top closed-source VLMs. The implementation successfully transfers mathematical reasoning to other subject-related reasoning tasks. We also include an analysis of curriculum learning and reinforcement fine-tuning strategies, along with a broader discussion of multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal reasoning, showcasing RL as a powerful engine for advancing open-source VLM capabilities.
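The entropy indicator mentioned above can be illustrated with the Shannon entropy of a model's next-token distribution, averaged over a chosen set of token positions. This is a minimal sketch under stated assumptions: how Skywork-R1V3 identifies "critical reasoning tokens" is not described in the abstract, so the `positions` argument and both function names are hypothetical.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one token position."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [v / total for v in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(logits_per_token, positions):
    """Average entropy over a chosen subset of token positions."""
    return sum(token_entropy(logits_per_token[i]) for i in positions) / len(positions)

logits_per_token = [
    [0.0, 0.0, 0.0, 0.0],   # uniform over four tokens: maximal uncertainty
    [50.0, 0.0, 0.0, 0.0],  # sharply peaked: almost no uncertainty
]
print(token_entropy(logits_per_token[0]), math.log(4))
print(token_entropy(logits_per_token[1]))
print(mean_entropy(logits_per_token, positions=[0, 1]))
```

Tracking such an average across checkpoints gives a single scalar per checkpoint, which is the kind of quantity that could be compared when selecting among them.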




Abstract:Vision-Language Models (VLMs) have demonstrated remarkable progress in multimodal understanding, yet their capabilities for scientific reasoning remain inadequately assessed. Current multimodal benchmarks predominantly evaluate generic image comprehension or text-driven reasoning, lacking authentic scientific contexts that require integrating domain-specific knowledge with visual evidence analysis. To fill this gap, we present CSVQA, a diagnostic multimodal benchmark specifically designed for evaluating scientific reasoning through domain-grounded visual question answering. Our benchmark features 1,378 carefully constructed question-answer pairs spanning diverse STEM disciplines, each demanding domain knowledge, integration of visual evidence, and higher-order reasoning. Compared to prior multimodal benchmarks, CSVQA places greater emphasis on real-world scientific content and complex reasoning. We additionally propose a rigorous evaluation protocol to systematically assess whether model predictions are substantiated by valid intermediate reasoning steps based on curated explanations. Our comprehensive evaluation of 15 VLMs on this benchmark reveals notable performance disparities, as even the top-ranked proprietary model attains only 49.6% accuracy. This empirical evidence underscores the pressing need for advancing scientific reasoning capabilities in VLMs. Our CSVQA is released at https://huggingface.co/datasets/Skywork/CSVQA.
Abstract:Recent advancements in image generative foundation models have prioritized quality improvements, but often at the cost of increased computational complexity and inference latency. To address this critical trade-off, we introduce HiDream-I1, a new open-source image generative foundation model with 17B parameters that achieves state-of-the-art image generation quality within seconds. HiDream-I1 is built on a new sparse Diffusion Transformer (DiT) structure. Specifically, it starts with a dual-stream decoupled design of sparse DiT with a dynamic Mixture-of-Experts (MoE) architecture, in which two separate encoders first process image and text tokens independently. Then, a single-stream sparse DiT structure with a dynamic MoE architecture is adopted to trigger multi-modal interaction for image generation in a cost-efficient manner. To support flexible accessibility with varied model capabilities, we provide HiDream-I1 in three variants: HiDream-I1-Full, HiDream-I1-Dev, and HiDream-I1-Fast. Furthermore, we go beyond typical text-to-image generation and remould HiDream-I1 with additional image conditions to perform precise, instruction-based editing on given images, yielding a new instruction-based image editing model named HiDream-E1. Ultimately, by integrating text-to-image generation and instruction-based image editing, HiDream-I1 evolves into a comprehensive image agent (HiDream-A1) capable of fully interactive image creation and refinement. To accelerate multi-modal AIGC research, we have open-sourced all the code and model weights of HiDream-I1-Full, HiDream-I1-Dev, HiDream-I1-Fast, and HiDream-E1 through our project websites: https://github.com/HiDream-ai/HiDream-I1 and https://github.com/HiDream-ai/HiDream-E1. All features can be directly experienced via https://vivago.ai/studio.