Abstract: Graphical User Interface (GUI) agents show great potential for enabling foundation models to complete real-world tasks, revolutionizing human-computer interaction and improving human productivity. In this report, we present OmegaUse, a general-purpose GUI agent model for autonomous task execution on both mobile and desktop platforms, covering phone-use and computer-use scenarios. Building an effective GUI agent model depends on two factors: (1) high-quality data and (2) effective training methods. To address both, we introduce a carefully engineered data-construction pipeline and a decoupled training paradigm. For data construction, we combine rigorously curated open-source datasets with a novel automated synthesis framework that integrates bottom-up autonomous exploration with top-down taxonomy-guided generation to create high-fidelity synthetic data. For training, we adopt a two-stage strategy to make full use of these data: Supervised Fine-Tuning (SFT) to establish fundamental interaction syntax, followed by Group Relative Policy Optimization (GRPO) to improve spatial grounding and sequential planning. To balance computational efficiency with agentic reasoning capacity, OmegaUse is built on a Mixture-of-Experts (MoE) backbone. To evaluate cross-terminal capabilities in an offline setting, we introduce OS-Nav, a benchmark suite spanning multiple operating systems: ChiM-Nav, targeting Chinese Android mobile environments, and Ubu-Nav, focusing on routine desktop interactions on Ubuntu. Extensive experiments show that OmegaUse is highly competitive across established GUI benchmarks, achieving a state-of-the-art (SOTA) score of 96.3% on ScreenSpot-V2 and a leading 79.1% step success rate on AndroidControl. OmegaUse also performs strongly on OS-Nav, reaching a 74.24% step success rate on ChiM-Nav and a 55.9% average success rate on Ubu-Nav.
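To make the second training stage concrete, the minimal sketch below illustrates the group-relative advantage computation at the heart of GRPO as it is commonly described. It is not OmegaUse's training code; the reward definition, group size, and helper names are assumptions introduced purely for illustration.

```python
# Illustrative sketch only: for each task, a group of rollouts is sampled,
# scored with a task reward, and the rewards are normalized within the group
# to obtain advantages. All specifics here (the reward rule, group size of 4)
# are hypothetical placeholders, not OmegaUse's actual setup.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize rewards within one group of rollouts for the same task."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy example: 4 rollouts for one GUI task, rewarded 1.0 if the predicted
# click lands inside the target element's bounding box, else 0.0.
rewards = np.array([1.0, 0.0, 1.0, 0.0])
advantages = group_relative_advantages(rewards)
print(advantages)  # positive for successful rollouts, negative otherwise
# These advantages would weight a clipped policy-gradient loss over action
# tokens produced by the SFT-initialized model in the second stage.
```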
Abstract: Line differential microphone arrays have attracted attention for their ability to achieve frequency-invariant beampatterns and high directivity. Recently, the Jacobi-Anger expansion-based approach has enabled the design of fully steerable-invariant differential beamformers for line arrays combining omnidirectional and directional microphones. However, this approach relies on an analytical expression of the ideal beampattern and a proper choice of truncation order, which is not always practical. This paper introduces a null-constraint-based method for designing frequency- and steerable-invariant differential beamformers using a line array of omnidirectional and directional microphones. The approach employs a multi-constraint optimisation framework, in which the reference filter and the ideal beampattern are first determined from the specified nulls and the desired direction. The white noise gain constraint is then derived from the reference filter, and the beampattern constraint from the ideal beampattern. The optimal filter is obtained by jointly enforcing the beampattern, null, and white noise gain constraints. This method balances white noise gain against mean square error, yielding robust, frequency- and steerable-invariant differential beamforming performance. It addresses limitations in beampattern flexibility and truncation errors, offering greater design freedom and improved practical applicability. Simulations and experiments demonstrate that the method outperforms the Jacobi-Anger expansion-based approach in three key aspects: an extended effective range, improved main lobe and null alignment, and greater flexibility in microphone array configuration and beampattern design, requiring only the steering direction and nulls rather than an analytic beampattern expression.
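As a rough illustration of null-constrained design (not the paper's algorithm), the sketch below builds a constraint matrix from a steering direction and specified nulls for a uniform line array of omnidirectional microphones at a single frequency, and takes the minimum-norm filter satisfying those constraints, which maximizes white noise gain under them. The array geometry, the single frequency, and the omission of directional elements and of the explicit beampattern/WNG trade-off are simplifying assumptions.

```python
# Illustrative sketch only, under the simplifying assumptions stated above.
import numpy as np

def steering_vector(theta, n_mics=4, spacing=0.01, freq=1000.0, c=343.0):
    """Far-field steering vector of a uniform line array for angle theta (rad from endfire)."""
    m = np.arange(n_mics)
    return np.exp(-1j * 2 * np.pi * freq / c * m * spacing * np.cos(theta))

look_dir = 0.0                      # desired steering direction (endfire)
nulls = [np.pi / 2, np.pi]          # specified null directions
C = np.column_stack([steering_vector(a) for a in [look_dir] + nulls])
b = np.array([1.0, 0.0, 0.0])       # distortionless at look_dir, zeros at the nulls

# Minimum-norm filter satisfying C^H h = b (maximizes white noise gain under
# these hard constraints); the small inter-element spacing makes C^H C
# ill-conditioned, which is exactly why a WNG constraint matters in practice.
h = C @ np.linalg.solve(C.conj().T @ C, b)

# Beampattern B(theta) = h^H d(theta): ~1 at the look direction, ~0 at the nulls.
print(abs(h.conj() @ steering_vector(look_dir)))
print(abs(h.conj() @ steering_vector(nulls[0])))
```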

Abstract: We present MAGI-1, a world model that generates videos by autoregressively predicting a sequence of video chunks, defined as fixed-length segments of consecutive frames. Trained to denoise per-chunk noise that increases monotonically over time, MAGI-1 enables causal temporal modeling and naturally supports streaming generation. It achieves strong performance on image-to-video (I2V) tasks conditioned on text instructions, providing high temporal consistency and scalability, which are made possible by several algorithmic innovations and a dedicated infrastructure stack. MAGI-1 facilitates controllable generation via chunk-wise prompting and supports real-time, memory-efficient deployment by maintaining constant peak inference cost, regardless of video length. The largest variant of MAGI-1 comprises 24 billion parameters and supports context lengths of up to 4 million tokens, demonstrating the scalability and robustness of our approach. The code and models are available at https://github.com/SandAI-org/MAGI-1 and https://github.com/SandAI-org/MagiAttention. The product can be accessed at https://sand.ai.
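The sketch below is a hypothetical illustration of the chunk-wise streaming pattern the abstract describes: chunks are denoised in a pipeline in which newer chunks sit at higher noise levels than older ones, so the oldest chunk finishes first and can be emitted while later chunks are still being denoised. The denoiser, chunk shape, and schedule are placeholders, not MAGI-1's released interface.

```python
# Illustrative sketch only: the names below (denoise_fn, CHUNK_FRAMES,
# NUM_STEPS) are hypothetical and the "denoiser" is a dummy update.
import numpy as np

CHUNK_FRAMES, H, W, C = 6, 8, 8, 3   # tiny shapes so the sketch runs fast
NUM_CHUNKS = 5
NUM_STEPS = 4                        # denoising steps each chunk goes through

def denoise_fn(chunk, clean_context, sigma):
    """Hypothetical one-step denoiser conditioned on already-finished chunks."""
    # A real model would attend causally over clean_context; here we just
    # shrink the noise to keep the sketch runnable.
    return chunk * (sigma / (sigma + 1.0))

def streaming_generate():
    active = []                                    # list of [chunk, steps_done]
    finished = []                                  # fully denoised chunks
    emitted = 0
    while emitted < NUM_CHUNKS:
        if len(active) + emitted < NUM_CHUNKS:
            # New chunks enter at the highest noise level, so noise increases
            # monotonically from the oldest active chunk to the newest.
            active.append([np.random.randn(CHUNK_FRAMES, H, W, C), 0])
        for entry in active:
            sigma = 1.0 - entry[1] / NUM_STEPS     # older chunks -> lower noise
            entry[0] = denoise_fn(entry[0], finished, sigma)
            entry[1] += 1
        if active and active[0][1] == NUM_STEPS:   # oldest chunk is clean
            chunk, _ = active.pop(0)
            finished.append(chunk)
            emitted += 1
            yield chunk                            # stream it out immediately

for i, chunk in enumerate(streaming_generate()):
    print(f"streamed chunk {i}: shape {chunk.shape}")
```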