Abstract:Enabling object detectors to recognize out-of-distribution (OOD) objects is vital for building reliable systems. A primary obstacle stems from the fact that models frequently do not receive supervisory signals from unfamiliar data, leading to overly confident predictions regarding OOD objects. Despite previous progress that estimates OOD uncertainty based on the detection model and in-distribution (ID) samples, we explore using pre-trained vision-language representations for object-level OOD detection. We first discuss the limitations of applying image-level CLIP-based OOD detection methods to object-level scenarios. Building upon these insights, we propose RUNA, a novel framework that leverages a dual encoder architecture to capture rich contextual information and employs a regional uncertainty alignment mechanism to distinguish ID from OOD objects effectively. We introduce a few-shot fine-tuning approach that aligns region-level semantic representations to further improve the model's capability to discriminate between similar objects. Our experiments show that RUNA substantially surpasses state-of-the-art methods in object-level OOD detection, particularly in challenging scenarios with diverse and complex object instances.
Abstract:As object detectors are increasingly deployed as black-box cloud services or pre-trained models with restricted access to the original training data, the challenge of zero-shot object-level out-of-distribution (OOD) detection arises. This task becomes crucial in ensuring the reliability of detectors in open-world settings. While existing methods have demonstrated success in image-level OOD detection using pre-trained vision-language models like CLIP, directly applying such models to object-level OOD detection presents challenges due to the loss of contextual information and reliance on image-level alignment. To tackle these challenges, we introduce a new method that leverages visual prompts and text-augmented in-distribution (ID) space construction to adapt CLIP for zero-shot object-level OOD detection. Our method preserves critical contextual information and improves the ability to differentiate between ID and OOD objects, achieving competitive performance across different benchmarks.




Abstract:In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose the Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and to run inference in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni is comparable to the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni's performance in end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.
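The idea of a sliding window that restricts the receptive field can be illustrated with a block-level attention mask. The following is only a generic sketch, not the Qwen2.5-Omni implementation: the function name `sliding_window_mask` and the `lookback` and `lookahead` parameters are hypothetical, and the abstract does not state the actual window sizes.

```python
def sliding_window_mask(num_blocks, lookback, lookahead):
    """Return a boolean matrix where mask[i][j] is True when block i may attend to block j.

    Hypothetical helper: each block sees at most `lookback` earlier blocks and
    `lookahead` later blocks, so decoding block i never waits for the full sequence.
    """
    return [
        [-lookback <= j - i <= lookahead for j in range(num_blocks)]
        for i in range(num_blocks)
    ]


if __name__ == "__main__":
    # Illustrative window sizes only (assumption, not taken from the report).
    for row in sliding_window_mask(num_blocks=5, lookback=2, lookahead=1):
        print("".join("x" if allowed else "." for allowed in row))
```

Because each block depends on a bounded number of future blocks, the first audio chunk can be decoded as soon as that small lookahead is available, which is the intuition behind a reduced initial delay.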
Abstract:The ability to automatically encircle boundaries with mobile robots is crucial for tasks such as border tracking and object enclosing. Previous research has primarily focused on regular boundaries, often assuming that their geometric equations are known in advance, which is often not the case in practice. In this paper, we investigate a more general case and propose an algorithm that addresses geometric irregularities of boundaries without requiring prior knowledge of their analytical expressions. To achieve this, we develop a Fourier-based curve fitting method for boundary approximation using sampled points, enabling parametric characterization of general 2-D boundaries. This approach allows star-shaped boundaries to be fitted into polar-angle-based parametric curves, while boundaries of other shapes are handled through decomposition. Then, we design a vector field (VF) to achieve the encirclement of the parameterized boundary, wherein a polar radius error is introduced to measure the robot's "distance" to the boundary. The controller is finally synthesized using a control barrier function and quadratic programming to mediate some potentially conflicting specifications: boundary encirclement, obstacle avoidance, and limited actuation. In this manner, the VF-guided reference control not only guides the boundary encircling action, but can also be minimally modified to satisfy obstacle avoidance and input saturation constraints. Simulations and experiments are presented to verify the performance of our new method, which can be applied to mobile robots to perform practical tasks such as cleaning chemical spills and environment monitoring.
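A minimal sketch of how a star-shaped boundary might be fitted to a polar-angle-based Fourier curve from sampled points, and how a polar radius error could then be evaluated. Everything here is an assumption for illustration: the truncated series r(theta) = a0 + sum_k (a_k cos(k theta) + b_k sin(k theta)), the least-squares fit, the known interior `center`, the signed definition of the error, and all function names. The paper's actual formulation may differ.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_polar_fourier(points, center, order):
    """Least-squares fit of r(theta); returns [a0, a1, b1, ..., a_order, b_order]."""
    cx, cy = center
    rows, radii = [], []
    for x, y in points:
        theta = math.atan2(y - cy, x - cx)
        row = [1.0]
        for k in range(1, order + 1):
            row += [math.cos(k * theta), math.sin(k * theta)]
        rows.append(row)
        radii.append(math.hypot(x - cx, y - cy))
    n = 2 * order + 1
    # Normal equations (A^T A) c = A^T r.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    atr = [sum(r[i] * rad for r, rad in zip(rows, radii)) for i in range(n)]
    return solve_linear(ata, atr)


def radius_at(coeffs, theta):
    """Evaluate the fitted boundary radius at polar angle theta."""
    order = (len(coeffs) - 1) // 2
    r = coeffs[0]
    for k in range(1, order + 1):
        r += coeffs[2 * k - 1] * math.cos(k * theta) + coeffs[2 * k] * math.sin(k * theta)
    return r


def polar_radius_error(coeffs, center, position):
    """Signed gap between the robot's polar radius and the boundary radius at the same angle."""
    cx, cy = center
    x, y = position
    theta = math.atan2(y - cy, x - cx)
    return math.hypot(x - cx, y - cy) - radius_at(coeffs, theta)
```

Under this assumed sign convention, a positive error means the robot is outside the fitted boundary along its current polar ray, a negative error means it is inside, and zero means it is on the curve. The fit only makes sense for star-shaped boundaries relative to `center`, since each polar angle must correspond to a single radius.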




Abstract:Ocean forecasting is crucial for both scientific research and societal benefits. Currently, the most accurate forecasting systems are global ocean forecasting systems (GOFSs), which represent the ocean state variables (OSVs) as discrete grids and solve partial differential equations (PDEs) governing the transitions of oceanic state variables using numerical methods. However, GOFS processes are computationally expensive and prone to cumulative errors. Recently, large artificial intelligence (AI)-based models have significantly boosted forecasting speed and accuracy. Unfortunately, building a large AI ocean forecasting system that supports cross-spatiotemporal and air-sea coupled forecasting remains a significant challenge. Here, we introduce LangYa, a cross-spatiotemporal and air-sea coupled ocean forecasting system. Results demonstrate that the time embedding module in LangYa enables a single model to make forecasts with lead times ranging from 1 to 7 days. The air-sea coupled module effectively simulates air-sea interactions. The ocean self-attention module improves network stability and accelerates convergence during training, and the adaptive thermocline loss function improves the accuracy of thermocline forecasting. LangYa is trained on 27 years of global ocean data from the Global Ocean Reanalysis and Simulation version 12 (GLORYS12) and achieves more reliable deterministic forecasting results for OSVs than existing numerical and AI-based ocean forecasting systems. The LangYa forecasting system provides global ocean researchers with access to a powerful software tool for accurate ocean forecasting and opens a new paradigm for ocean science.




Abstract:In recent years, Artificial Intelligence Weather Prediction (AIWP) models have achieved performance comparable to, or even surpassing, traditional Numerical Weather Prediction (NWP) models by leveraging reanalysis data. However, a less-explored approach involves training AIWP models directly on observational data, enhancing computational efficiency and improving forecast accuracy by reducing the uncertainties introduced through data assimilation processes. In this study, we propose OMG-HD, a novel AI-based regional high-resolution weather forecasting model designed to make predictions directly from observational data sources, including surface stations, radar, and satellite, thereby removing the need for operational data assimilation. Our evaluation shows that OMG-HD outperforms both the European Centre for Medium-Range Weather Forecasts (ECMWF)'s high-resolution operational forecasting system, IFS-HRES, and the High-Resolution Rapid Refresh (HRRR) model at lead times of up to 12 hours across the contiguous United States (CONUS) region. We achieve up to a 13% improvement on RMSE for 2-meter temperature, 17% on 10-meter wind speed, 48% on 2-meter specific humidity, and 32% on surface pressure compared to HRRR. Our method shows that it is possible to use AI-driven approaches for rapid weather predictions without relying on NWP-derived weather fields as model input. This is a promising step towards using observational data directly to make operational forecasts with AIWP models.




Abstract:With the advancement of deepfake generation techniques, the importance of deepfake detection in protecting multimedia content integrity has become increasingly obvious. Recently, temporal inconsistency clues have been explored to improve the generalizability of deepfake video detection. According to our observation, the temporal artifacts of forged videos in terms of motion information usually exhibit quite distinct inconsistency patterns along horizontal and vertical directions, which could be leveraged to improve the generalizability of detectors. In this paper, a transformer-based framework for Diffusion Learning of Inconsistency Pattern (DIP) is proposed, which exploits directional inconsistencies for deepfake video detection. Specifically, DIP begins with a spatiotemporal encoder to represent spatiotemporal information. A directional inconsistency decoder is adopted accordingly, where direction-aware attention and inconsistency diffusion are incorporated to explore potential inconsistency patterns and jointly learn the inherent relationships. In addition, the SpatioTemporal Invariant Loss (STI Loss) is introduced to contrast spatiotemporally augmented sample pairs and prevent the model from overfitting nonessential forgery artifacts. Extensive experiments on several public datasets demonstrate that our method can effectively identify directional forgery clues and achieve state-of-the-art performance.




Abstract:ICD (International Classification of Diseases) coding involves assigning ICD codes to patient visits based on their medical notes. Considering ICD coding as a multi-label text classification task, researchers have developed sophisticated methods. Despite progress, these models often suffer from label imbalance and may develop spurious correlations with demographic factors. Additionally, while human coders assign ICD codes, the inclusion of irrelevant information from unrelated experts introduces biases. To combat these issues, we propose a novel method to mitigate Demographic and Expert biases in ICD coding through Causal Inference (DECI). We provide a novel causality-based interpretation of ICD coding in which models make predictions through three distinct pathways. Based on counterfactual reasoning, DECI mitigates demographic and expert biases. Experimental results show that DECI outperforms state-of-the-art models, offering a significant advancement in accurate and unbiased ICD coding.




Abstract:Despite the progress made by multimodal large language models (MLLMs) in computational pathology, they remain limited by a predominant focus on patch-level analysis, missing essential contextual information at the whole-slide level. The lack of large-scale instruction datasets and the gigapixel scale of whole slide images (WSIs) pose significant developmental challenges. In this paper, we present SlideChat, the first vision-language assistant capable of understanding gigapixel whole-slide images, exhibiting excellent multimodal conversational capability and responding to complex instructions across diverse pathology scenarios. To support its development, we created SlideInstruction, the largest instruction-following dataset for WSIs, consisting of 4.2K WSI captions and 176K VQA pairs with multiple categories. Furthermore, we propose SlideBench, a multimodal benchmark that incorporates captioning and VQA tasks to assess SlideChat's capabilities in varied clinical settings such as microscopy and diagnosis. Compared to both general and specialized MLLMs, SlideChat exhibits exceptional capabilities, achieving state-of-the-art performance on 18 of 22 tasks. For example, it achieved an overall accuracy of 81.17% on SlideBench-VQA (TCGA), and 54.15% on SlideBench-VQA (BCNB). We will fully release SlideChat, SlideInstruction and SlideBench as open-source resources to facilitate research and development in computational pathology.




Abstract:Current methods of building LLMs with voice interaction capabilities rely heavily on explicit text autoregressive generation before or during speech response generation to maintain content quality, which unfortunately brings computational overhead and increases latency in multi-turn interactions. To address this, we introduce IntrinsicVoice, an LLM designed with intrinsic real-time voice interaction capabilities. IntrinsicVoice aims to facilitate the transfer of textual capabilities of pre-trained LLMs to the speech modality by mitigating the modality gap between text and speech. Our novel architecture, GroupFormer, can reduce speech sequences to lengths comparable to text sequences while generating high-quality audio, significantly reducing the length difference between speech and text, speeding up inference, and alleviating long-text modeling issues. Additionally, we construct a multi-turn speech-to-speech dialogue dataset named IntrinsicVoice-500k, which includes nearly 500k turns of speech-to-speech dialogues, and a cross-modality training strategy to enhance the semantic alignment between speech and text. Experimental results demonstrate that IntrinsicVoice can generate high-quality speech responses with latency lower than 100ms in multi-turn dialogue scenarios. Demos are available at https://instrinsicvoice.github.io/.