Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yiqin Wang

FoodMonitor: Benchmarking MLLMs for Explainable Compliance Analysis

May 23, 2026

Ruihao Xu, Xingming Shui, Jingxuan Niu, Yiqin Wang, Jilin Yu, Haoji Zhang, Yansong Tang

Abstract:As AI-powered compliance monitoring becomes increasingly important in public governance and industrial safety, the ability to provide verifiable evidence and traceable accountability signals is essential. However, existing video anomaly detection datasets focus on event-level binary classification, lacking the rule-driven, explainable analysis required for real-world compliance scenarios. We introduce FoodMonitor, a benchmark for explainable compliance analysis in commercial kitchen surveillance. FoodMonitor comprises 477 video clips with 3,307 violation annotations across a dual-channel design covering both person-level and environment-level violations. Each annotation specifies which rule was violated, what non-compliant behavior occurred, and who committed it with frame-level bounding boxes. We establish a unified evaluation protocol with a two-stage matching mechanism that separately assesses spatial localization and semantic understanding, along with a composite metric ($C_{\text{score}}$) that balances environment and person detection performance. Systematic evaluation of several state-of-the-art multimodal large language models reveals that the best-performing model achieves only 0.360 $C_{\text{score}}$, with spatial localization and fine-grained rule understanding emerging as the primary bottlenecks. Our analysis identifies two distinct failure modes: localization-dominated errors and semantics-dominated errors, providing diagnostic insights for future model development.

Via

Access Paper or Ask Questions

Robust Spatiotemporal Motion Planning for Multi-Agent Autonomous Racing via Topological Gap Identification and Accelerated MPC

Mar 10, 2026

Mingyi Zhang, Cheng Hu, Yiqin Wang, Haotong Qin, Hongye Su, Lei Xie

Abstract:High-speed multi-agent autonomous racing demands robust spatiotemporal planning and precise control under strict computational limits. Current methods often oversimplify interactions or abandon strict kinematic constraints. We resolve this by proposing a Topological Gap Identification and Accelerated MPC framework. By predicting opponent behaviors via SGPs, our method constructs dynamic occupancy corridors to robustly select optimal overtaking gaps. We ensure strict kinematic feasibility using a Linear Time-Varying MPC powered by a customized Pseudo-Transient Continuation (PTC) solver for high-frequency execution. Experimental results on the F1TENTH platform show that our method significantly outperforms state-of-the-art baselines: it reduces total maneuver time by 51.6% in sequential scenarios, consistently maintains an overtaking success rate exceeding 81% in dense bottlenecks, and lowers average computational latency by 20.3%, pushing the boundaries of safe and high-speed autonomous racing.

Via

Access Paper or Ask Questions

Vision-Augmented On-Track System Identification for Autonomous Racing via Attention-Based Priors and Iterative Neural Correction

Mar 10, 2026

Zhiping Wu, Cheng Hu, Yiqin Wang, Lei Xie, Hongye Su

Abstract:Operating autonomous vehicles at the absolute limits of handling requires precise, real-time identification of highly non-linear tire dynamics. However, traditional online optimization methods suffer from "cold-start" initialization failures and struggle to model high-frequency transient dynamics. To address these bottlenecks, this paper proposes a novel vision-augmented, iterative system identification framework. First, a lightweight CNN (MobileNetV3) translates visual road textures into a continuous heuristic friction prior, providing a robust "warm-start" for parameter optimization. Next, a S4 model captures complex temporal dynamic residuals, circumventing the memory and latency limitations of traditional MLPs and RNNs. Finally, a derivative-free Nelder-Mead algorithm iteratively extracts physically interpretable Pacejka tire parameters via a hybrid virtual simulation. Co-simulation in CarSim demonstrates that the lightweight vision backbone reduces friction estimation error by 76.1 using 85 fewer FLOPs, accelerating cold-start convergence by 71.4. Furthermore, the S4-augmented framework improves parameter extraction accuracy and decreases lateral force RMSE by over 60 by effectively capturing complex vehicle dynamics, demonstrating superior performance compared to conventional neural architectures.

Via

Access Paper or Ask Questions

BrainRVQ: A High-Fidelity EEG Foundation Model via Dual-Domain Residual Quantization and Hierarchical Autoregression

Feb 18, 2026

Mingzhe Cui, Tao Chen, Yang Jiao, Yiqin Wang, Lei Xie, Yi Pan, Luca Mainardi

Abstract:Developing foundation models for electroencephalography (EEG) remains challenging due to the signal's low signal-to-noise ratio and complex spectro-temporal non-stationarity. Existing approaches often overlook the hierarchical latent structure inherent in neural dynamics, leading to suboptimal reconstruction of fine-grained information. In this work, we propose BrainRVQ, a general-purpose EEG foundation model pre-trained on a large-scale corpus of clinical EEG data. Unlike standard masked modeling, BrainRVQ features a Dual-Domain Residual Vector Quantization (DD-RVQ) tokenizer that disentangles temporal waveforms and spectral patterns into hierarchical discrete codes. We further introduce a hierarchical autoregressive pre-training objective that learns to reconstruct these codes in a coarse-to-fine manner, utilizing an importance-guided curriculum masking strategy to prioritize information-rich neural events over background noise. Extensive experiments across 8 diverse downstream datasets demonstrate that BrainRVQ consistently outperforms state-of-the-art baselines, validating its effectiveness in learning robust and generalizable neural representations. Our code and model weights are available:https://github.com/keqicmz/BrainRVQ

Via

Access Paper or Ask Questions

Kimi K2.5: Visual Agentic Intelligence

Feb 02, 2026

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen(+315 more)

Abstract:We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to $4.5\times$ over single-agent baselines. We release the post-trained Kimi K2.5 model checkpoint to facilitate future research and real-world applications of agentic intelligence.

* Kimi K2.5 tech report

Via

Access Paper or Ask Questions

Kimi K2: Open Agentic Intelligence

Jul 28, 2025

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen(+159 more)

Figure 1 for Kimi K2: Open Agentic Intelligence

Figure 2 for Kimi K2: Open Agentic Intelligence

Figure 3 for Kimi K2: Open Agentic Intelligence

Figure 4 for Kimi K2: Open Agentic Intelligence

Abstract:We introduce Kimi K2, a Mixture-of-Experts (MoE) large language model with 32 billion activated parameters and 1 trillion total parameters. We propose the MuonClip optimizer, which improves upon Muon with a novel QK-clip technique to address training instability while enjoying the advanced token efficiency of Muon. Based on MuonClip, K2 was pre-trained on 15.5 trillion tokens with zero loss spike. During post-training, K2 undergoes a multi-stage post-training process, highlighted by a large-scale agentic data synthesis pipeline and a joint reinforcement learning (RL) stage, where the model improves its capabilities through interactions with real and synthetic environments. Kimi K2 achieves state-of-the-art performance among open-source non-thinking models, with strengths in agentic capabilities. Notably, K2 obtains 66.1 on Tau2-Bench, 76.5 on ACEBench (En), 65.8 on SWE-Bench Verified, and 47.3 on SWE-Bench Multilingual -- surpassing most open and closed-sourced baselines in non-thinking settings. It also exhibits strong capabilities in coding, mathematics, and reasoning tasks, with a score of 53.7 on LiveCodeBench v6, 49.5 on AIME 2025, 75.1 on GPQA-Diamond, and 27.1 on OJBench, all without extended thinking. These results position Kimi K2 as one of the most capable open-source large language models to date, particularly in software engineering and agentic tasks. We release our base and post-trained model checkpoints to facilitate future research and applications of agentic intelligence.

* tech report of Kimi K2

Via

Access Paper or Ask Questions

A Geography-Inspired and Self-Adaptive Clustering Algorithm: A Study in Channel Measurement

Apr 29, 2025

Yiqin Wang, Chong Han

Figure 1 for A Geography-Inspired and Self-Adaptive Clustering Algorithm: A Study in Channel Measurement

Figure 2 for A Geography-Inspired and Self-Adaptive Clustering Algorithm: A Study in Channel Measurement

Figure 3 for A Geography-Inspired and Self-Adaptive Clustering Algorithm: A Study in Channel Measurement

Figure 4 for A Geography-Inspired and Self-Adaptive Clustering Algorithm: A Study in Channel Measurement

Abstract:The phenomenon that multi-path components (MPCs) arrive in clusters has been verified by channel measurements, and is widely adopted by cluster-based channel models. As a crucial intermediate processing step, MPC clustering bridges raw data in channel measurement and cluster characteristics for channel modeling. In this paper, a physical-interpretable and self-adaptive MPC clustering algorithm is proposed, which can locate both single-point and wide-spread scatterers without prior knowledge. Inspired by the concept in geography, a novel metaphor that interprets features of MPC attributes in the power-delay-angle profile (PDAP) as topographic concepts is developed. In light of the interpretation, the proposed algorithm disassembles the PDAP by constructing contour lines and identifying characteristic points that indicate the skeleton of MPC clusters, which are fitted by analytical models that associate MPCs with physical scatterer locations. Besides, a new clustering performance index, the power gradient consistency index, is proposed. Calculated as the weighted Spearman correlation coefficient between the power and the distance to the center, the index captures the intrinsic property of MPC clusters that the dominant high-power path is surrounded by lower-power paths. The performance of the proposed algorithm is analyzed and compared with the counterparts of conventional clustering algorithms based on the channel measurement conducted in an outdoor scenario. The proposed algorithm performs better in average Silhouette index and weighted Spearman correlation coefficient, and the average root mean square error (RMSE) of the estimated scatterer location is 0.1 m.

Via

Access Paper or Ask Questions

Kimi-VL Technical Report

Apr 10, 2025

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei(+82 more)

Abstract:We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameters, setting a new standard for efficient multimodal thinking models. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.

Via

Access Paper or Ask Questions

Ponder & Press: Advancing Visual GUI Agent towards General Computer Control

Dec 02, 2024

Yiqin Wang, Haoji Zhang, Jingqi Tian, Yansong Tang

Abstract:Most existing GUI agents typically depend on non-vision inputs like HTML source code or accessibility trees, limiting their flexibility across diverse software environments and platforms. Current multimodal large language models (MLLMs), which excel at using vision to ground real-world objects, offer a potential alternative. However, they often struggle with accurately localizing GUI elements -- a critical requirement for effective GUI automation -- due to the semantic gap between real-world objects and GUI elements. In this work, we introduce Ponder & Press, a divide-and-conquer framework for general computer control using only visual input. Our approach combines an general-purpose MLLM as an 'interpreter', responsible for translating high-level user instructions into detailed action descriptions, with a GUI-specific MLLM as a 'locator' that precisely locates GUI elements for action placement. By leveraging a purely visual input, our agent offers a versatile, human-like interaction paradigm applicable to a wide range of applications. Ponder & Press locator outperforms existing models by +22.5% on the ScreenSpot GUI grounding benchmark. Both offline and interactive agent benchmarks across various GUI environments -- including web pages, desktop software, and mobile UIs -- demonstrate that Ponder & Press framework achieves state-of-the-art performance, highlighting the potential of visual GUI agents. Refer to the project homepage https://invinciblewyq.github.io/ponder-press-page/

Via

Access Paper or Ask Questions

Movable Antennas: Channel Measurement, Modeling, and Performance Evaluation

Sep 05, 2024

Yiqin Wang, Heyin Shen, Chong Han, Meixia Tao

Abstract:Since decades ago, multi-antenna has become a key enabling technology in the evolution of wireless communication systems. In contrast to conventional multi-antenna systems that contain antennas at fixed positions, position-flexible antenna systems have been proposed to fully utilize the spatial variation of wireless channels. In this paper, movable antenna (MA) systems are analyzed from channel measurement, modeling, position optimization to performance evaluation. First, a broadband channel measurement system with physical MAs is developed, for which the extremely high movable resolution reaches 0.02 mm. A practical two-ray model is constructed based on the channel measurement for a two-dimensional movable antenna system across 32$\times$32 planar port positions at 300 GHz. In light of the measurement results, spatial-correlated channel models for the two-dimensional MA system are proposed, which are statistically parameterized by the covariance matrix of measured channels. Finally, the signal-to-interference-and-noise ratio (SINR)-maximized position selection algorithm is proposed, which achieves 99% of the optimal performance. The performance of different MA systems in terms of spectral efficiency are evaluated and compared for both planar and linear MA systems. Extensive results demonstrate the advantage of MAs over fixed-position antennas in coping with the multi-path fading and improving the spectral efficiency by 10% in a 300 GHz measured channel.

* 12 pages, 31 figures

Via

Access Paper or Ask Questions