Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yang Xu

Linda

DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search

Oct 14, 2025

Kartik Narayan, Yang Xu, Tian Cao, Kavya Nerella, Vishal M. Patel, Navid Shiee, Peter Grasch, Chao Jia, Yinfei Yang, Zhe Gan

Abstract:Multimodal Large Language Models (MLLMs) in real-world applications require access to external knowledge sources and must remain responsive to the dynamic and ever-changing real-world information in order to address information-seeking and knowledge-intensive user queries. Existing approaches, such as retrieval augmented generation (RAG) methods, search agents, and search equipped MLLMs, often suffer from rigid pipelines, excessive search calls, and poorly constructed search queries, which result in inefficiencies and suboptimal outcomes. To address these limitations, we present DeepMMSearch-R1, the first multimodal LLM capable of performing on-demand, multi-turn web searches and dynamically crafting queries for both image and text search tools. Specifically, DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image making the image search more effective, and can iteratively adapt text search queries based on retrieved information, thereby enabling self-reflection and self-correction. Our approach relies on a two-stage training pipeline: a cold start supervised finetuning phase followed by an online reinforcement learning optimization. For training, we introduce DeepMMSearchVQA, a novel multimodal VQA dataset created through an automated pipeline intermixed with real-world information from web search tools. This dataset contains diverse, multi-hop queries that integrate textual and visual information, teaching the model when to search, what to search for, which search tool to use and how to reason over the retrieved information. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach. Finally, we analyze the results and provide insights that are valuable for advancing multimodal web-search.

Via

Access Paper or Ask Questions

Memory-Efficient 2D/3D Shape Assembly of Robot Swarms

Sep 30, 2025

Shuoyu Yue, Pengpeng Li, Yang Xu, Kunrui Ze, Xingjian Long, Huazi Cao, Guibin Sun

Figure 1 for Memory-Efficient 2D/3D Shape Assembly of Robot Swarms

Figure 2 for Memory-Efficient 2D/3D Shape Assembly of Robot Swarms

Figure 3 for Memory-Efficient 2D/3D Shape Assembly of Robot Swarms

Figure 4 for Memory-Efficient 2D/3D Shape Assembly of Robot Swarms

Abstract:Mean-shift-based approaches have recently emerged as the most effective methods for robot swarm shape assembly tasks. These methods rely on image-based representations of target shapes to compute local density gradients and perform mean-shift exploration, which constitute their core mechanism. However, such image representations incur substantial memory overhead, which can become prohibitive for high-resolution or 3D shapes. To overcome this limitation, we propose a memory-efficient tree map representation that hierarchically encodes user-specified shapes and is applicable to both 2D and 3D scenarios. Building on this representation, we design a behavior-based distributed controller that enables assignment-free shape assembly. Comparative 2D and 3D simulations against a state-of-the-art mean-shift algorithm demonstrate one to two orders of magnitude lower memory usage and two to three times faster shape entry while maintaining comparable uniformity. Finally, we validate the framework through physical experiments with 6 to 7 UAVs, confirming its real-world practicality.

Via

Access Paper or Ask Questions

Structure-Preserving Margin Distribution Learning for High-Order Tensor Data with Low-Rank Decomposition

Sep 18, 2025

Yang Xu, Junpeng Li, Changchun Hua, Yana Yang

Abstract:The Large Margin Distribution Machine (LMDM) is a recent advancement in classifier design that optimizes not just the minimum margin (as in SVM) but the entire margin distribution, thereby improving generalization. However, existing LMDM formulations are limited to vectorized inputs and struggle with high-dimensional tensor data due to the need for flattening, which destroys the data's inherent multi-mode structure and increases computational burden. In this paper, we propose a Structure-Preserving Margin Distribution Learning for High-Order Tensor Data with Low-Rank Decomposition (SPMD-LRT) that operates directly on tensor representations without vectorization. The SPMD-LRT preserves multi-dimensional spatial structure by incorporating first-order and second-order tensor statistics (margin mean and variance) into the objective, and it leverages low-rank tensor decomposition techniques including rank-1(CP), higher-rank CP, and Tucker decomposition to parameterize the weight tensor. An alternating optimization (double-gradient descent) algorithm is developed to efficiently solve the SPMD-LRT, iteratively updating factor matrices and core tensor. This approach enables SPMD-LRT to maintain the structural information of high-order data while optimizing margin distribution for improved classification. Extensive experiments on diverse datasets (including MNIST, images and fMRI neuroimaging) demonstrate that SPMD-LRT achieves superior classification accuracy compared to conventional SVM, vector-based LMDM, and prior tensor-based SVM extensions (Support Tensor Machines and Support Tucker Machines). Notably, SPMD-LRT with Tucker decomposition attains the highest accuracy, highlighting the benefit of structure preservation. These results confirm the effectiveness and robustness of SPMD-LRT in handling high-dimensional tensor data for classification.

Via

Access Paper or Ask Questions

Contrastive Multi-View Graph Hashing

Aug 17, 2025

Yang Xu, Zuliang Yang, Kai Ming Ting

Abstract:Multi-view graph data, which both captures node attributes and rich relational information from diverse sources, is becoming increasingly prevalent in various domains. The effective and efficient retrieval of such data is an important task. Although multi-view hashing techniques have offered a paradigm for fusing diverse information into compact binary codes, they typically assume attributes-based inputs per view. This makes them unsuitable for multi-view graph data, where effectively encoding and fusing complex topological information from multiple heterogeneous graph views to generate unified binary embeddings remains a significant challenge. In this work, we propose Contrastive Multi-view Graph Hashing (CMGHash), a novel end-to-end framework designed to learn unified and discriminative binary embeddings from multi-view graph data. CMGHash learns a consensus node representation space using a contrastive multi-view graph loss, which aims to pull $k$-nearest neighbors from all graphs closer while pushing away negative pairs, i.e., non-neighbor nodes. Moreover, we impose binarization constraints on this consensus space, enabling its conversion to a corresponding binary embedding space at minimal cost. Extensive experiments on several benchmark datasets demonstrate that CMGHash significantly outperforms existing approaches in terms of retrieval accuracy.

Via

Access Paper or Ask Questions

FLUX-Makeup: High-Fidelity, Identity-Consistent, and Robust Makeup Transfer via Diffusion Transformer

Aug 07, 2025

Jian Zhu, Shanyuan Liu, Liuzhuozheng Li, Yue Gong, He Wang, Bo Cheng, Yuhang Ma, Liebucha Wu, Xiaoyu Wu, Dawei Leng(+2 more)

Abstract:Makeup transfer aims to apply the makeup style from a reference face to a target face and has been increasingly adopted in practical applications. Existing GAN-based approaches typically rely on carefully designed loss functions to balance transfer quality and facial identity consistency, while diffusion-based methods often depend on additional face-control modules or algorithms to preserve identity. However, these auxiliary components tend to introduce extra errors, leading to suboptimal transfer results. To overcome these limitations, we propose FLUX-Makeup, a high-fidelity, identity-consistent, and robust makeup transfer framework that eliminates the need for any auxiliary face-control components. Instead, our method directly leverages source-reference image pairs to achieve superior transfer performance. Specifically, we build our framework upon FLUX-Kontext, using the source image as its native conditional input. Furthermore, we introduce RefLoRAInjector, a lightweight makeup feature injector that decouples the reference pathway from the backbone, enabling efficient and comprehensive extraction of makeup-related information. In parallel, we design a robust and scalable data generation pipeline to provide more accurate supervision during training. The paired makeup datasets produced by this pipeline significantly surpass the quality of all existing datasets. Extensive experiments demonstrate that FLUX-Makeup achieves state-of-the-art performance, exhibiting strong robustness across diverse scenarios.

Via

Access Paper or Ask Questions

Distributed Oscillatory Guidance for Formation Flight of Fixed-Wing Drones

Jul 22, 2025

Yang Xu, Jesús Bautista, José Hinojosa, Héctor García de Marina

Abstract:The autonomous formation flight of fixed-wing drones is hard when the coordination requires the actuation over their speeds since they are critically bounded and aircraft are mostly designed to fly at a nominal airspeed. This paper proposes an algorithm to achieve formation flights of fixed-wing drones without requiring any actuation over their speed. In particular, we guide all the drones to travel over specific paths, e.g., parallel straight lines, and we superpose an oscillatory behavior onto the guiding vector field that drives the drones to the paths. This oscillation enables control over the average velocity along the path, thereby facilitating inter-drone coordination. Each drone adjusts its oscillation amplitude distributively in a closed-loop manner by communicating with neighboring agents in an undirected and connected graph. A novel consensus algorithm is introduced, leveraging a non-negative, asymmetric saturation function. This unconventional saturation is justified since negative amplitudes do not make drones travel backward or have a negative velocity along the path. Rigorous theoretical analysis of the algorithm is complemented by validation through numerical simulations and a real-world formation flight.

* Yang Xu and Jes\'us Bautista contributed equally to this work. In the proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025

Via

Access Paper or Ask Questions

Off-Policy Evaluation Under Nonignorable Missing Data

Jul 09, 2025

Han Wang, Yang Xu, Wenbin Lu, Rui Song

Figure 1 for Off-Policy Evaluation Under Nonignorable Missing Data

Figure 2 for Off-Policy Evaluation Under Nonignorable Missing Data

Figure 3 for Off-Policy Evaluation Under Nonignorable Missing Data

Figure 4 for Off-Policy Evaluation Under Nonignorable Missing Data

Abstract:Off-Policy Evaluation (OPE) aims to estimate the value of a target policy using offline data collected from potentially different policies. In real-world applications, however, logged data often suffers from missingness. While OPE has been extensively studied in the literature, a theoretical understanding of how missing data affects OPE results remains unclear. In this paper, we investigate OPE in the presence of monotone missingness and theoretically demonstrate that the value estimates remain unbiased under ignorable missingness but can be biased under nonignorable (informative) missingness. To retain the consistency of value estimation, we propose an inverse probability weighted value estimator and conduct statistical inference to quantify the uncertainty of the estimates. Through a series of numerical experiments, we empirically demonstrate that our proposed estimator yields a more reliable value inference under missing data.

Via

Access Paper or Ask Questions

Deep regularization networks for inverse problems with noisy operators

Jun 08, 2025

Fatemeh Pourahmadian, Yang Xu

Abstract:A supervised learning approach is proposed for regularization of large inverse problems where the main operator is built from noisy data. This is germane to superresolution imaging via the sampling indicators of the inverse scattering theory. We aim to accelerate the spatiotemporal regularization process for this class of inverse problems to enable real-time imaging. In this approach, a neural operator maps each pattern on the right-hand side of the scattering equation to its affiliated regularization parameter. The network is trained in two steps which entails: (1) training on low-resolution regularization maps furnished by the Morozov discrepancy principle with nonoptimal thresholds, and (2) optimizing network predictions through minimization of the Tikhonov loss function regulated by the validation loss. Step 2 allows for tailoring of the approximate maps of Step 1 toward construction of higher quality images. This approach enables direct learning from test data and dispenses with the need for a-priori knowledge of the optimal regularization maps. The network, trained on low-resolution data, quickly generates dense regularization maps for high-resolution imaging. We highlight the importance of the training loss function on the network's generalizability. In particular, we demonstrate that networks informed by the logic of discrepancy principle lead to images of higher contrast. In this case, the training process involves many-objective optimization. We propose a new method to adaptively select the appropriate loss weights during training without requiring an additional optimization process. The proposed approach is synthetically examined for imaging damage evolution in an elastic plate. The results indicate that the discrepancy-informed regularization networks not only accelerate the imaging process, but also remarkably enhance the image quality in complex environments.

Via

Access Paper or Ask Questions

Efficient $Q$-Learning and Actor-Critic Methods for Robust Average Reward Reinforcement Learning

Jun 08, 2025

Yang Xu, Swetha Ganesh, Vaneet Aggarwal

Abstract:We present the first $Q$-learning and actor-critic algorithms for robust average reward Markov Decision Processes (MDPs) with non-asymptotic convergence under contamination, TV distance and Wasserstein distance uncertainty sets. We show that the robust $Q$ Bellman operator is a strict contractive mapping with respect to a carefully constructed semi-norm with constant functions being quotiented out. This property supports a stochastic approximation update, that learns the optimal robust $Q$ function in $\tilde{\cO}(\epsilon^{-2})$ samples. We also show that the same idea can be used for robust $Q$ function estimation, which can be further used for critic estimation. Coupling it with theories in robust policy mirror descent update, we present a natural actor-critic algorithm that attains an $\epsilon$-optimal robust policy in $\tilde{\cO}(\epsilon^{-3})$ samples. These results advance the theory of distributionally robust reinforcement learning in the average reward setting.

* arXiv admin note: text overlap with arXiv:2502.16816

Via

Access Paper or Ask Questions

Autonomous Flights inside Narrow Tunnels

May 26, 2025

Luqi Wang, Yan Ning, Hongming Chen, Peize Liu, Yang Xu, Hao Xu, Ximin Lyu, Shaojie Shen

Figure 1 for Autonomous Flights inside Narrow Tunnels

Figure 2 for Autonomous Flights inside Narrow Tunnels

Figure 3 for Autonomous Flights inside Narrow Tunnels

Figure 4 for Autonomous Flights inside Narrow Tunnels

Abstract:Multirotors are usually desired to enter confined narrow tunnels that are barely accessible to humans in various applications including inspection, search and rescue, and so on. This task is extremely challenging since the lack of geometric features and illuminations, together with the limited field of view, cause problems in perception; the restricted space and significant ego airflow disturbances induce control issues. This paper introduces an autonomous aerial system designed for navigation through tunnels as narrow as 0.5 m in diameter. The real-time and online system includes a virtual omni-directional perception module tailored for the mission and a novel motion planner that incorporates perception and ego airflow disturbance factors modeled using camera projections and computational fluid dynamics analyses, respectively. Extensive flight experiments on a custom-designed quadrotor are conducted in multiple realistic narrow tunnels to validate the superior performance of the system, even over human pilots, proving its potential for real applications. Additionally, a deployment pipeline on other multirotor platforms is outlined and open-source packages are provided for future developments.

* 21 pages, 27 figures, accepted by IEEE Transactions on Robotics (T-RO)

Via

Access Paper or Ask Questions