Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Mingxuan Li

Causal Flow Q-Learning for Robust Offline Reinforcement Learning

Feb 02, 2026

Mingxuan Li, Junzhe Zhang, Elias Bareinboim

Abstract:Expressive policies based on flow-matching have been successfully applied in reinforcement learning (RL) more recently due to their ability to model complex action distributions from offline data. These algorithms build on standard policy gradients, which assume that there is no unmeasured confounding in the data. However, this condition does not necessarily hold for pixel-based demonstrations when a mismatch exists between the demonstrator's and the learner's sensory capabilities, leading to implicit confounding biases in offline data. We address the challenge by investigating the problem of confounded observations in offline RL from a causal perspective. We develop a novel causal offline RL objective that optimizes policies' worst-case performance that may arise due to confounding biases. Based on this new objective, we introduce a practical implementation that learns expressive flow-matching policies from confounded demonstrations, employing a deep discriminator to assess the discrepancy between the target policy and the nominal behavioral policy. Experiments across 25 pixel-based tasks demonstrate that our proposed confounding-robust augmentation procedure achieves a success rate 120\% that of confounding-unaware, state-of-the-art offline RL methods.

Via

Access Paper or Ask Questions

IFG: Internet-Scale Guidance for Functional Grasping Generation

Nov 12, 2025

Ray Muxin Liu, Mingxuan Li, Kenneth Shaw, Deepak Pathak

Figure 1 for IFG: Internet-Scale Guidance for Functional Grasping Generation

Figure 2 for IFG: Internet-Scale Guidance for Functional Grasping Generation

Figure 3 for IFG: Internet-Scale Guidance for Functional Grasping Generation

Figure 4 for IFG: Internet-Scale Guidance for Functional Grasping Generation

Abstract:Large Vision Models trained on internet-scale data have demonstrated strong capabilities in segmenting and semantically understanding object parts, even in cluttered, crowded scenes. However, while these models can direct a robot toward the general region of an object, they lack the geometric understanding required to precisely control dexterous robotic hands for 3D grasping. To overcome this, our key insight is to leverage simulation with a force-closure grasping generation pipeline that understands local geometries of the hand and object in the scene. Because this pipeline is slow and requires ground-truth observations, the resulting data is distilled into a diffusion model that operates in real-time on camera point clouds. By combining the global semantic understanding of internet-scale models with the geometric precision of a simulation-based locally-aware force-closure, \our achieves high-performance semantic grasping without any manually collected training data. For visualizations of this please visit our website at https://ifgrasping.github.io/

* Website at https://ifgrasping.github.io/

Via

Access Paper or Ask Questions

Solvaformer: an SE(3)-equivariant graph transformer for small molecule solubility prediction

Nov 12, 2025

Jonathan Broadbent, Michael Bailey, Mingxuan Li, Abhishek Paul, Louis De Lescure, Paul Chauvin, Lorenzo Kogler-Anele, Yasser Jangjou, Sven Jager

Figure 1 for Solvaformer: an SE(3)-equivariant graph transformer for small molecule solubility prediction

Figure 2 for Solvaformer: an SE(3)-equivariant graph transformer for small molecule solubility prediction

Figure 3 for Solvaformer: an SE(3)-equivariant graph transformer for small molecule solubility prediction

Figure 4 for Solvaformer: an SE(3)-equivariant graph transformer for small molecule solubility prediction

Abstract:Accurate prediction of small molecule solubility using material-sparing approaches is critical for accelerating synthesis and process optimization, yet experimental measurement is costly and many learning approaches either depend on quantumderived descriptors or offer limited interpretability. We introduce Solvaformer, a geometry-aware graph transformer that models solutions as multiple molecules with independent SE(3) symmetries. The architecture combines intramolecular SE(3)-equivariant attention with intermolecular scalar attention, enabling cross-molecular communication without imposing spurious relative geometry. We train Solvaformer in a multi-task setting to predict both solubility (log S) and solvation free energy, using an alternating-batch regimen that trains on quantum-mechanical data (CombiSolv-QM) and on experimental measurements (BigSolDB 2.0). Solvaformer attains the strongest overall performance among the learned models and approaches a DFT-assisted gradient-boosting baseline, while outperforming an EquiformerV2 ablation and sequence-based alternatives. In addition, token-level attention produces chemically coherent attributions: case studies recover known intra- vs. inter-molecular hydrogen-bonding patterns that govern solubility differences in positional isomers. Taken together, Solvaformer provides an accurate, scalable, and interpretable approach to solution-phase property prediction by uniting geometric inductive bias with a mixed dataset training strategy on complementary computational and experimental data.

Via

Access Paper or Ask Questions

From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

Oct 16, 2025

Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, Ziwei Liu

Abstract:The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (-) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (-) How to make research in native VLMs more accessible and democratized, thereby accelerating progress in the field. In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.

* 21 pages, 7 figures

Via

Access Paper or Ask Questions

Automatic Reward Shaping from Confounded Offline Data

May 16, 2025

Mingxuan Li, Junzhe Zhang, Elias Bareinboim

Abstract:A key task in Artificial Intelligence is learning effective policies for controlling agents in unknown environments to optimize performance measures. Off-policy learning methods, like Q-learning, allow learners to make optimal decisions based on past experiences. This paper studies off-policy learning from biased data in complex and high-dimensional domains where \emph{unobserved confounding} cannot be ruled out a priori. Building on the well-celebrated Deep Q-Network (DQN), we propose a novel deep reinforcement learning algorithm robust to confounding biases in observed data. Specifically, our algorithm attempts to find a safe policy for the worst-case environment compatible with the observations. We apply our method to twelve confounded Atari games, and find that it consistently dominates the standard DQN in all games where the observed input to the behavioral and target policies mismatch and unobserved confounders exist.

* ICML 2025

Via

Access Paper or Ask Questions

HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation

Apr 09, 2025

Mingxuan Li, Hanchen Li, Chenhao Tan

Abstract:Large language models (LLMs) have demonstrated great potential for automating the evaluation of natural language generation. Previous frameworks of LLM-as-a-judge fall short in two ways: they either use zero-shot setting without consulting any human input, which leads to low alignment, or fine-tune LLMs on labeled data, which requires a non-trivial number of samples. Moreover, previous methods often provide little reasoning behind automated evaluations. In this paper, we propose HypoEval, Hypothesis-guided Evaluation framework, which first uses a small corpus of human evaluations to generate more detailed rubrics for human judgments and then incorporates a checklist-like approach to combine LLM's assigned scores on each decomposed dimension to acquire overall scores. With only 30 human evaluations, HypoEval achieves state-of-the-art performance in alignment with both human rankings (Spearman correlation) and human scores (Pearson correlation), on average outperforming G-Eval by 11.86% and fine-tuned Llama-3.1-8B-Instruct with at least 3 times more human evaluations by 11.95%. Furthermore, we conduct systematic studies to assess the robustness of HypoEval, highlighting its effectiveness as a reliable and interpretable automated evaluation framework.

* 22 pages, 3 figures, code link: https://github.com/ChicagoHAI/HypoEval-Gen

Via

Access Paper or Ask Questions

Causally Aligned Curriculum Learning

Mar 21, 2025

Mingxuan Li, Junzhe Zhang, Elias Bareinboim

Abstract:A pervasive challenge in Reinforcement Learning (RL) is the "curse of dimensionality" which is the exponential growth in the state-action space when optimizing a high-dimensional target task. The framework of curriculum learning trains the agent in a curriculum composed of a sequence of related and more manageable source tasks. The expectation is that when some optimal decision rules are shared across source tasks and the target task, the agent could more quickly pick up the necessary skills to behave optimally in the environment, thus accelerating the learning process. However, this critical assumption of invariant optimal decision rules does not necessarily hold in many practical applications, specifically when the underlying environment contains unobserved confounders. This paper studies the problem of curriculum RL through causal lenses. We derive a sufficient graphical condition characterizing causally aligned source tasks, i.e., the invariance of optimal decision rules holds. We further develop an efficient algorithm to generate a causally aligned curriculum, provided with qualitative causal knowledge of the target task. Finally, we validate our proposed methodology through experiments in discrete and continuous confounded tasks with pixel observations.

* Accepted as Posters in ICLR 2024

Via

Access Paper or Ask Questions

Modeling, Simulation, and Application of Spatio-Temporal Characteristics Detection in Incipient Slip

Feb 24, 2025

Mingxuan Li, Lunwei Zhang, Qiyin Huang, Tiemin Li, Yao Jiang

Abstract:Incipient slip detection provides critical feedback for robotic grasping and manipulation tasks. However, maintaining its adaptability under diverse object properties and complex working conditions remains challenging. This article highlights the importance of completely representing spatio-temporal features of slip, and proposes a novel approach for incipient slip modeling and detection. Based on the analysis of localized displacement phenomenon, we establish the relationship between the characteristic strain rate extreme events and the local slip state. This approach enables the detection of both the spatial distribution and temporal dynamics of stick-slip regions. Also, the proposed method can be applied to strain distribution sensing devices, such as vision-based tactile sensors. Simulations and prototype experiments validated the effectiveness of this approach under varying contact conditions, including different contact geometries, friction coefficients, and combined loads. Experiments demonstrated that this method not only accurately and reliably delineates incipient slip, but also facilitates friction parameter estimation and adaptive grasping control.

* 21 pages, 19 figures

Via

Access Paper or Ask Questions

Literature Meets Data: A Synergistic Approach to Hypothesis Generation

Oct 22, 2024

Haokun Liu, Yangqiaoyu Zhou, Mingxuan Li, Chenfei Yuan, Chenhao Tan

Figure 1 for Literature Meets Data: A Synergistic Approach to Hypothesis Generation

Figure 2 for Literature Meets Data: A Synergistic Approach to Hypothesis Generation

Figure 3 for Literature Meets Data: A Synergistic Approach to Hypothesis Generation

Figure 4 for Literature Meets Data: A Synergistic Approach to Hypothesis Generation

Abstract:AI holds promise for transforming scientific processes, including hypothesis generation. Prior work on hypothesis generation can be broadly categorized into theory-driven and data-driven approaches. While both have proven effective in generating novel and plausible hypotheses, it remains an open question whether they can complement each other. To address this, we develop the first method that combines literature-based insights with data to perform LLM-powered hypothesis generation. We apply our method on five different datasets and demonstrate that integrating literature and data outperforms other baselines (8.97\% over few-shot, 15.75\% over literature-based alone, and 3.37\% over data-driven alone). Additionally, we conduct the first human evaluation to assess the utility of LLM-generated hypotheses in assisting human decision-making on two challenging tasks: deception detection and AI generated content detection. Our results show that human accuracy improves significantly by 7.44\% and 14.19\% on these tasks, respectively. These findings suggest that integrating literature-based and data-driven approaches provides a comprehensive and nuanced framework for hypothesis generation and could open new avenues for scientific inquiry.

* 30 pages, 7 figures, code link: https://github.com/ChicagoHAI/hypothesis-generation

Via

Access Paper or Ask Questions

The Comparison of Individual Cat Recognition Using Neural Networks

Oct 03, 2024

Mingxuan Li, Kai Zhou

Abstract:Facial recognition using deep learning has been widely used in social life for applications such as authentication, smart door locks, and photo grouping, etc. More and more networks have been developed to facilitate computer vision tasks, such as ResNet, DenseNet, EfficientNet, ConvNeXt, and Siamese networks. However, few studies have systematically compared the advantages and disadvantages of such neural networks in identifying individuals from images, especially for pet animals like cats. In the present study, by systematically comparing the efficacy of different neural networks in cat recognition, we found traditional CNNs trained with transfer learning have better performance than models trained with the fine-tuning method or Siamese networks in individual cat recognition. In addition, ConvNeXt and DenseNet yield significant results which could be further optimized for individual cat recognition in pet stores and in the wild. These results provide a method to improve cat management in pet stores and monitoring of cats in the wild.

* 13 pages,7 figures

Via

Access Paper or Ask Questions