Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Topic:Mapping

On the (Non) Injectivity of Piecewise Linear Janossy Pooling

May 26, 2025

Ilai Reshef, Nadav Dym

Abstract:Multiset functions, which are functions that map multisets to vectors, are a fundamental tool in the construction of neural networks for multisets and graphs. To guarantee that the vector representation of the multiset is faithful, it is often desirable to have multiset mappings that are both injective and bi-Lipschitz. Currently, there are several constructions of multiset functions achieving both these guarantees, leading to improved performance in some tasks but often also to higher compute time than standard constructions. Accordingly, it is natural to inquire whether simpler multiset functions achieving the same guarantees are available. In this paper, we make a large step towards giving a negative answer to this question. We consider the family of k-ary Janossy pooling, which includes many of the most popular multiset models, and prove that no piecewise linear Janossy pooling function can be injective. On the positive side, we show that when restricted to multisets without multiplicities, even simple deep-sets models suffice for injectivity and bi-Lipschitzness.

* 14 pages, 2 figures

Via

Access Paper or Ask Questions

OB3D: A New Dataset for Benchmarking Omnidirectional 3D Reconstruction Using Blender

May 26, 2025

Shintaro Ito, Natsuki Takama, Toshiki Watanabe, Koichi Ito, Hwann-Tzong Chen, Takafumi Aoki

Abstract:Recent advancements in radiance field rendering, exemplified by Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have significantly progressed 3D modeling and reconstruction. The use of multiple 360-degree omnidirectional images for these tasks is increasingly favored due to advantages in data acquisition and comprehensive scene capture. However, the inherent geometric distortions in common omnidirectional representations, such as equirectangular projection (particularly severe in polar regions and varying with latitude), pose substantial challenges to achieving high-fidelity 3D reconstructions. Current datasets, while valuable, often lack the specific focus, scene composition, and ground truth granularity required to systematically benchmark and drive progress in overcoming these omnidirectional-specific challenges. To address this critical gap, we introduce Omnidirectional Blender 3D (OB3D), a new synthetic dataset curated for advancing 3D reconstruction from multiple omnidirectional images. OB3D features diverse and complex 3D scenes generated from Blender 3D projects, with a deliberate emphasis on challenging scenarios. The dataset provides comprehensive ground truth, including omnidirectional RGB images, precise omnidirectional camera parameters, and pixel-aligned equirectangular maps for depth and normals, alongside evaluation metrics. By offering a controlled yet challenging environment, OB3Daims to facilitate the rigorous evaluation of existing methods and prompt the development of new techniques to enhance the accuracy and reliability of 3D reconstruction from omnidirectional images.

Via

Access Paper or Ask Questions

Understanding Generalization in Diffusion Models via Probability Flow Distance

May 26, 2025

Huijie Zhang, Zijian Huang, Siyi Chen, Jinfan Zhou, Zekai Zhang, Peng Wang, Qing Qu

Abstract:Diffusion models have emerged as a powerful class of generative models, capable of producing high-quality samples that generalize beyond the training data. However, evaluating this generalization remains challenging: theoretical metrics are often impractical for high-dimensional data, while no practical metrics rigorously measure generalization. In this work, we bridge this gap by introducing probability flow distance ($\texttt{PFD}$), a theoretically grounded and computationally efficient metric to measure distributional generalization. Specifically, $\texttt{PFD}$ quantifies the distance between distributions by comparing their noise-to-data mappings induced by the probability flow ODE. Moreover, by using $\texttt{PFD}$ under a teacher-student evaluation protocol, we empirically uncover several key generalization behaviors in diffusion models, including: (1) scaling behavior from memorization to generalization, (2) early learning and double descent training dynamics, and (3) bias-variance decomposition. Beyond these insights, our work lays a foundation for future empirical and theoretical studies on generalization in diffusion models.

* 41 pages, 14 figures

Via

Access Paper or Ask Questions

Grokking ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior

May 26, 2025

Florian Eichin, Yupei Du, Philipp Mondorf, Barbara Plank, Michael A. Hedderich

Abstract:Post-hoc interpretability methods typically attribute a model's behavior to its components, data, or training trajectory in isolation. This leads to explanations that lack a unified view and may miss key interactions. While combining existing methods or applying them at different training stages offers broader insights, these approaches usually lack theoretical support. In this work, we present ExPLAIND, a unified framework that integrates all three perspectives. First, we generalize recent work on gradient path kernels, which reformulate models trained by gradient descent as a kernel machine, to more realistic training settings. Empirically, we find that both a CNN and a Transformer model are replicated accurately by this reformulation. Second, we derive novel parameter- and step-wise influence scores from the kernel feature maps. We show their effectiveness in parameter pruning that is comparable to existing methods, reinforcing their value for model component attribution. Finally, jointly interpreting model components and data over the training process, we leverage ExPLAIND to analyze a Transformer that exhibits Grokking. Among other things, our findings support previously proposed stages of Grokking, while refining the final phase as one of alignment of input embeddings and final layers around a representation pipeline learned after the memorization phase. Overall, ExPLAIND provides a theoretically grounded, unified framework to interpret model behavior and training dynamics.

* 9 pages main paper, 23 pages in total, code at https://github.com/mainlp/explaind

Via

Access Paper or Ask Questions

Ontology- and LLM-based Data Harmonization for Federated Learning in Healthcare

May 26, 2025

Natallia Kokash, Lei Wang, Thomas H. Gillespie, Adam Belloum, Paola Grosso, Sara Quinney, Lang Li, Bernard de Bono

Abstract:The rise of electronic health records (EHRs) has unlocked new opportunities for medical research, but privacy regulations and data heterogeneity remain key barriers to large-scale machine learning. Federated learning (FL) enables collaborative modeling without sharing raw data, yet faces challenges in harmonizing diverse clinical datasets. This paper presents a two-step data alignment strategy integrating ontologies and large language models (LLMs) to support secure, privacy-preserving FL in healthcare, demonstrating its effectiveness in a real-world project involving semantic mapping of EHR data.

* Related dataset: https://doi.org/10.5281/zenodo.15411810

Via

Access Paper or Ask Questions

Embracing Imperfection: Simulating Students with Diverse Cognitive Levels Using LLM-based Agents

May 26, 2025

Tao Wu, Jingyuan Chen, Wang Lin, Mengze Li, Yumeng Zhu, Ang Li, Kun Kuang, Fei Wu

Abstract:Large language models (LLMs) are revolutionizing education, with LLM-based agents playing a key role in simulating student behavior. A major challenge in student simulation is modeling the diverse learning patterns of students at various cognitive levels. However, current LLMs, typically trained as ``helpful assistants'', target at generating perfect responses. As a result, they struggle to simulate students with diverse cognitive abilities, as they often produce overly advanced answers, missing the natural imperfections that characterize student learning and resulting in unrealistic simulations. To address this issue, we propose a training-free framework for student simulation. We begin by constructing a cognitive prototype for each student using a knowledge graph, which captures their understanding of concepts from past learning records. This prototype is then mapped to new tasks to predict student performance. Next, we simulate student solutions based on these predictions and iteratively refine them using a beam search method to better replicate realistic mistakes. To validate our approach, we construct the \texttt{Student\_100} dataset, consisting of $100$ students working on Python programming and $5,000$ learning records. Experimental results show that our method consistently outperforms baseline models, achieving $100\%$ improvement in simulation accuracy.

Via

Access Paper or Ask Questions

ICDM: Interference Cancellation Diffusion Models for Wireless Semantic Communications

May 26, 2025

Tong Wu, Zhiyong Chen, Dazhi He, Feng Yang, Meixia Tao, Xiaodong Xu, Wenjun Zhang, Ping Zhang

Abstract:Diffusion models (DMs) have recently achieved significant success in wireless communications systems due to their denoising capabilities. The broadcast nature of wireless signals makes them susceptible not only to Gaussian noise, but also to unaware interference. This raises the question of whether DMs can effectively mitigate interference in wireless semantic communication systems. In this paper, we model the interference cancellation problem as a maximum a posteriori (MAP) problem over the joint posterior probability of the signal and interference, and theoretically prove that the solution provides excellent estimates for the signal and interference. To solve this problem, we develop an interference cancellation diffusion model (ICDM), which decomposes the joint posterior into independent prior probabilities of the signal and interference, along with the channel transition probablity. The log-gradients of these distributions at each time step are learned separately by DMs and accurately estimated through deriving. ICDM further integrates these gradients with advanced numerical iteration method, achieving accurate and rapid interference cancellation. Extensive experiments demonstrate that ICDM significantly reduces the mean square error (MSE) and enhances perceptual quality compared to schemes without ICDM. For example, on the CelebA dataset under the Rayleigh fading channel with a signal-to-noise ratio (SNR) of $20$ dB and signal to interference plus noise ratio (SINR) of 0 dB, ICDM reduces the MSE by 4.54 dB and improves the learned perceptual image patch similarity (LPIPS) by 2.47 dB.

* submitted to IEEE journal

Via

Access Paper or Ask Questions

Conversational Lexicography: Querying Lexicographic Data on Knowledge Graphs with SPARQL through Natural Language

May 26, 2025

Kilian Sennrich, Sina Ahmadi

Abstract:Knowledge graphs offer an excellent solution for representing the lexical-semantic structures of lexicographic data. However, working with the SPARQL query language represents a considerable hurdle for many non-expert users who could benefit from the advantages of this technology. This paper addresses the challenge of creating natural language interfaces for lexicographic data retrieval on knowledge graphs such as Wikidata. We develop a multidimensional taxonomy capturing the complexity of Wikidata's lexicographic data ontology module through four dimensions and create a template-based dataset with over 1.2 million mappings from natural language utterances to SPARQL queries. Our experiments with GPT-2 (124M), Phi-1.5 (1.3B), and GPT-3.5-Turbo reveal significant differences in model capabilities. While all models perform well on familiar patterns, only GPT-3.5-Turbo demonstrates meaningful generalization capabilities, suggesting that model size and diverse pre-training are crucial for adaptability in this domain. However, significant challenges remain in achieving robust generalization, handling diverse linguistic data, and developing scalable solutions that can accommodate the full complexity of lexicographic knowledge representation.

* Accepted to LDK 2025 - the 5th Conference on Language, Data and Knowledge. Naples, Italy, 9-11 September 2025

Via

Access Paper or Ask Questions

Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval

May 26, 2025

Rong-Cheng Tu, Wenhao Sun, Hanzhe You, Yingjie Wang, Jiaxing Huang, Li Shen, Dacheng Tao

Abstract:Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a compositional query, consisting of a reference image and a modifying text-without relying on annotated training data. Existing approaches often generate a synthetic target text using large language models (LLMs) to serve as an intermediate anchor between the compositional query and the target image. Models are then trained to align the compositional query with the generated text, and separately align images with their corresponding texts using contrastive learning. However, this reliance on intermediate text introduces error propagation, as inaccuracies in query-to-text and text-to-image mappings accumulate, ultimately degrading retrieval performance. To address these problems, we propose a novel framework by employing a Multimodal Reasoning Agent (MRA) for ZS-CIR. MRA eliminates the dependence on textual intermediaries by directly constructing triplets, <reference image, modification text, target image>, using only unlabeled image data. By training on these synthetic triplets, our model learns to capture the relationships between compositional queries and candidate images directly. Extensive experiments on three standard CIR benchmarks demonstrate the effectiveness of our approach. On the FashionIQ dataset, our method improves Average R@10 by at least 7.5\% over existing baselines; on CIRR, it boosts R@1 by 9.6\%; and on CIRCO, it increases mAP@5 by 9.5\%.

Via

Access Paper or Ask Questions

EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM

May 26, 2025

Shuang Ao, Flora D. Salim, Simon Khan

Abstract:Although LLMs demonstrate proficiency in several text-based reasoning and planning tasks, their implementation in robotics control is constrained by significant deficiencies: (1) LLM agents are designed to work mainly with textual inputs rather than visual conditions; (2) Current multimodal agents treat LLMs as static planners, which separates their reasoning from environment dynamics, resulting in actions that do not take domain-specific knowledge into account; and (3) LLMs are not designed to learn from visual interactions, which makes it harder for them to make better policies for specific domains. In this paper, we introduce EMAC+, an Embodied Multimodal Agent that collaboratively integrates LLM and VLM via a bidirectional training paradigm. Unlike existing methods, EMAC+ dynamically refines high-level textual plans generated by an LLM using real-time feedback from a VLM executing low-level visual control tasks. We address critical limitations of previous models by enabling the LLM to internalize visual environment dynamics directly through interactive experience, rather than relying solely on static symbolic mappings. Extensive experimental evaluations on ALFWorld and RT-1 benchmarks demonstrate that EMAC+ achieves superior task performance, robustness against noisy observations, and efficient learning. We also conduct thorough ablation studies and provide detailed analyses of success and failure cases.

Via

Access Paper or Ask Questions

Topic:Mapping

Papers and Code