Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Matthew Trager

UCLA-CS

Maximally-Informative Retrieval for State Space Model Generation

Jun 13, 2025

Evan Becker, Benjamin Bowman, Matthew Trager, Tian Yu Liu, Luca Zancato, Wei Xia, Stefano Soatto

Abstract:Given a query and dataset, the optimal way of answering the query is to make use all the information available. Modern LLMs exhibit impressive ability to memorize training data, but data not deemed important during training is forgotten, and information outside that training set cannot be made use of. Processing an entire dataset at inference time is infeasible due to the bounded nature of model resources (e.g. context size in transformers or states in state space models), meaning we must resort to external memory. This constraint naturally leads to the following problem: How can we decide based on the present query and model, what among a virtually unbounded set of known data matters for inference? To minimize model uncertainty for a particular query at test-time, we introduce Retrieval In-Context Optimization (RICO), a retrieval method that uses gradients from the LLM itself to learn the optimal mixture of documents for answer generation. Unlike traditional retrieval-augmented generation (RAG), which relies on external heuristics for document retrieval, our approach leverages direct feedback from the model. Theoretically, we show that standard top-$k$ retrieval with model gradients can approximate our optimization procedure, and provide connections to the leave-one-out loss. We demonstrate empirically that by minimizing an unsupervised loss objective in the form of question perplexity, we can achieve comparable retriever metric performance to BM25 with \emph{no finetuning}. Furthermore, when evaluated on quality of the final prediction, our method often outperforms fine-tuned dense retrievers such as E5.

Via

Access Paper or Ask Questions

PICASO: Permutation-Invariant Context Composition with State Space Models

Feb 24, 2025

Tian Yu Liu, Alessandro Achille, Matthew Trager, Aditya Golatkar, Luca Zancato, Stefano Soatto

Figure 1 for PICASO: Permutation-Invariant Context Composition with State Space Models

Figure 2 for PICASO: Permutation-Invariant Context Composition with State Space Models

Figure 3 for PICASO: Permutation-Invariant Context Composition with State Space Models

Figure 4 for PICASO: Permutation-Invariant Context Composition with State Space Models

Abstract:Providing Large Language Models with relevant contextual knowledge at inference time has been shown to greatly improve the quality of their generations. This is often achieved by prepending informative passages of text, or 'contexts', retrieved from external knowledge bases to their input. However, processing additional contexts online incurs significant computation costs that scale with their length. State Space Models (SSMs) offer a promising solution by allowing a database of contexts to be mapped onto fixed-dimensional states from which to start the generation. A key challenge arises when attempting to leverage information present across multiple contexts, since there is no straightforward way to condition generation on multiple independent states in existing SSMs. To address this, we leverage a simple mathematical relation derived from SSM dynamics to compose multiple states into one that efficiently approximates the effect of concatenating textual contexts. Since the temporal ordering of contexts can often be uninformative, we enforce permutation-invariance by efficiently averaging states obtained via our composition algorithm across all possible context orderings. We evaluate our resulting method on WikiText and MSMARCO in both zero-shot and fine-tuned settings, and show that we can match the strongest performing baseline while enjoying on average 5.4x speedup.

* Published in The Thirteenth International Conference on Learning Representations, ICLR 2025

Via

Access Paper or Ask Questions

Descriminative-Generative Custom Tokens for Vision-Language Models

Feb 17, 2025

Pramuditha Perera, Matthew Trager, Luca Zancato, Alessandro Achille, Stefano Soatto

Figure 1 for Descriminative-Generative Custom Tokens for Vision-Language Models

Figure 2 for Descriminative-Generative Custom Tokens for Vision-Language Models

Figure 3 for Descriminative-Generative Custom Tokens for Vision-Language Models

Figure 4 for Descriminative-Generative Custom Tokens for Vision-Language Models

Abstract:This paper explores the possibility of learning custom tokens for representing new concepts in Vision-Language Models (VLMs). Our aim is to learn tokens that can be effective for both discriminative and generative tasks while composing well with words to form new input queries. The targeted concept is specified in terms of a small set of images and a parent concept described using text. We operate on CLIP text features and propose to use a combination of a textual inversion loss and a classification loss to ensure that text features of the learned token are aligned with image features of the concept in the CLIP embedding space. We restrict the learned token to a low-dimensional subspace spanned by tokens for attributes that are appropriate for the given super-class. These modifications improve the quality of compositions of the learned token with natural language for generating new scenes. Further, we show that learned custom tokens can be used to form queries for text-to-image retrieval task, and also have the important benefit that composite queries can be visualized to ensure that the desired concept is faithfully encoded. Based on this, we introduce the method of Generation Aided Image Retrieval, where the query is modified at inference time to better suit the search intent. On the DeepFashion2 dataset, our method improves Mean Reciprocal Retrieval (MRR) over relevant baselines by 7%.

Via

Access Paper or Ask Questions

An Invitation to Neuroalgebraic Geometry

Jan 31, 2025

Giovanni Luca Marchetti, Vahid Shahverdi, Stefano Mereta, Matthew Trager, Kathlén Kohn

Figure 1 for An Invitation to Neuroalgebraic Geometry

Figure 2 for An Invitation to Neuroalgebraic Geometry

Figure 3 for An Invitation to Neuroalgebraic Geometry

Figure 4 for An Invitation to Neuroalgebraic Geometry

Abstract:In this expository work, we promote the study of function spaces parameterized by machine learning models through the lens of algebraic geometry. To this end, we focus on algebraic models, such as neural networks with polynomial activations, whose associated function spaces are semi-algebraic varieties. We outline a dictionary between algebro-geometric invariants of these varieties, such as dimension, degree, and singularities, and fundamental aspects of machine learning, such as sample complexity, expressivity, training dynamics, and implicit bias. Along the way, we review the literature and discuss ideas beyond the algebraic domain. This work lays the foundations of a research direction bridging algebraic geometry and deep learning, that we refer to as neuroalgebraic geometry.

Via

Access Paper or Ask Questions

Geometry and Optimization of Shallow Polynomial Networks

Jan 10, 2025

Yossi Arjevani, Joan Bruna, Joe Kileel, Elzbieta Polak, Matthew Trager

Abstract:We study shallow neural networks with polynomial activations. The function space for these models can be identified with a set of symmetric tensors with bounded rank. We describe general features of these networks, focusing on the relationship between width and optimization. We then consider teacher-student problems, that can be viewed as a problem of low-rank tensor approximation with respect to a non-standard inner product that is induced by the data distribution. In this setting, we introduce a teacher-metric discriminant which encodes the qualitative behavior of the optimization as a function of the training data distribution. Finally, we focus on networks with quadratic activations, presenting an in-depth analysis of the optimization landscape. In particular, we present a variation of the Eckart-Young Theorem characterizing all critical points and their Hessian signatures for teacher-student problems with quadratic networks and Gaussian training data.

* 36 pages, 2 figures

Via

Access Paper or Ask Questions

The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation

Nov 06, 2024

Lawrence Stewart, Matthew Trager, Sujan Kumar Gonugondla, Stefano Soatto

Figure 1 for The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation

Figure 2 for The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation

Figure 3 for The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation

Figure 4 for The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation

Abstract:Speculative decoding aims to speed up autoregressive generation of a language model by verifying in parallel the tokens generated by a smaller draft model.In this work, we explore the effectiveness of learning-free, negligible-cost draft strategies, namely $N$-grams obtained from the model weights and the context. While the predicted next token of the base model is rarely the top prediction of these simple strategies, we observe that it is often within their top-$k$ predictions for small $k$. Based on this, we show that combinations of simple strategies can achieve significant inference speedups over different tasks. The overall performance is comparable to more complex methods, yet does not require expensive preprocessing or modification of the base model, and allows for seamless `plug-and-play' integration into pipelines.

* ENLSP-IV 2024 - 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, Dec 2024, Vancouver, Canada

Via

Access Paper or Ask Questions

Compositional Structures in Neural Embedding and Interaction Decompositions

Jul 12, 2024

Matthew Trager, Alessandro Achille, Pramuditha Perera, Luca Zancato, Stefano Soatto

Figure 1 for Compositional Structures in Neural Embedding and Interaction Decompositions

Figure 2 for Compositional Structures in Neural Embedding and Interaction Decompositions

Figure 3 for Compositional Structures in Neural Embedding and Interaction Decompositions

Abstract:We describe a basic correspondence between linear algebraic structures within vector embeddings in artificial neural networks and conditional independence constraints on the probability distributions modeled by these networks. Our framework aims to shed light on the emergence of structural patterns in data representations, a phenomenon widely acknowledged but arguably still lacking a solid formal grounding. Specifically, we introduce a characterization of compositional structures in terms of "interaction decompositions," and we establish necessary and sufficient conditions for the presence of such structures within the representations of a model.

* 15 pages, 3 figures

Via

Access Paper or Ask Questions

B'MOJO: Hybrid State Space Realizations of Foundation Models with Eidetic and Fading Memory

Jul 08, 2024

Luca Zancato, Arjun Seshadri, Yonatan Dukler, Aditya Golatkar, Yantao Shen, Benjamin Bowman, Matthew Trager, Alessandro Achille, Stefano Soatto

Abstract:We describe a family of architectures to support transductive inference by allowing memory to grow to a finite but a-priori unknown bound while making efficient use of finite resources for inference. Current architectures use such resources to represent data either eidetically over a finite span ("context" in Transformers), or fading over an infinite span (in State Space Models, or SSMs). Recent hybrid architectures have combined eidetic and fading memory, but with limitations that do not allow the designer or the learning process to seamlessly modulate the two, nor to extend the eidetic memory span. We leverage ideas from Stochastic Realization Theory to develop a class of models called B'MOJO to seamlessly combine eidetic and fading memory within an elementary composable module. The overall architecture can be used to implement models that can access short-term eidetic memory "in-context," permanent structural memory "in-weights," fading memory "in-state," and long-term eidetic memory "in-storage" by natively incorporating retrieval from an asynchronously updated memory. We show that Transformers, existing SSMs such as Mamba, and hybrid architectures such as Jamba are special cases of B'MOJO and describe a basic implementation, to be open sourced, that can be stacked and scaled efficiently in hardware. We test B'MOJO on transductive inference tasks, such as associative recall, where it outperforms existing SSMs and Hybrid models; as a baseline, we test ordinary language modeling where B'MOJO achieves perplexity comparable to similarly-sized Transformers and SSMs up to 1.4B parameters, while being up to 10% faster to train. Finally, we show that B'MOJO's ability to modulate eidetic and fading memory results in better inference on longer sequences tested up to 32K tokens, four-fold the length of the longest sequences seen during training.

Via

Access Paper or Ask Questions

NeRF-Insert: 3D Local Editing with Multimodal Control Signals

Apr 30, 2024

Benet Oriol Sabat, Alessandro Achille, Matthew Trager, Stefano Soatto

Figure 1 for NeRF-Insert: 3D Local Editing with Multimodal Control Signals

Figure 2 for NeRF-Insert: 3D Local Editing with Multimodal Control Signals

Figure 3 for NeRF-Insert: 3D Local Editing with Multimodal Control Signals

Figure 4 for NeRF-Insert: 3D Local Editing with Multimodal Control Signals

Abstract:We propose NeRF-Insert, a NeRF editing framework that allows users to make high-quality local edits with a flexible level of control. Unlike previous work that relied on image-to-image models, we cast scene editing as an in-painting problem, which encourages the global structure of the scene to be preserved. Moreover, while most existing methods use only textual prompts to condition edits, our framework accepts a combination of inputs of different modalities as reference. More precisely, a user may provide a combination of textual and visual inputs including images, CAD models, and binary image masks for specifying a 3D region. We use generic image generation models to in-paint the scene from multiple viewpoints, and lift the local edits to a 3D-consistent NeRF edit. Compared to previous methods, our results show better visual quality and also maintain stronger consistency with the original NeRF.

Via

Access Paper or Ask Questions

Multi-Modal Hallucination Control by Visual Information Grounding

Mar 20, 2024

Alessandro Favero, Luca Zancato, Matthew Trager, Siddharth Choudhary, Pramuditha Perera, Alessandro Achille, Ashwin Swaminathan, Stefano Soatto

Figure 1 for Multi-Modal Hallucination Control by Visual Information Grounding

Figure 2 for Multi-Modal Hallucination Control by Visual Information Grounding

Figure 3 for Multi-Modal Hallucination Control by Visual Information Grounding

Figure 4 for Multi-Modal Hallucination Control by Visual Information Grounding

Abstract:Generative Vision-Language Models (VLMs) are prone to generate plausible-sounding textual answers that, however, are not always grounded in the input image. We investigate this phenomenon, usually referred to as "hallucination" and show that it stems from an excessive reliance on the language prior. In particular, we show that as more tokens are generated, the reliance on the visual prompt decreases, and this behavior strongly correlates with the emergence of hallucinations. To reduce hallucinations, we introduce Multi-Modal Mutual-Information Decoding (M3ID), a new sampling method for prompt amplification. M3ID amplifies the influence of the reference image over the language prior, hence favoring the generation of tokens with higher mutual information with the visual prompt. M3ID can be applied to any pre-trained autoregressive VLM at inference time without necessitating further training and with minimal computational overhead. If training is an option, we show that M3ID can be paired with Direct Preference Optimization (DPO) to improve the model's reliance on the prompt image without requiring any labels. Our empirical findings show that our algorithms maintain the fluency and linguistic capabilities of pre-trained VLMs while reducing hallucinations by mitigating visually ungrounded answers. Specifically, for the LLaVA 13B model, M3ID and M3ID+DPO reduce the percentage of hallucinated objects in captioning tasks by 25% and 28%, respectively, and improve the accuracy on VQA benchmarks such as POPE by 21% and 24%.

* IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024

Via

Access Paper or Ask Questions