While achieving remarkable performances, we show that diffusion models are fragile to the presence of noisy samples, limiting their potential in the vast amount of settings where, unlike image synthesis, we are not blessed with clean data. Motivated by our finding that such fragility originates from the distribution gaps between noisy and clean samples along the diffusion process, we introduce risk-sensitive SDE, a stochastic differential equation that is parameterized by the risk (i.e., data "dirtiness") to adjust the distributions of noisy samples, reducing misguidance while benefiting from their contained information. The optimal expression for risk-sensitive SDE depends on the specific noise distribution, and we derive its parameterizations that minimize the misguidance of noisy samples for both Gaussian and general non-Gaussian perturbations. We conduct extensive experiments on both synthetic and real-world datasets (e.g., medical time series), showing that our model effectively recovers the clean data distribution from noisy samples, significantly outperforming conditional generation baselines.
Transformer-based large language models (LLMs) are constrained by the fixed context window of the underlying transformer architecture, hindering their ability to produce long and logically consistent code. Memory-augmented LLMs are a promising solution, but current approaches cannot handle long code generation tasks since they (1) only focus on reading memory and reduce its evolution to the concatenation of new memories or (2) use very specialized memories that cannot adapt to other domains. This paper presents L2MAC, the first practical LLM-based stored-program automatic computer for long and consistent code generation. Its memory has two components: the instruction registry, which is populated with a prompt program to solve the user-given task, and a file store, which will contain the final and intermediate outputs. Each instruction is executed by a separate LLM instance, whose context is managed by a control unit capable of precise memory reading and writing to ensure effective interaction with the file store. These components enable L2MAC to generate virtually unbounded code structures, bypassing the constraints of the finite context window while producing code that fulfills complex user-specified requirements. We empirically show that L2MAC succeeds in generating large code bases for system design tasks where other coding methods fall short in implementing user requirements and provide insight into the reasons for this performance gap.
Transfer learning refers to the process of adapting a model trained on a source task to a target task. While kernel methods are conceptually and computationally simple machine learning models that are competitive on a variety of tasks, it has been unclear how to perform transfer learning for kernel methods. In this work, we propose a transfer learning framework for kernel methods by projecting and translating the source model to the target task. We demonstrate the effectiveness of our framework in applications to image classification and virtual drug screening. In particular, we show that transferring modern kernels trained on large-scale image datasets can result in substantial performance increase as compared to using the same kernel trained directly on the target task. In addition, we show that transfer-learned kernels allow a more accurate prediction of the effect of drugs on cancer cell lines. For both applications, we identify simple scaling laws that characterize the performance of transfer-learned kernels as a function of the number of target examples. We explain this phenomenon in a simplified linear setting, where we are able to derive the exact scaling laws. By providing a simple and effective transfer learning framework for kernel methods, our work enables kernel methods trained on large datasets to be easily adapted to a variety of downstream target tasks.