Large language models, such as the well-known ChatGPT, have brought about an unexpected revolution in the field of artificial intelligence. On the one hand, they have numerous practical applications and enormous potential still to be explored. On the other hand, they are also the subject of debate from scientific, philosophical, and social perspectives: there are doubts about the exact mechanisms of their functioning and their actual capacity for language comprehension, and their applications raise ethical dilemmas. In this chapter, we describe how this technology came about and the fundamentals of its operation, allowing us to better understand its capabilities and limitations and to introduce some of the main debates surrounding its development and use.
Formal language theory is useful for the study of natural language. In particular, it is of interest to study the adequacy of grammatical formalisms for expressing the syntactic phenomena present in natural language. First, it helps to formulate hypotheses about the nature and complexity of the linguistic competence of speaker-hearers, a fundamental question in linguistics and other cognitive sciences. Moreover, from an engineering point of view, it reveals practical limitations of applications based on those formalisms. In this article I introduce the problem of the adequacy of grammatical formalisms for natural language, along with some concepts from formal language theory required for this discussion. I then review the formalisms that have been proposed throughout history, and the arguments that have been given to support or reject their adequacy.
This work proposes a "learning gradients" algorithm that exploits backpropagation gradients to determine feature importance at different stages of training. Additionally, we propose a way to rank input features by importance and to represent the learning process qualitatively across training stages. Experiments were performed on the Wisconsin breast cancer dataset provided by scikit-learn, and the results showed an interesting convergence of the learning gradients towards the most important features, as well as a pattern in their evolution over time.
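A minimal sketch of one plausible reading of this idea, using the scikit-learn Wisconsin breast cancer data and PyTorch: the gradient of the loss with respect to each input feature is tracked during training, and its mean magnitude serves as an importance signal. The architecture, hyperparameters, and the exact importance statistic are illustrative assumptions, not the authors' implementation.

```python
# Track input-gradient magnitudes ("learning gradients") across training and
# rank features by them. A sketch, not the paper's exact algorithm.
import torch
import torch.nn as nn
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = torch.tensor(StandardScaler().fit_transform(data.data), dtype=torch.float32)
y = torch.tensor(data.target, dtype=torch.float32).unsqueeze(1)

model = nn.Sequential(nn.Linear(30, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(200):
    X.requires_grad_(True)
    loss = loss_fn(model(X), y)
    opt.zero_grad()
    loss.backward()  # fills both parameter gradients and X.grad
    # "Learning gradient": mean absolute loss gradient per input feature.
    importance = X.grad.abs().mean(dim=0)
    X = X.detach()  # drop the graph before the next epoch
    opt.step()
    if epoch % 50 == 0:
        top = importance.topk(3).indices.tolist()
        print(f"epoch {epoch}: top features {[str(data.feature_names[i]) for i in top]}")
```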
In this paper we present a hybrid method for the automatic detection of dermatological pathologies in medical reports. We use a large language model for Spanish combined with medical ontologies to predict, given a first-appointment or follow-up medical report, the pathology a patient may suffer from. The results show that teaching the model the type, severity, and location on the body of a dermatological pathology, as well as the order in which it has to learn these three features, significantly increases its accuracy. The article demonstrates state-of-the-art results for the classification of medical texts, with a precision of 0.84 and micro and macro F1-scores of 0.82 and 0.75, respectively, and makes both the method and the dataset available to the community.
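The ordered learning of the three features can be pictured as a staged curriculum over a shared encoder. The sketch below is a toy illustration of that idea under stated assumptions (a bag-of-words encoder, synthetic batches, made-up class counts); it is not the authors' model, which combines a large language model with medical ontologies.

```python
# Staged curriculum sketch: fine-tune one shared encoder on auxiliary labels
# in a fixed order (type -> severity -> location) before the pathology label.
import torch
import torch.nn as nn

VOCAB, DIM = 5000, 64
encoder = nn.EmbeddingBag(VOCAB, DIM)  # toy stand-in for the language model

def synthetic_batches(n_classes, n_batches=10, batch_size=8, doc_len=20):
    """Random stand-in for labeled report batches (illustrative only)."""
    for _ in range(n_batches):
        token_ids = torch.randint(0, VOCAB, (batch_size * doc_len,))
        offsets = torch.arange(0, batch_size * doc_len, doc_len)
        labels = torch.randint(0, n_classes, (batch_size,))
        yield token_ids, offsets, labels

def train_stage(head, batches, epochs=3):
    """One curriculum stage: fit the shared encoder plus a task head."""
    opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for token_ids, offsets, labels in batches:
            loss = loss_fn(head(encoder(token_ids, offsets)), labels)
            opt.zero_grad(); loss.backward(); opt.step()

# Fixed curriculum order; class counts are made up for the sketch.
for task, n_classes in [("type", 12), ("severity", 4), ("location", 20), ("pathology", 30)]:
    train_stage(nn.Linear(DIM, n_classes), list(synthetic_batches(n_classes)))
```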
While large language models (LLMs) appear to be increasingly capable of solving compositional tasks, it is an open question whether they do so using compositional mechanisms. In this work, we investigate how feedforward LLMs solve two-hop factual recall tasks, which can be expressed compositionally as $g(f(x))$. We first confirm that modern LLMs continue to suffer from the "compositionality gap": their ability to compute both $z = f(x)$ and $y = g(z)$ does not entail their ability to compute the composition $y = g(f(x))$. Then, using the logit lens on their residual stream activations, we identify two processing mechanisms: one that solves tasks $\textit{compositionally}$, computing $f(x)$ along the way to computing $g(f(x))$, and one that solves them $\textit{directly}$, without any detectable signature of the intermediate variable $f(x)$. Finally, we find that which mechanism is employed appears to be related to the embedding space geometry, with the direct mechanism being dominant in cases where there exists a linear mapping from $x$ to $g(f(x))$ in the embedding spaces. We fully release our data and code at https://github.com/apoorvkh/composing-functions.
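A minimal sketch of a logit-lens probe in the spirit described above, using the TransformerLens library (an assumption; the authors' released code is at the repository linked above): each layer's residual stream at the final position is decoded through the final layer norm and unembedding to see whether the intermediate variable surfaces before the answer. The prompt and model are illustrative.

```python
# Decode every layer's residual stream into vocabulary space and look for the
# intermediate answer f(x) ("Paris") on the way to g(f(x)) ("France").
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "The country containing the Eiffel Tower is"  # two-hop: tower -> Paris -> France

logits, cache = model.run_with_cache(prompt)
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][0, -1]  # residual stream at the last position
    layer_logits = model.unembed(model.ln_final(resid.unsqueeze(0).unsqueeze(0)))
    top = layer_logits[0, 0].topk(3).indices
    print(layer, model.to_str_tokens(top))  # top-3 tokens "read off" this layer
```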
This study considers the estimation of the average treatment effect (ATE). For ATE estimation, we estimate the propensity score through direct bias-correction term estimation. Let $\{(X_i, D_i, Y_i)\}_{i=1}^{n}$ be the observations, where $X_i \in \mathbb{R}^p$ denotes $p$-dimensional covariates, $D_i \in \{0, 1\}$ denotes a binary treatment assignment indicator, and $Y_i \in \mathbb{R}$ is an outcome. In ATE estimation, the bias-correction term $h_0(X_i, D_i) = \frac{1[D_i = 1]}{e_0(X_i)} - \frac{1[D_i = 0]}{1 - e_0(X_i)}$ plays an important role, where $e_0(X_i)$ is the propensity score, the probability of being assigned treatment $1$. In this study, we propose estimating $h_0$ (or equivalently the propensity score $e_0$) by directly minimizing the prediction error of $h_0$. Since the bias-correction term $h_0$ is essential for ATE estimation, this direct approach is expected to improve estimation accuracy for the ATE. For example, existing studies often employ maximum likelihood or covariate balancing to estimate $e_0$, but these approaches may not be optimal for accurately estimating $h_0$ or the ATE. We present a general framework for this direct bias-correction term estimation approach from the perspective of Bregman divergence minimization and conduct simulation studies to evaluate the effectiveness of the proposed method.
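For context, the bias-correction term is exactly the weight that appears in the classical inverse probability weighting (IPW) and augmented IPW estimators of the ATE. The display below is the standard textbook form (the outcome regressions $\hat{\mu}_0, \hat{\mu}_1$ are not notation from this abstract):

$$\hat{\tau}_{\mathrm{IPW}} = \frac{1}{n}\sum_{i=1}^{n} h_0(X_i, D_i)\,Y_i, \qquad \hat{\tau}_{\mathrm{AIPW}} = \frac{1}{n}\sum_{i=1}^{n}\left[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + h_0(X_i, D_i)\left(Y_i - \hat{\mu}_{D_i}(X_i)\right)\right].$$

An error in $h_0$ propagates directly into both estimators, which motivates estimating $h_0$ by minimizing its own prediction error rather than fitting $e_0$ by, say, maximum likelihood.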
Neural networks are famously nonlinear. However, linearity is defined relative to a pair of vector spaces, $f: X \to Y$. Is it possible to identify a pair of non-standard vector spaces for which a conventionally nonlinear function is, in fact, linear? This paper introduces a method that makes such vector spaces explicit by construction. We find that if we sandwich a linear operator $A$ between two invertible neural networks, $f(x)=g_y^{-1}(A g_x(x))$, then the corresponding vector spaces $X$ and $Y$ are induced by newly defined addition and scaling actions derived from $g_x$ and $g_y$. We term this kind of architecture a Linearizer. This framework makes the entire arsenal of linear algebra, including SVD, pseudo-inverse, orthogonal projection and more, applicable to nonlinear mappings. Furthermore, we show that the composition of two Linearizers that share a neural network is also a Linearizer. We leverage this property and demonstrate that training diffusion models using our architecture makes the hundreds of sampling steps collapse into a single step. We further utilize our framework to enforce idempotency (i.e. $f(f(x))=f(x)$) on networks leading to a globally projective generative model and to demonstrate modular style transfer.
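A tiny numerical sketch of this construction, with toy invertible maps standing in for the networks (all choices here are illustrative assumptions): $f(x) = g_y^{-1}(A\,g_x(x))$ is linear with respect to the induced addition $x \oplus x' = g_x^{-1}(g_x(x) + g_x(x'))$ on $X$ and its analogue on $Y$.

```python
# Verify numerically that f(a (+) b) = f(a) (+) f(b) under induced additions.
import numpy as np

g_x, g_x_inv = np.arcsinh, np.sinh          # invertible "network" on the input side
g_y, g_y_inv = np.cbrt, lambda v: v**3      # invertible "network" on the output side
A = 2.0                                     # the sandwiched linear operator

f = lambda x: g_y_inv(A * g_x(x))

def add_x(a, b):  # induced addition on the input space X
    return g_x_inv(g_x(a) + g_x(b))

def add_y(a, b):  # induced addition on the output space Y
    return g_y_inv(g_y(a) + g_y(b))

a, b = 1.3, -0.7
lhs = f(add_x(a, b))      # f(a (+) b)
rhs = add_y(f(a), f(b))   # f(a) (+) f(b)
print(np.isclose(lhs, rhs))  # True: f is linear w.r.t. the induced operations
```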
Large language models require massive memory footprints, severely limiting deployment on consumer hardware. Quantization reduces memory through lower numerical precision, but extreme 2-bit quantization suffers from catastrophic performance loss due to outliers in activations. Rotation-based methods such as QuIP and QuaRot apply orthogonal transforms to eliminate outliers before quantization, using computational invariance: $\mathbf{y} = \mathbf{Wx} = (\mathbf{WQ}^T)(\mathbf{Qx})$ for orthogonal $\mathbf{Q}$. However, these methods use fixed transforms (Hadamard matrices achieving optimal worst-case coherence $\mu = 1/\sqrt{n}$) that cannot adapt to specific weight distributions. We identify that different transformer layers exhibit distinct outlier patterns, motivating layer-adaptive rotations rather than one-size-fits-all approaches. We propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. Unlike the discrete $\{+1, -1\}$ entries of Hadamard matrices, which are non-differentiable and preclude gradient-based learning, the continuous parameterization of butterfly transforms enables smooth optimization while guaranteeing orthogonality by construction. This orthogonal constraint preserves the theoretical guarantees of outlier suppression while achieving $O(n \log n)$ computational complexity with only $\frac{n \log n}{2}$ learnable parameters. We further introduce a uniformity regularization on post-transformation activations to promote smoother distributions amenable to quantization. Learning requires only 128 calibration samples and converges in minutes on a single GPU, a negligible one-time cost. On LLaMA-2-7B with 2-bit quantization, ButterflyQuant achieves a perplexity of 15.4 versus 22.1 for QuaRot.
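A generic sketch of a learnable butterfly orthogonal transform of the kind described above (this is a standard construction, not the ButterflyQuant code): $\log_2 n$ levels of $2 \times 2$ Givens rotations with continuous angles, giving $\frac{n \log_2 n}{2}$ parameters and orthogonality by construction.

```python
# Butterfly transform built from Givens rotations; orthogonal at any angles.
import torch

def butterfly_apply(x, thetas):
    """Apply a butterfly orthogonal transform to x of shape (batch, n).

    thetas: (log2(n), n // 2) tensor of Givens rotation angles."""
    batch, n = x.shape
    for level in range(thetas.shape[0]):
        stride = 1 << level
        theta = thetas[level].reshape(n // (2 * stride), stride)
        c, s = torch.cos(theta), torch.sin(theta)
        x = x.reshape(batch, n // (2 * stride), 2, stride)  # pair i with i + stride
        a, b = x[:, :, 0, :], x[:, :, 1, :]
        x = torch.stack((c * a - s * b, s * a + c * b), dim=2).reshape(batch, n)
    return x

n = 8
thetas = torch.randn(3, n // 2, requires_grad=True)  # log2(8) = 3 levels, differentiable
x = torch.randn(4, n)
y = butterfly_apply(x, thetas)
print(torch.allclose(x.norm(dim=1), y.norm(dim=1), atol=1e-5))  # orthogonal: norms preserved
```

Because the angles are continuous, the rotation can be fitted by gradient descent on a calibration objective, while every setting of the angles remains exactly orthogonal.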
Poisson denoising plays a central role in photon-limited imaging applications such as microscopy, astronomy, and medical imaging. It is common to train deep learning models for denoising using the mean-squared error (MSE) loss, which corresponds to computing the posterior mean $\mathbb{E}[x \mid y]$. When the noise is Gaussian, Tweedie's formula enables approximation of the posterior distribution through its higher-order moments. However, this connection no longer holds for Poisson denoising: while $\mathbb{E}[x \mid y]$ still minimizes the MSE, it fails to capture posterior uncertainty. We propose a new strategy for Poisson denoising based on training a log-network. Instead of predicting the posterior mean $\mathbb{E}[x \mid y]$, the log-network is trained to learn $\mathbb{E}[\log x \mid y]$, leveraging the logarithm as a convenient parameterization of the Poisson distribution. We provide a theoretical proof that the proposed log-network enables recovery of higher-order posterior moments and thus supports posterior approximation. Experiments on simulated data show that our method matches the denoising performance of standard MMSE models while providing access to the posterior.
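A minimal sketch of the log-network training objective under toy assumptions (one-dimensional intensities and a small MLP; not the paper's architecture or data): the network sees Poisson-corrupted observations and regresses $\log x$ instead of $x$, so its output estimates $\mathbb{E}[\log x \mid y]$.

```python
# Train a "log-network": regress log(x) from y ~ Poisson(x).
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    x = torch.rand(256, 1) * 20 + 1               # clean intensities (kept > 0)
    y = torch.poisson(x)                          # photon-limited observation
    loss = ((net(y) - torch.log(x)) ** 2).mean()  # regress log x, not x
    opt.zero_grad(); loss.backward(); opt.step()

# net(y) now approximates E[log x | y]; per the paper's theory, this
# parameterization supports recovery of higher-order posterior moments.
print(net(torch.tensor([[5.0]])).item())
```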
This technical report analyzes the challenge of "hallucinations" (false information) in LLMs applied to law. It examines their causes, their manifestations, and the effectiveness of the RAG mitigation strategy, highlighting its limitations and proposing holistic optimizations. The report explores the ethical and regulatory implications, emphasizing the irreplaceable role of human oversight. It concludes that the solution lies not in incrementally improving generative models, but in adopting a "consultative" AI paradigm that prioritizes veracity and traceability, acting as a tool to amplify, rather than replace, professional judgment.