Abstract: Multimodal Large Language Models (MLLMs), which can answer complex questions about an image, struggle to tell the time on analog clocks. This is probably due to the lack of images of clocks showing different times in their training sets. In this work we explore this issue with one of the latest MLLMs, GPT-4.1, to understand why MLLMs fail to tell the time and whether fine-tuning can solve the problem. The results show that models are making progress in reading the time on analog clocks. But have they really learned to do it, or have they only learned patterns present in their training datasets? We put the models to the test with different clocks to illustrate the limitations of MLLMs in abstracting and generalizing.
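A minimal sketch of how such a probe could be built, assuming the Pillow library: rendering a synthetic analog clock at an arbitrary time, so that the question cannot be answered by recalling familiar training images. The chosen time, image size, and styling are illustrative, not the paper's actual test set.

```python
# Hypothetical sketch: render a synthetic analog clock at an arbitrary time
# to probe whether an MLLM generalizes beyond familiar clock faces.
import math
from PIL import Image, ImageDraw

def draw_clock(hour: int, minute: int, size: int = 256) -> Image.Image:
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    cx = cy = size // 2
    radius = size // 2 - 10
    draw.ellipse([cx - radius, cy - radius, cx + radius, cy + radius],
                 outline="black", width=3)
    # Hour ticks every 30 degrees around the dial.
    for h in range(12):
        a = math.radians(h * 30)
        x1, y1 = cx + 0.85 * radius * math.sin(a), cy - 0.85 * radius * math.cos(a)
        x2, y2 = cx + 0.95 * radius * math.sin(a), cy - 0.95 * radius * math.cos(a)
        draw.line([x1, y1, x2, y2], fill="black", width=2)
    # Hour hand: 30 deg per hour plus 0.5 deg per minute; minute hand: 6 deg per minute.
    hour_a = math.radians((hour % 12) * 30 + minute * 0.5)
    min_a = math.radians(minute * 6)
    draw.line([cx, cy, cx + 0.5 * radius * math.sin(hour_a),
               cy - 0.5 * radius * math.cos(hour_a)], fill="black", width=5)
    draw.line([cx, cy, cx + 0.8 * radius * math.sin(min_a),
               cy - 0.8 * radius * math.cos(min_a)], fill="black", width=3)
    return img

draw_clock(4, 37).save("clock_04_37.png")  # an uncommon time, unlikely to be memorized
```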
Abstract: The proliferation of Generative Artificial Intelligence (AI), especially Large Language Models, presents transformative opportunities for urban applications through Urban Foundation Models. However, base models face limitations, as they only contain the knowledge available at the time of training, and updating them is both time-consuming and costly. Retrieval Augmented Generation (RAG) has emerged in the literature as the preferred approach for injecting contextual information into Foundation Models. It prevails over techniques such as fine-tuning, which are less effective in dynamic, real-time scenarios like those found in urban environments. However, traditional RAG architectures, based on semantic databases, knowledge graphs, structured data, or AI-powered web searches, do not fully meet the demands of urban contexts. Urban environments are complex systems characterized by large volumes of interconnected data, frequent updates, real-time processing requirements, security needs, and strong links to the physical world. This work proposes a real-time spatial RAG architecture that defines the components necessary for the effective integration of generative AI into cities, leveraging temporal and spatial filtering capabilities through linked data. The proposed architecture is implemented using FIWARE, an ecosystem of software components for developing smart city solutions and digital twins. The design and implementation are demonstrated through the use case of a tourism assistant in the city of Madrid, which serves to validate the correct integration of Foundation Models through the proposed RAG architecture.
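As an illustration of the spatial filtering stage, the sketch below queries an NGSI-LD context broker (such as FIWARE Orion-LD) for entities near the user and returns them as grounding context for the Foundation Model. The broker URL, entity type, and distance are placeholders, not the exact configuration of the Madrid deployment.

```python
# Minimal sketch of the spatial filtering step of a RAG pipeline over linked data,
# assuming an NGSI-LD context broker (e.g., FIWARE Orion-LD) at BROKER_URL and a
# hypothetical "PointOfInterest" entity type; names and URL are illustrative.
import requests

BROKER_URL = "http://localhost:1026/ngsi-ld/v1/entities"

def nearby_entities(lon: float, lat: float, max_distance_m: int = 500):
    """Retrieve entities within max_distance_m of the user to build the LLM context."""
    params = {
        "type": "PointOfInterest",
        "georel": f"near;maxDistance=={max_distance_m}",
        "geometry": "Point",
        "coordinates": f"[{lon},{lat}]",
    }
    resp = requests.get(BROKER_URL, params=params,
                        headers={"Accept": "application/ld+json"})
    resp.raise_for_status()
    return resp.json()

# Entities near Puerta del Sol, Madrid, to be serialized into the prompt as context.
context_entities = nearby_entities(lon=-3.7038, lat=40.4168)
```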
Abstract: Large language models (LLMs) struggle with simple tasks such as counting the number of occurrences of a letter in a word. In this paper, we investigate whether ChatGPT can learn to count letters and propose an efficient solution.
Abstract: This column advocates for including artificial intelligence (AI)-specific metadata in academic papers written with the help of AI, as a means to analyze the use of such tools for disseminating research.
Abstract: The speed of open-weights large language models (LLMs) running on GPUs, and its dependence on the task at hand, is studied to present a comparative analysis of the most popular open LLMs.
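A minimal sketch of how generation speed can be measured in tokens per second with Hugging Face transformers; the model identifier, prompt, and generation length are placeholders rather than the paper's benchmark setup.

```python
# Illustrative sketch (not the paper's harness): measure generation speed in
# tokens/second for an open-weights model on a GPU.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder open-weights model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "Summarize the benefits of edge computing in two sentences."
inputs = tok(prompt, return_tensors="pt").to("cuda")

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s")  # speed depends on model size and task
```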
Abstract: One of the most widely used methods to evaluate LLMs is Multiple Choice Question (MCQ) tests. MCQ benchmarks enable the testing of LLM knowledge on almost any topic at scale, as the results can be processed automatically. To help the LLM answer, a few examples, known as few shots, can be included in the prompt. Moreover, the LLM can be asked to answer the question directly with the selected option or to first provide the reasoning and then the selected answer, which is known as chain of thought. In addition to checking whether the selected answer is correct, the evaluation can look at the LLM-estimated probability of its response as an indication of its confidence in that response. In this paper, we study how the LLM's confidence in its answer depends on whether the model has been asked to answer directly or to provide the reasoning before answering. The results of evaluating questions on a wide range of topics across seven different models show that LLMs are more confident in their answers when they provide reasoning before the answer. This occurs regardless of whether the selected answer is correct. Our hypothesis is that this behavior is due to the reasoning modifying the probability of the selected answer, as the LLM predicts the answer based on the input question and the reasoning that supports the selection made. Therefore, LLM-estimated probabilities seem to have intrinsic limitations that should be understood in order to use them in evaluation procedures. Interestingly, the same behavior has been observed in humans, for whom explaining an answer increases confidence in its correctness.
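A hedged sketch of the confidence measurement: the probability that a local model assigns to the selected option letter, computed once for a direct-answer prompt and once for a prompt that includes the reasoning. The model, question, and reasoning text are illustrative; the paper's exact prompts and models may differ.

```python
# Sketch: compare the next-token probability of the chosen option with and
# without chain-of-thought text in the prompt. Model and prompts are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # placeholder open model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

def option_probability(prefix: str, option: str) -> float:
    """Probability of the option letter as the next token after the prefix."""
    ids = tok(prefix, return_tensors="pt").to("cuda")
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    option_id = tok(option, add_special_tokens=False)["input_ids"][0]
    return probs[option_id].item()

question = "Q: Which planet is closest to the Sun?\nA) Venus B) Mercury C) Mars D) Earth\n"
direct = question + "Answer: "
with_cot = question + "Reasoning: Mercury has the smallest orbit, so it is closest.\nAnswer: "
print(option_probability(direct, "B"), option_probability(with_cot, "B"))
```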
Abstract: Large Language Models (LLMs) have achieved unprecedented performance on many complex tasks, being able, for example, to answer questions on almost any topic. However, they struggle with other simple tasks, such as counting the occurrences of letters in a word, as illustrated by the inability of many LLMs to count the number of "r" letters in "strawberry". Several works have studied this problem and linked it to the tokenization used by LLMs, to the intrinsic limitations of the attention mechanism, or to the lack of character-level training data. In this paper, we conduct an experimental study to evaluate the relation between LLM errors when counting letters and 1) the frequency of the word and its components in the training dataset and 2) the complexity of the counting operation. We present a comprehensive analysis of the errors of LLMs when counting letter occurrences by evaluating a representative group of models over a large number of words. The results show a number of consistent trends in the models evaluated: 1) models are capable of recognizing the letters but not counting them; 2) the frequency of the word and of its tokens does not have a significant impact on LLM errors; 3) there is a positive correlation between letter frequency and errors, with more frequent letters tending to have more counting errors; 4) the errors show a strong correlation with the number of letters or tokens in a word; and 5) the strongest correlation occurs with the number of letters that appear more than once, with most models being unable to correctly count words in which letters appear more than twice.
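For illustration, the following sketch computes the ground-truth quantities that the correlations refer to: the word length, the number of distinct letters, and the number of letters that appear more than once (the feature with the strongest correlation with errors). It is a simple example of the kind of features compared against model answers, not the paper's analysis code.

```python
# Ground-truth letter statistics for a word, as used to study counting errors.
from collections import Counter

def counting_features(word: str) -> dict:
    counts = Counter(word.lower())
    return {
        "letters": len(word),
        "distinct_letters": len(counts),
        "letters_with_count_gt1": sum(1 for c in counts.values() if c > 1),
        "max_count": max(counts.values()),
    }

print(counting_features("strawberry"))
# {'letters': 10, 'distinct_letters': 8, 'letters_with_count_gt1': 1, 'max_count': 3}
```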
Abstract: The rise of AI and the Internet of Things is accelerating the digital transformation of society. Mobility computing presents specific barriers due to its real-time requirements, decentralization, and connectivity through wireless networks. New research on edge computing and tiny machine learning (tinyML) explores the execution of AI models on low-performance devices to address these issues. However, few studies propose agnostic architectures that manage the entire lifecycle of intelligent cyberphysical systems. This article extends a previous architecture based on FIWARE software components to implement the machine learning operations (MLOps) flow, enabling the management of the entire tinyML lifecycle in cyberphysical systems. We also provide a use case to showcase how to implement the FIWARE architecture through a complete example of a smart traffic system. We conclude that the FIWARE ecosystem constitutes a solid reference option for developing tinyML and edge computing in cyberphysical systems.
Abstract: Automatic Speech Recognition (ASR), or Speech-to-Text (STT), has greatly evolved in the last few years. Traditional architectures based on pipelines have been replaced by joint end-to-end (E2E) architectures that simplify and streamline the model training process. In addition, new AI training methods, such as weakly supervised learning, have reduced the need for high-quality audio datasets for model training. However, despite all these advancements, little to no research has been done on real-time transcription. In real-time scenarios, the audio is not pre-recorded, and the input audio must be fragmented to be processed by the ASR systems. To achieve real-time requirements, these fragments must be as short as possible to reduce latency. However, audio cannot be split at any point, as dividing an utterance into two separate fragments will generate an incorrect transcription. Also, shorter fragments provide less context for the ASR model. For this reason, it is necessary to design and test different splitting algorithms to optimize the quality and delay of the resulting transcription. In this paper, three audio splitting algorithms are evaluated with different ASR models to determine their impact on both the quality of the transcription and the end-to-end delay. The algorithms are fragmentation at fixed intervals, voice activity detection (VAD), and fragmentation with feedback. The results are compared to the performance of the same model without audio fragmentation to determine the effects of this division. The results show that VAD fragmentation provides the best quality with the highest delay, whereas fragmentation at fixed intervals provides the lowest quality and the lowest delay. The newly proposed feedback algorithm trades a 2-4% increase in WER for a 1.5-2 s reduction in delay with respect to VAD splitting.
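The sketch below illustrates two of the evaluated strategies on 16 kHz, 16-bit mono PCM audio: fragmentation at fixed intervals and a simple VAD-based split that closes a fragment after a run of non-speech frames. It assumes the webrtcvad package; the frame size, aggressiveness, and silence threshold are illustrative, not the paper's exact parameters.

```python
# Sketch of two audio splitting strategies for streaming ASR input.
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit samples

def split_fixed(pcm: bytes, interval_s: float = 5.0):
    """Cut the stream every interval_s seconds, regardless of speech content."""
    chunk = int(SAMPLE_RATE * interval_s) * 2
    return [pcm[i:i + chunk] for i in range(0, len(pcm), chunk)]

def split_vad(pcm: bytes, aggressiveness: int = 2, min_silence_frames: int = 10):
    """Close a fragment after a run of non-speech frames, so utterances stay whole."""
    vad = webrtcvad.Vad(aggressiveness)
    fragments, current, silence = [], b"", 0
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[i:i + FRAME_BYTES]
        current += frame
        silence = 0 if vad.is_speech(frame, SAMPLE_RATE) else silence + 1
        if silence >= min_silence_frames:
            fragments.append(current)
            current, silence = b"", 0
    if current:
        fragments.append(current)
    return fragments
```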
Abstract: The implementation of machine learning (ML) in Internet of Things devices poses significant operational challenges due to limited energy and computation resources. In recent years, significant efforts have been made to implement simplified ML models that can achieve reasonable performance while reducing computation and energy, for example by pruning weights in neural networks or by using reduced precision for the parameters and arithmetic operations. However, this type of approach is limited by the performance of the ML implementation, i.e., by the loss, for example in accuracy, due to the model simplification. In this article, we present adaptive resolution inference (ARI), a novel approach that enables the evaluation of new tradeoffs between energy dissipation and model performance in ML implementations. The main principle of the proposed approach is to run inferences with reduced precision (quantization) and use the margin over the decision threshold to determine whether the result is reliable or the inference must be run with the full model. The rationale is that quantization only introduces small deviations in the inference scores, such that if the scores have a sufficient margin over the decision threshold, it is unlikely that the full model would produce a different result. Therefore, we can run the quantized model first and run the full model only when the scores do not have a sufficient margin. This enables most inferences to run with the reduced-precision model, with only a small fraction requiring the full model, thus significantly reducing computation and energy while not affecting model performance. The proposed ARI approach is presented, analyzed in detail, and evaluated using different datasets for floating-point and stochastic computing implementations. The results show that ARI can significantly reduce the energy for inference in different configurations, with savings between 40% and 85%.
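A minimal sketch of the ARI decision rule, assuming a binary classifier with a score threshold: the quantized model is run first, and the full model is invoked only when the score margin over the threshold is too small. The models and the margin value below are placeholders, not the paper's evaluated configurations.

```python
# Sketch of adaptive resolution inference (ARI): escalate to the full-precision
# model only for low-margin (uncertain) inputs.
import numpy as np

def ari_predict(x, quantized_model, full_model, threshold=0.5, margin=0.1):
    """Return a label, using the full model only when the quantized score is close to the threshold."""
    score_q = quantized_model(x)              # cheap, reduced-precision inference
    if abs(score_q - threshold) >= margin:    # confident: quantization error unlikely to flip the decision
        return int(score_q >= threshold)
    score_f = full_model(x)                   # uncertain: escalate to the full model
    return int(score_f >= threshold)

# Toy usage with stand-in models (the "quantized" scorer is a noisy copy of the full one).
rng = np.random.default_rng(0)
full = lambda x: 1 / (1 + np.exp(-x.sum()))
quant = lambda x: full(x) + rng.normal(0, 0.02)
print(ari_predict(rng.normal(size=8), quant, full))
```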