Multimodal Large Language Models (MLLMs) have recently shown remarkable perceptual capability in answering visual questions, however, little is known about the limits of their perception. In particular, while prior works have provided anecdotal evidence of MLLMs' sensitivity to object size, this phenomenon and its underlying causes have not been explored comprehensively. In this work, we quantitatively study the perception of small visual objects in several state-of-the-art MLLMs and reveal a pervasive limitation in answering questions about small objects in images. Next, we identify four independent factors that can contribute to this limitation -- object quality, size, distractors, and location -- and conduct controlled intervention studies to measure the effect of each factor on MLLMs' perception. In particular, we find that lower object quality and smaller object size can both independently reduce MLLMs' ability to answer visual questions. More surprisingly, we find that the location of the object in the image and the presence of visual distractors can also significantly reduce MLLMs' question answering accuracy. Our study provides a better understanding of the perceptual limitation of MLLMs and contributes new evaluation protocols for analyzing the perception of future MLLMs. To facilitate further investigations, we release our code and data.
Internet memes have emerged as a novel format for communication and expressing ideas on the web. Their fluidity and creative nature are reflected in their widespread use, often across platforms and occasionally for unethical or harmful purposes. While computational work has already analyzed their high-level virality over time and developed specialized classifiers for hate speech detection, there have been no efforts to date that aim to holistically track, identify, and map internet memes posted on social media. To bridge this gap, we investigate whether internet memes across social media platforms can be contextualized by using a semantic repository of knowledge, namely, a knowledge graph. We collect thousands of potential internet meme posts from two social media platforms, namely Reddit and Discord, and perform an extract-transform-load procedure to create a data lake with candidate meme posts. By using vision transformer-based similarity, we match these candidates against the memes cataloged in a recently released knowledge graph of internet memes, IMKG. We provide evidence that memes published online can be identified by mapping them to IMKG. We leverage this grounding to study the prevalence of memes on different platforms, discover popular memes, and select common meme channels and subreddits. Finally, we illustrate how the grounding can enable users to get context about memes on social media thanks to their link to the knowledge graph.
Downstream applications often require text classification models to be accurate, robust, and interpretable. While the accuracy of the stateof-the-art language models approximates human performance, they are not designed to be interpretable and often exhibit a drop in performance on noisy data. The family of PrototypeBased Networks (PBNs) that classify examples based on their similarity to prototypical examples of a class (prototypes) is natively interpretable and shown to be robust to noise, which enabled its wide usage for computer vision tasks. In this paper, we study whether the robustness properties of PBNs transfer to text classification tasks. We design a modular and comprehensive framework for studying PBNs, which includes different backbone architectures, backbone sizes, and objective functions. Our evaluation protocol assesses the robustness of models against character-, word-, and sentence-level perturbations. Our experiments on three benchmarks show that the robustness of PBNs transfers to NLP classification tasks facing realistic perturbations. Moreover, the robustness of PBNs is supported mostly by the objective function that keeps prototypes interpretable, while the robustness superiority of PBNs over vanilla models becomes more salient as datasets get more complex.
Multimodal Large Language Models (LLMs) have recently achieved promising zero-shot accuracy on visual question answering (VQA) -- a fundamental task affecting various downstream applications and domains. Given the great potential for the broad use of these models, it is important to investigate their limitations in dealing with different image and question properties. In this work, we investigate whether multimodal LLMs can perceive small details as well as large details in images. In particular, we show that their zero-shot accuracy in answering visual questions is very sensitive to the size of the visual subject of the question, declining up to $46\%$ with size. Furthermore, we show that this effect is causal by observing that human visual cropping can significantly mitigate their sensitivity to size. Inspired by the usefulness of human cropping, we then propose three automatic visual cropping methods as inference time mechanisms to improve the zero-shot performance of multimodal LLMs. We study their effectiveness on four popular VQA datasets, and a subset of the VQAv2 dataset tailored towards fine visual details. Our findings suggest that multimodal LLMs should be used with caution in detail-sensitive VQA applications, and that visual cropping is a promising direction to improve their zero-shot performance. Our code and data are publicly available.
The success of language models has inspired the NLP community to attend to tasks that require implicit and complex reasoning, relying on human-like commonsense mechanisms. While such vertical thinking tasks have been relatively popular, lateral thinking puzzles have received little attention. To bridge this gap, we devise BRAINTEASER: a multiple-choice Question Answering task designed to test the model's ability to exhibit lateral thinking and defy default commonsense associations. We design a three-step procedure for creating the first lateral thinking benchmark, consisting of data collection, distractor generation, and generation of adversarial examples, leading to 1,100 puzzles with high-quality annotations. To assess the consistency of lateral reasoning by models, we enrich BRAINTEASER based on a semantic and contextual reconstruction of its questions. Our experiments with state-of-the-art instruction- and commonsense language models reveal a significant gap between human and model performance, which is further widened when consistency across adversarial formats is considered. We make all of our code and data available to stimulate work on developing and evaluating lateral thinking models.
Analogical reasoning is one of the prime abilities of humans and is linked to creativity and scientific discoveries. This ability has been studied extensively in natural language processing (NLP) as well as in cognitive psychology by proposing various benchmarks and evaluation setups. Yet, a substantial gap exists between evaluations of analogical reasoning in cognitive psychology and NLP. Our aim is to bridge this by computationally adapting theories related to analogical reasoning from cognitive psychology in the context of narratives and developing an evaluation framework large in scale. More concretely, we propose the task of matching narratives based on system mappings and release the Analogical Reasoning on Narratives (ARN) dataset. To create the dataset, we devise a framework inspired by cognitive psychology theories about analogical reasoning to utilize narratives and their components to form mappings of different abstractness levels. These mappings are then leveraged to create pairs of analogies and disanalogies/distractors with more than 1k triples of query narratives, analogies, and distractors. We cover four categories of far/near analogies and far/near distractors that allow us to study analogical reasoning in models from distinct perspectives. In this study, we evaluate different large language models (LLMs) on this task. Our results demonstrate that LLMs struggle to recognize higher-order mappings when they are not accompanied by lower-order mappings (far analogies) and show better performance when all mappings are present simultaneously (near analogies). We observe that in all the settings, the analogical reasoning abilities of LLMs can be easily impaired by near distractors that form lower-order mappings with the query narratives.
Food computing has emerged as a prominent multidisciplinary field of research in recent years. An ambitious goal of food computing is to develop end-to-end intelligent systems capable of autonomously producing recipe information for a food image. Current image-to-recipe methods are retrieval-based and their success depends heavily on the dataset size and diversity, as well as the quality of learned embeddings. Meanwhile, the emergence of powerful attention-based vision and language models presents a promising avenue for accurate and generalizable recipe generation, which has yet to be extensively explored. This paper proposes FIRE, a novel multimodal methodology tailored to recipe generation in the food computing domain, which generates the food title, ingredients, and cooking instructions based on input food images. FIRE leverages the BLIP model to generate titles, utilizes a Vision Transformer with a decoder for ingredient extraction, and employs the T5 model to generate recipes incorporating titles and ingredients as inputs. We showcase two practical applications that can benefit from integrating FIRE with large language model prompting: recipe customization to fit recipes to user preferences and recipe-to-code transformation to enable automated cooking processes. Our experimental findings validate the efficacy of our proposed approach, underscoring its potential for future advancements and widespread adoption in food computing.
Knowledge engineering is the process of creating and maintaining knowledge-producing systems. Throughout the history of computer science and AI, knowledge engineering workflows have been widely used because high-quality knowledge is assumed to be crucial for reliable intelligent agents. However, the landscape of knowledge engineering has changed, presenting four challenges: unaddressed stakeholder requirements, mismatched technologies, adoption barriers for new organizations, and misalignment with software engineering practices. In this paper, we propose to address these challenges by developing a reference architecture using a mainstream software methodology. By studying the requirements of different stakeholders and eras, we identify 23 essential quality attributes for evaluating reference architectures. We assess three candidate architectures from recent literature based on these attributes. Finally, we discuss the next steps towards a comprehensive reference architecture, including prioritizing quality attributes, integrating components with complementary strengths, and supporting missing socio-technical requirements. As this endeavor requires a collaborative effort, we invite all knowledge engineering researchers and practitioners to join us.
Intelligent Traffic Monitoring (ITMo) technologies hold the potential for improving road safety/security and for enabling smart city infrastructure. Understanding traffic situations requires a complex fusion of perceptual information with domain-specific and causal commonsense knowledge. Whereas prior work has provided benchmarks and methods for traffic monitoring, it remains unclear whether models can effectively align these information sources and reason in novel scenarios. To address this assessment gap, we devise three novel text-based tasks for situational reasoning in the traffic domain: i) BDD-QA, which evaluates the ability of Language Models (LMs) to perform situational decision-making, ii) TV-QA, which assesses LMs' abilities to reason about complex event causality, and iii) HDT-QA, which evaluates the ability of models to solve human driving exams. We adopt four knowledge-enhanced methods that have shown generalization capability across language reasoning tasks in prior work, based on natural language inference, commonsense knowledge-graph self-supervision, multi-QA joint training, and dense retrieval of domain information. We associate each method with a relevant knowledge source, including knowledge graphs, relevant benchmarks, and driving manuals. In extensive experiments, we benchmark various knowledge-aware methods against the three datasets, under zero-shot evaluation; we provide in-depth analyses of model performance on data partitions and examine model predictions categorically, to yield useful insights on traffic understanding, given different background knowledge and reasoning strategies.
* 11 pages, 6 figures, 5 tables, camera ready version of SIGKDD 2023
Visual Question Answering is a challenging task, as it requires seamless interaction between perceptual, linguistic, and background knowledge systems. While the recent progress of visual and natural language models like BLIP has led to improved performance on this task, we lack understanding of the ability of such models to perform on different kinds of questions and reasoning types. As our initial analysis of BLIP-family models revealed difficulty with answering fine-detail questions, we investigate the following question: Can visual cropping be employed to improve the performance of state-of-the-art visual question answering models on fine-detail questions? Given the recent success of the BLIP-family models, we study a zero-shot and a fine-tuned BLIP model. We define three controlled subsets of the popular VQA-v2 benchmark to measure whether cropping can help model performance. Besides human cropping, we devise two automatic cropping strategies based on multi-modal embedding by CLIP and BLIP visual QA model gradients. Our experiments demonstrate that the performance of BLIP model variants can be significantly improved through human cropping, and automatic cropping methods can produce comparable benefits. A deeper dive into our findings indicates that the performance enhancement is more pronounced in zero-shot models than in fine-tuned models and more salient with smaller bounding boxes than larger ones. We perform case studies to connect quantitative differences with qualitative observations across question types and datasets. Finally, we see that the cropping enhancement is robust, as we gain an improvement of 4.59% (absolute) in the general VQA-random task by simply inputting a concatenation of the original and gradient-based cropped images. We make our code available to facilitate further innovation on visual cropping methods for question answering.