Vision-language (VL) models, pretrained on colossal image-text datasets, have attained broad VL competence that is difficult to evaluate. A common belief is that a small number of VL skills underlie the variety of VL tests. In this paper, we perform a large-scale transfer learning experiment aimed at discovering latent VL skills from data. We reveal interesting characteristics that have important implications for test suite design. First, generation tasks suffer from a length bias, suggesting benchmarks should balance tasks with varying output lengths. Second, we demonstrate that factor analysis successfully identifies reasonable yet surprising VL skill factors, suggesting benchmarks could leverage similar analyses for task selection. Finally, we present a new dataset, OLIVE (https://github.com/jq-zh/olive-dataset), which simulates user instructions in the wild and presents challenges dissimilar to all datasets we tested. Our findings contribute to the design of balanced and broad-coverage vision-language evaluation methods.
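To make the factor-analysis idea concrete, here is a minimal sketch assuming a hypothetical matrix of transfer scores (models × tasks); it uses a truncated SVD on the centered scores as a simplified stand-in for full factor analysis, where the right singular vectors group tasks that co-vary and thus suggest candidate latent "skill" factors. All names and shapes are illustrative, not from the paper.

```python
import numpy as np

# Hypothetical transfer-learning scores: rows = 8 source models,
# columns = 6 VL evaluation tasks (purely illustrative values).
rng = np.random.default_rng(0)
scores = rng.random((8, 6))

# Center each task's column, then take a truncated SVD as a simplified
# stand-in for factor analysis: tasks with similar loadings on a factor
# co-vary across models, i.e. they plausibly share a latent VL skill.
centered = scores - scores.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

k = 2                               # number of latent skills kept (a modeling choice)
task_loadings = Vt[:k].T            # (6 tasks, k factors)
explained = S[:k] ** 2 / (S ** 2).sum()  # fraction of variance captured
```

In a real analysis one would use a proper factor-analysis fit (e.g. with rotation) and inspect which benchmarks load on each factor before selecting a balanced task set.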
General-purpose language models that can solve various language-domain tasks have emerged, driven by the pre-training and instruction-tuning pipeline. However, building general-purpose vision-language models is challenging due to the increased task discrepancy introduced by the additional visual input. Although vision-language pre-training has been widely studied, vision-language instruction tuning remains relatively underexplored. In this paper, we conduct a systematic and comprehensive study of vision-language instruction tuning based on the pre-trained BLIP-2 models. We gather 26 publicly available datasets covering a wide variety of tasks, transform them into instruction-tuning format, and categorize them into two clusters for held-in instruction tuning and held-out zero-shot evaluation. Additionally, we introduce instruction-aware visual feature extraction, a crucial method that enables the model to extract informative features tailored to the given instruction. The resulting InstructBLIP models achieve state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA IMG). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models have been open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
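The core of instruction-aware feature extraction can be sketched as follows: instruction tokens join the learnable query tokens before cross-attending to frozen image features, so the extracted visual features depend on the instruction. This is a heavily simplified, single-head numpy illustration of the conditioning idea, not the actual Q-Former implementation; all dimensions and variable names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 16
image_feats = rng.standard_normal((50, d))   # 50 patch features from a frozen image encoder
queries = rng.standard_normal((8, d))        # learnable query tokens
instr = rng.standard_normal((5, d))          # embedded instruction tokens

# Instruction-aware extraction (sketch): the instruction tokens are
# concatenated with the queries, so the cross-attention over image
# patches is conditioned on what the instruction asks about.
q = np.concatenate([queries, instr], axis=0)
attn = softmax(q @ image_feats.T / np.sqrt(d))  # (13, 50) attention weights
extracted = attn @ image_feats                  # (13, d) instruction-conditioned features
```

Changing the instruction changes `attn` and therefore which visual content is emphasized, which is the property the method relies on.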
Monitoring awkward postures is a proactive way to prevent Musculoskeletal Disorders (MSDs) in construction. Machine Learning (ML) models have shown promising results for posture recognition from wearable sensors. However, further investigation is needed concerning: i) Incremental Learning (IL), where trained models adapt to learn new postures while controlling the forgetting of previously learned postures; and ii) MSD assessment based on recognized postures. This study proposed an incremental Convolutional Long Short-Term Memory (CLN) model, investigated effective IL strategies, and evaluated MSD assessment using recognized postures. Tests with nine workers showed that the CLN model with shallow convolutional layers achieved high recognition performance (F1 score) under both personalized (0.87) and generalized (0.84) modeling. The generalized shallow CLN model under the Many-to-One IL scheme balanced adaptation to new subjects (0.73) against forgetting of learned subjects (0.74). MSD assessments based on postures recognized by the incremental CLN model differed only slightly from those based on ground truth, demonstrating high potential for automated MSD monitoring in construction.
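The adaptation-versus-forgetting trade-off above can be sketched with the metric behind both numbers: a macro F1 score computed once on the newly added subject (adaptation) and once on previously learned subjects (retention after an incremental update). The labels and predictions below are hypothetical, purely to show the evaluation shape; they are not the study's data.

```python
import numpy as np

def macro_f1(y_true, y_pred, labels):
    # Per-class F1 averaged over classes: the recognition metric used for
    # both adaptation (new subject) and forgetting (old subjects) here.
    f1s = []
    for c in labels:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))

# Hypothetical posture labels (classes 0-2) after an incremental update:
y_new_true = np.array([0, 1, 2, 1, 0, 2])
y_new_pred = np.array([0, 1, 2, 0, 0, 2])   # new subject  -> adaptation score
y_old_true = np.array([0, 0, 1, 2, 2, 1])
y_old_pred = np.array([0, 1, 1, 2, 0, 1])   # old subjects -> forgetting score

adaptation = macro_f1(y_new_true, y_new_pred, labels=[0, 1, 2])
retention = macro_f1(y_old_true, y_old_pred, labels=[0, 1, 2])
```

An IL scheme "balances" the two when an update raises `adaptation` without letting `retention` collapse, which is what the Many-to-One scheme is reported to achieve.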