Image captioning aims to describe visual content in natural language. Since 'a picture is worth a thousand words', an image can have many correct descriptions. However, with maximum likelihood estimation as the training objective, the captioning model is penalized whenever its prediction mismatches the label. For instance, when the model predicts a word expressing richer semantics than the label, it is penalized and optimized to prefer more concise expressions, which we refer to as conciseness optimization. In contrast, predictions that are more concise than the labels lead to richness optimization. Such conflicting optimization directions can eventually result in the model generating generic descriptions. In this work, we introduce Semipermeable MaxImum Likelihood Estimation (SMILE), which allows richness optimization while blocking conciseness optimization, thus encouraging the model to generate longer captions with more details. Extensive experiments on two mainstream image captioning datasets, MSCOCO and Flickr30K, demonstrate that SMILE significantly enhances the descriptiveness of generated captions. We further provide in-depth investigations to facilitate a better understanding of how SMILE works.
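To make the contrast between the two optimization directions concrete, here is a minimal PyTorch-style sketch of how a semipermeable likelihood could be realized; the function name and the specific masking rule (detaching the loss on the end-of-sequence token so it no longer pushes the model toward shorter captions) are illustrative assumptions, not necessarily SMILE's exact formulation.

```python
import torch
import torch.nn.functional as F

def semipermeable_mle_loss(logits, labels, eos_id, pad_id=0):
    """Toy asymmetric MLE: keep the usual likelihood on content tokens, but
    detach the loss on the EOS token so it no longer drives the model toward
    ending captions early (a stand-in for "blocking conciseness optimization").
    Illustrative sketch only, not the paper's exact objective."""
    # per-token cross-entropy; logits: (batch, seq, vocab), labels: (batch, seq)
    ce = F.cross_entropy(logits.transpose(1, 2), labels,
                         ignore_index=pad_id, reduction="none")
    is_eos = labels.eq(eos_id)
    loss = torch.where(is_eos, ce.detach(), ce)  # gradient flows only for content tokens
    return loss[labels.ne(pad_id)].mean()
```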
To help the visually impaired enjoy movies, automatic movie narrating systems are expected to narrate accurate, coherent, and role-aware plots when no actors are speaking. Existing works benchmark this challenge as a standard video captioning task via simplifications such as removing role names and evaluating narrations with n-gram-based metrics, which makes it difficult for automatic systems to meet the needs of real application scenarios. To narrow this gap, we construct a large-scale Chinese movie benchmark named Movie101. Closer to real scenarios, the Movie Clip Narrating (MCN) task in our benchmark asks models to generate role-aware narration paragraphs for complete movie clips in which no actors are speaking. External knowledge, such as role information and movie genres, is also provided for better movie understanding. Besides, we propose a new metric called Movie Narration Score (MNScore) for movie narrating evaluation, which achieves the best correlation with human evaluation. Our benchmark also supports the Temporal Narration Grounding (TNG) task to investigate clip localization given text descriptions. For both tasks, our proposed methods effectively leverage external knowledge and outperform carefully designed baselines. The dataset and code are released at https://github.com/yuezih/Movie101.
Automatic image captioning evaluation is critical for benchmarking and promoting advances in image captioning research. Existing metrics provide only a single score to measure caption quality, which is less explainable and informative. In contrast, humans can easily identify the problems of a caption in detail, e.g., which words are inaccurate and which salient objects are not described, and then rate its quality. To support such informative feedback, we propose an Informative Metric for Reference-free Image Caption evaluation (InfoMetIC). Given an image and a caption, InfoMetIC is able to report incorrect words and unmentioned image regions at a fine-grained level, and to provide a text precision score, a vision recall score, and an overall quality score at a coarse-grained level. The coarse-grained score of InfoMetIC achieves significantly better correlation with human judgements than existing metrics on multiple benchmarks. We also construct a token-level evaluation dataset and demonstrate the effectiveness of InfoMetIC in fine-grained evaluation. Our code and datasets are publicly available at https://github.com/HAWLYQ/InfoMetIC.
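As a simple illustration of how fine-grained judgments could be turned into the coarse-grained scores described above, the sketch below aggregates per-token correctness and per-region coverage into precision, recall, and an overall score; the simple-mean / harmonic-mean aggregation is an assumption for illustration, not InfoMetIC's actual scoring head.

```python
from typing import Dict, List

def aggregate_caption_scores(token_correct: List[float],
                             region_mentioned: List[float]) -> Dict[str, float]:
    """Toy aggregation of fine-grained judgments into coarse scores.
    token_correct[i]: probability that caption token i is correct.
    region_mentioned[j]: probability that salient image region j is described.
    The mean / harmonic-mean combination is illustrative, not InfoMetIC's design."""
    precision = sum(token_correct) / len(token_correct)      # text precision
    recall = sum(region_mentioned) / len(region_mentioned)   # vision recall
    overall = 2 * precision * recall / (precision + recall + 1e-8)
    return {"precision": precision, "recall": recall, "overall": overall}
```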
Alzheimer's Disease (AD) is the world's leading cause of dementia, a progressively impairing condition leading to high hospitalization rates and mortality. To optimize the diagnostic process, numerous efforts have been directed towards the development of deep learning (DL) approaches for automatic AD classification. However, their typical black-box nature has led to low trust and scarce usage within clinical frameworks. In this work, we propose two state-of-the-art DL models, trained respectively on structural MRI (ResNet18) and on brain connectivity matrices derived from diffusion data (BC-GCN-SE). The models were initially evaluated in terms of classification accuracy. Then, the results were analyzed using an Explainable Artificial Intelligence (XAI) approach (Grad-CAM) to measure the level of interpretability of both models. The XAI assessment was conducted across 132 brain parcels, extracted from a combination of the Harvard-Oxford and AAL brain atlases, and compared to well-known pathological regions to measure adherence to domain knowledge. Results highlighted acceptable classification performance compared to the existing literature (ResNet18: median TPR = 0.817, median TNR = 0.816; BC-GCN-SE: median TPR = 0.703, median TNR = 0.738). As evaluated through a statistical test (p < 0.05) and a ranking of the most relevant parcels (top 15%), Grad-CAM revealed the involvement of target brain areas for both the ResNet18 and BC-GCN-SE models: the medial temporal lobe and the default mode network. The obtained interpretability was not without limitations. Nevertheless, the results suggest that combining different imaging modalities may lead to increased classification performance and model reliability. This could potentially boost the confidence placed in DL models and favor their wide applicability as diagnostic aid tools.
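As an illustration of the parcel-level XAI assessment described above, the sketch below averages an already computed, atlas-aligned Grad-CAM volume over labelled brain parcels and keeps the top 15%; the array layout and aggregation rule are assumptions for illustration, not the study's exact pipeline.

```python
import numpy as np

def parcel_relevance(cam: np.ndarray, atlas: np.ndarray, top_frac: float = 0.15):
    """Average a Grad-CAM relevance volume over brain parcels and rank them.
    cam and atlas are 3-D arrays of the same shape; atlas voxels hold integer
    parcel labels (0 = background). The mean aggregation and 15% cut-off echo
    the abstract, but the exact procedure here is an illustrative assumption."""
    labels = np.unique(atlas)
    labels = labels[labels != 0]
    scores = {int(lbl): float(cam[atlas == lbl].mean()) for lbl in labels}
    k = max(1, round(top_frac * len(labels)))
    top_parcels = sorted(scores, key=scores.get, reverse=True)[:k]
    return scores, top_parcels
```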
Visual contents, such as illustrations and images, play an important role in product manual understanding. Existing Product Manual Question Answering (PMQA) datasets tend to ignore visual contents and only retain the textual parts. In this work, to emphasize the importance of multimodal contents, we propose a Multimodal Product Manual Question Answering (MPMQA) task. For each question, MPMQA requires the model not only to process multimodal contents but also to provide multimodal answers. To support MPMQA, we construct a large-scale dataset, PM209, with human annotations, which contains 209 product manuals from 27 well-known consumer electronics brands. The human annotations include 6 types of semantic regions for manual contents and 22,021 question-answer pairs. Notably, each answer consists of a textual sentence and related visual regions from the manuals. Taking into account the length of product manuals and the fact that a question is always related to a small number of pages, MPMQA can be naturally split into two subtasks: retrieving the most relevant pages and then generating multimodal answers. We further propose a unified model that performs these two subtasks jointly and achieves performance comparable to multiple task-specific models. The PM209 dataset is available at https://github.com/AIM3-RUC/MPMQA.
A road network is a critical infrastructure powering many real-life applications, including transportation, mobility, and logistics. To leverage a road network across these different applications, it is necessary to learn vector representations of its roads, a task named \emph{road network representation learning} (RNRL). While several models have been proposed for RNRL, they capture only pairwise relationships/connections among roads (i.e., they treat the road network as a simple graph), and fail to capture high-order relationships (e.g., roads that jointly form a local region usually have similar features such as speed limit) and long-range relationships (e.g., roads that are far apart may have similar semantics, such as being roads in residential areas). Motivated by this, we propose to construct a \emph{hypergraph}, where each hyperedge corresponds to a set of multiple roads forming a region. The constructed hypergraph naturally captures the high-order relationships among roads with hyperedges. We then allow information propagation via both the edges of the simple graph and the hyperedges of the hypergraph in a graph neural network context, and train the model with graph reconstruction, hypergraph reconstruction, and hyperedge classification tasks. The graph reconstruction and hypergraph reconstruction tasks are conventional ones and capture structural information, while the hyperedge classification task captures long-range relationships between pairs of roads that belong to hyperedges with the same label. We call the resulting model \emph{HyperRoad}. We further extend HyperRoad to problem settings where additional inputs of road attributes and/or trajectories generated on the roads are available.
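The sketch below shows one way information could be propagated over both the simple road graph and the region hypergraph within a single layer; the degree normalization and layer design are assumptions for illustration and may differ from HyperRoad's actual architecture.

```python
import torch
import torch.nn as nn

class GraphHypergraphLayer(nn.Module):
    """One propagation step over the simple road graph (adjacency `adj`,
    roads x roads) and the road hypergraph (incidence `inc`, roads x hyperedges).
    A minimal sketch of the idea in the abstract, not HyperRoad's exact layer."""
    def __init__(self, dim):
        super().__init__()
        self.w_graph = nn.Linear(dim, dim)
        self.w_hyper = nn.Linear(dim, dim)

    def forward(self, x, adj, inc):
        # simple-graph propagation: mean over directly connected roads
        msg_graph = (adj @ x) / adj.sum(-1, keepdim=True).clamp(min=1)
        # hypergraph propagation: roads -> regions (hyperedges) -> roads
        region = (inc.t() @ x) / inc.sum(0, keepdim=True).clamp(min=1).t()
        msg_hyper = (inc @ region) / inc.sum(1, keepdim=True).clamp(min=1)
        return torch.relu(self.w_graph(msg_graph) + self.w_hyper(msg_hyper))
```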
Applying reinforcement learning (RL) to traffic signal control (TSC) has become a promising solution. However, most RL-based methods focus solely on optimization within simulators and give little thought to deployment issues in the real world. Online RL-based methods require interaction with the environment, which is severely limited in real-world settings. Additionally, acquiring an offline dataset for offline RL is challenging in the real world. Moreover, most real-world intersections prefer a cyclical phase structure. To address these challenges, we propose: (1) a cyclical offline dataset (COD), designed based on common real-world scenarios to facilitate easy collection; (2) an offline RL model called DataLight, capable of learning satisfactory control strategies from the COD; and (3) a method called Arbitrary To Cyclical (ATC), which can transform most RL-based methods into cyclical signal control. Extensive experiments on simulators with real-world datasets demonstrate that: (1) DataLight outperforms most existing methods and achieves results comparable to the best-performing method; (2) introducing ATC into some recent RL-based methods achieves satisfactory performance; and (3) COD is reliable, with DataLight remaining robust even with a small amount of data. These results suggest that the cyclical offline dataset may be sufficient for offline RL for TSC. Our proposed methods make significant contributions to the TSC field and successfully bridge the gap between simulation experiments and real-world applications. Our code is released on GitHub.
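One plausible reading of the Arbitrary To Cyclical idea is a thin wrapper that maps an agent's arbitrary phase choice onto a fixed cycle, as sketched below; the specific keep-or-advance rule is an assumption for illustration rather than the paper's exact transformation.

```python
def arbitrary_to_cyclical(chosen_phase: int, current_phase: int, num_phases: int) -> int:
    """Map an arbitrary phase choice onto a cyclical phase structure: extend the
    current phase if the agent re-selects it, otherwise advance to the next phase
    in the fixed cycle. Illustrative only; the paper's ATC rule may differ."""
    if chosen_phase == current_phase:
        return current_phase
    return (current_phase + 1) % num_phases
```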
Multimodal processing has attracted much attention lately, especially with the success of pre-training. However, exploration has mainly focused on vision-language pre-training, as introducing more modalities can greatly complicate model design and optimization. In this paper, we extend the state-of-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing. Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and other modalities in addition to the inner characteristics of the audio modality. Moreover, we design an audio type token to dynamically learn different types of audio information for different scenarios, as general audio conveys both verbal and nonverbal heterogeneous information. Our proposed CLIP4VLA model is validated on different downstream tasks, including video retrieval and video captioning, and achieves state-of-the-art performance on the MSR-VTT, VATEX, and AudioCaps benchmark datasets.
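To make the inter-modal and intra-modal contrastive objectives concrete, here is a minimal PyTorch sketch built on a standard symmetric InfoNCE loss; the pairing of modalities and the unit loss weights are illustrative assumptions, not CLIP4VLA's exact training recipe.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings matched by index."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def vla_contrastive_loss(audio, video, text, audio_aug):
    # inter-modal: align audio with video and text embeddings;
    # intra-modal: align two augmented views of the same audio clip
    return info_nce(audio, video) + info_nce(audio, text) + info_nce(audio, audio_aug)
```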
Artificial intelligence (AI) systems have achieved expert-level performance in electrocardiogram (ECG) signal analysis. However, in underdeveloped countries or regions where the healthcare information system is imperfect, only paper ECGs can be provided. Analysis of real-world ECG images (photos or scans of paper ECGs) remains challenging due to complex environments and interference. In this study, we present an AI system developed to detect and screen cardiac abnormalities (CAs) from real-world ECG images. The system was evaluated on a large dataset of 52,357 patients from multiple regions and populations across the world. On the detection task, the AI system obtained areas under the receiver operating characteristic curve (AUC) of 0.996 (hold-out test), 0.994 (external test 1), 0.984 (external test 2), and 0.979 (external test 3). Meanwhile, the detection results of the AI system showed a strong correlation with the diagnoses of cardiologists (cardiologist 1: R = 0.794, p < 1e-3; cardiologist 2: R = 0.812, p < 1e-3). On the screening task, the AI system achieved AUCs of 0.894 (hold-out test) and 0.850 (external test). The screening performance of the AI system was better than that of the cardiologists (AI system: 0.846 vs. cardiologist 1: 0.520 vs. cardiologist 2: 0.480). Our study demonstrates the feasibility of an accurate, objective, easy-to-use, fast, and low-cost AI system for CA detection and screening. The system has the potential to be used by healthcare professionals, caregivers, and general users to assess CAs from real-world ECG images.