
Liang Zhang


AntM$^{2}$C: A Large Scale Dataset For Multi-Scenario Multi-Modal CTR Prediction

Aug 31, 2023
Zhaoxin Huan, Ke Ding, Ang Li, Xiaolu Zhang, Xu Min, Yong He, Liang Zhang, Jun Zhou, Linjian Mo, Jinjie Gu, Zhongyi Liu, Wenliang Zhong, Guannan Zhang


Click-through rate (CTR) prediction is a crucial problem in recommendation systems, and various public CTR datasets have emerged. However, existing datasets suffer from the following limitations. Firstly, users generally click different types of items across multiple scenarios, and modeling multiple scenarios provides a more comprehensive understanding of users; existing datasets only include data for a single type of item from a single scenario. Secondly, multi-modal features are essential in multi-scenario prediction as they address the issue of inconsistent ID encoding between different scenarios; existing datasets are based on ID features and lack multi-modal features. Thirdly, a large-scale dataset can provide a more reliable evaluation of models, fully reflecting the performance differences between them; the scale of existing datasets is around 100 million records, which is relatively small compared to real-world CTR prediction. To address these limitations, we propose AntM$^{2}$C, a Multi-Scenario Multi-Modal CTR dataset based on industrial data from Alipay. Specifically, AntM$^{2}$C provides the following advantages: 1) It covers CTR data for 5 different types of items (advertisements, vouchers, mini-programs, contents, and videos), providing insights into users' preferences across item types. 2) Apart from ID-based features, AntM$^{2}$C also provides 2 multi-modal features, raw text and image features, which can effectively establish connections between items with different IDs. 3) AntM$^{2}$C provides 1 billion CTR records with 200 features, covering 200 million users and 6 million items, making it currently the largest-scale CTR dataset available. Based on AntM$^{2}$C, we construct several typical CTR tasks and provide comparisons with baseline methods. The dataset homepage is available at https://www.atecup.cn/home.
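
To make the multi-modal angle concrete, below is a minimal sketch (not one of the paper's baselines) of a CTR model that combines ID embeddings with pre-extracted text and image features; all field names and dimensions are hypothetical, since the actual AntM$^{2}$C schema is documented on the dataset homepage.

```python
# Minimal sketch: fuse ID embeddings with projected text/image features for CTR.
# Hypothetical dimensions; not the authors' baseline.
import torch
import torch.nn as nn

class MultiModalCTR(nn.Module):
    def __init__(self, n_users, n_items, id_dim=32, mm_dim=128):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, id_dim)
        self.item_emb = nn.Embedding(n_items, id_dim)
        # Project raw text/image features into a shared space so items with
        # different IDs (e.g., from different scenarios) remain comparable.
        self.text_proj = nn.Linear(mm_dim, id_dim)
        self.image_proj = nn.Linear(mm_dim, id_dim)
        self.mlp = nn.Sequential(nn.Linear(4 * id_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, user_id, item_id, text_feat, image_feat):
        x = torch.cat([
            self.user_emb(user_id),
            self.item_emb(item_id),
            self.text_proj(text_feat),
            self.image_proj(image_feat),
        ], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)  # predicted click probability

# Toy usage with random tensors standing in for one mini-batch.
model = MultiModalCTR(n_users=1000, n_items=500)
ctr = model(torch.randint(0, 1000, (8,)), torch.randint(0, 500, (8,)),
            torch.randn(8, 128), torch.randn(8, 128))
print(ctr.shape)  # torch.Size([8])
```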


Graph Edit Distance Learning via Different Attention

Aug 26, 2023
Jiaxi Lv, Liang Zhang, Yi Huang, Jiancheng Huang, Shifeng Chen

Recently, more and more research has focused on using Graph Neural Networks (GNNs) to solve the Graph Similarity Computation (GSC) problem, i.e., computing the Graph Edit Distance (GED) between two graphs. These methods treat GSC as an end-to-end learnable task, and the core of their architecture is the feature fusion module that lets the features of the two graphs interact. Existing methods hold that graph-level embeddings struggle to capture the differences in small local structures between two graphs, and that fine-grained feature fusion on node-level embeddings can therefore improve accuracy, at the cost of greater time and memory consumption during training and inference. In contrast, this paper proposes a novel graph-level fusion module, Different Attention (DiffAtt), and demonstrates that graph-level fusion embeddings can substantially outperform these complex node-level fusion embeddings. We posit that the relative difference structure of the two graphs plays an important role in computing their GED value. To this end, DiffAtt uses the difference between the two graph-level embeddings as an attention mechanism to capture the structural difference between the graphs. Based on DiffAtt, a new GSC method, named Graph Edit Distance Learning via Different Attention (REDRAFT), is proposed, and experimental results demonstrate that REDRAFT achieves state-of-the-art performance on 23 out of 25 metrics across five benchmark datasets. On MSE in particular, it outperforms the second-best method by 19.9%, 48.8%, 29.1%, 31.6%, and 2.2%, respectively. Moreover, we propose a quantitative test, the Remaining Subgraph Alignment Test (RESAT), to verify that among all graph-level fusion modules, the fusion embedding generated by DiffAtt best captures the structural differences between two graphs.
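
The following sketch illustrates one reading of the core DiffAtt idea from the abstract: the difference between two graph-level embeddings is turned into attention weights before regressing a GED score. It is an illustrative approximation under assumed dimensions, not the paper's exact architecture.

```python
# Illustrative difference-attention fusion over two graph-level embeddings.
import torch
import torch.nn as nn

class DiffAttFusion(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.att = nn.Linear(dim, dim)  # maps the embedding difference to attention weights
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, g1, g2):
        # g1, g2: (batch, dim) graph-level embeddings of the two graphs
        diff = g1 - g2                         # relative difference structure
        weights = torch.sigmoid(self.att(diff))
        fused = torch.cat([weights * g1, weights * g2], dim=-1)
        return self.score(fused).squeeze(-1)   # predicted (normalized) GED

fusion = DiffAttFusion()
print(fusion(torch.randn(4, 64), torch.randn(4, 64)).shape)  # torch.Size([4])
```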


Explore and Tell: Embodied Visual Captioning in 3D Environments

Aug 21, 2023
Anwen Hu, Shizhe Chen, Liang Zhang, Qin Jin


While current visual captioning models have achieved impressive performance, they often assume that the image is well-captured and provides a complete view of the scene. In real-world scenarios, however, a single image may not offer a good viewpoint, hindering fine-grained scene understanding. To overcome this limitation, we propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities, enabling them to actively explore the scene and reduce visual ambiguity from suboptimal viewpoints. Specifically, starting at a random viewpoint, an agent must navigate the environment to gather information from different viewpoints and generate a comprehensive paragraph describing all objects in the scene. To support this task, we build the ET-Cap dataset with the Kubric simulator, consisting of 10K 3D scenes with cluttered objects and three annotated paragraphs per scene. We propose a Cascade Embodied Captioning model (CaBOT), which comprises a navigator and a captioner, to tackle this task. The navigator predicts which actions to take in the environment, while the captioner generates a paragraph description based on the whole navigation trajectory. Extensive experiments demonstrate that our model outperforms other carefully designed baselines. Our dataset, code, and models are available at https://aim3-ruc.github.io/ExploreAndTell.
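
A minimal sketch of the two-stage cascade described above, with a navigator that scores actions from observed visual features and a captioner that decodes from the whole trajectory; all interfaces and dimensions are hypothetical stand-ins for the actual CaBOT models released at the project page.

```python
# Illustrative navigator/captioner cascade with hypothetical interfaces.
import torch
import torch.nn as nn

class Navigator(nn.Module):
    def __init__(self, feat_dim=512, n_actions=6):
        super().__init__()
        self.policy = nn.GRU(feat_dim, 256, batch_first=True)
        self.head = nn.Linear(256, n_actions)

    def forward(self, obs_seq):                  # (batch, steps, feat_dim)
        h, _ = self.policy(obs_seq)
        return self.head(h[:, -1])               # logits for the next action

class Captioner(nn.Module):
    def __init__(self, feat_dim=512, vocab=1000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, 256, batch_first=True)
        self.decoder = nn.Linear(256, vocab)     # stand-in for an autoregressive decoder

    def forward(self, traj_feats):               # features from the whole trajectory
        h, _ = self.encoder(traj_feats)
        return self.decoder(h)                   # per-step word logits

nav, cap = Navigator(), Captioner()
obs = torch.randn(2, 5, 512)                     # 5 observations along a toy trajectory
print(nav(obs).shape, cap(obs).shape)            # torch.Size([2, 6]) torch.Size([2, 5, 1000])
```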

* 12 pages; 10 figures; ICCV 2023 

TrajPAC: Towards Robustness Verification of Pedestrian Trajectory Prediction Models

Aug 11, 2023
Liang Zhang, Nathaniel Xu, Pengfei Yang, Gaojie Jin, Cheng-Chao Huang, Lijun Zhang


Robust pedestrian trajectory forecasting is crucial to developing safe autonomous vehicles. Although previous works have studied adversarial robustness in the context of trajectory forecasting, some significant issues remain unaddressed. In this work, we tackle these problems. Firstly, the previous definitions of robustness in trajectory prediction are ambiguous, so we provide formal definitions for two kinds of robustness, namely label robustness and pure robustness. Secondly, as previous works fail to consider robustness with respect to all points in a disturbance interval, we utilise a probably approximately correct (PAC) framework for robustness verification. This framework can not only identify potential counterexamples but also provide interpretable analyses of the original methods. Our approach is implemented in a prototype tool named TrajPAC. With TrajPAC, we evaluate the robustness of four state-of-the-art trajectory prediction models -- Trajectron++, MemoNet, AgentFormer, and MID -- on trajectories from five scenes of the ETH/UCY dataset and scenes of the Stanford Drone Dataset. Using our framework, we also experimentally study various factors that could influence robustness performance.
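
To make the flavor of sampling-based PAC verification concrete, here is a generic check (not the TrajPAC algorithm itself): if no violation is observed over N >= ln(1/delta)/epsilon i.i.d. perturbations, then with confidence 1 - delta the probability of violating the tolerance is at most epsilon. The model, tolerance, and perturbation radius below are hypothetical.

```python
# Generic sampling-based PAC-style robustness check; illustrative only.
import math
import random

def pac_robustness_check(predict, history, reference, radius,
                         tolerance, epsilon=0.01, delta=0.01):
    """predict: maps a perturbed history to a forecast; reference: unperturbed forecast."""
    n_samples = math.ceil(math.log(1.0 / delta) / epsilon)
    for _ in range(n_samples):
        perturbed = [(x + random.uniform(-radius, radius),
                      y + random.uniform(-radius, radius)) for x, y in history]
        forecast = predict(perturbed)
        error = max(math.dist(p, q) for p, q in zip(forecast, reference))
        if error > tolerance:
            return False, perturbed        # counterexample found
    return True, None                      # PAC-robust at (epsilon, delta)

# Toy usage: a constant-velocity "model" whose forecast shifts with the input.
def toy_predict(hist):
    (x0, y0), (x1, y1) = hist[-2], hist[-1]
    return [(x1 + (x1 - x0) * k, y1 + (y1 - y0) * k) for k in range(1, 4)]

hist = [(0.0, 0.0), (1.0, 0.0)]
ok, cex = pac_robustness_check(toy_predict, hist, toy_predict(hist),
                               radius=0.03, tolerance=0.5)
print(ok)
```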

* ICCV 2023 version 

FreeCOS: Self-Supervised Learning from Fractals and Unlabeled Images for Curvilinear Object Segmentation

Jul 14, 2023
Tianyi Shi, Xiaohuan Ding, Liang Zhang, Xin Yang


Curvilinear object segmentation is critical for many applications. However, manually annotating curvilinear objects is very time-consuming and error-prone, leaving insufficient annotated data for existing supervised and domain adaptation methods. This paper proposes a self-supervised curvilinear object segmentation method that learns robust and distinctive features from fractals and unlabeled images (FreeCOS). The key contributions are a novel Fractal-FDA Synthesis (FFS) module and a Geometric Information Alignment (GIA) approach. FFS generates curvilinear structures based on a parametric fractal L-system and integrates the generated structures into unlabeled images via Fourier Domain Adaptation to obtain synthetic training images. GIA reduces the intensity differences between the synthetic and unlabeled images by comparing the intensity order of a given pixel to the values of its nearby neighbors. Such image alignment explicitly removes the dependency on absolute intensity values and enhances the inherent geometric characteristics that are common to both synthetic and real images. In addition, GIA aligns the features of synthetic and real images via a prediction space adaptation loss (PSAL) and a curvilinear mask contrastive loss (CMCL). Extensive experimental results on four public datasets (XCAD, DRIVE, STARE, and CrackTree) demonstrate that our method outperforms state-of-the-art unsupervised, self-supervised, and traditional methods by a large margin. The source code of this work is available at https://github.com/TY-Shi/FreeCOS.
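
The sketch below shows only the Fourier Domain Adaptation step in the spirit of FFS: the low-frequency amplitude of a synthetic image is replaced by that of an unlabeled real image, so the synthetic training image inherits the real image's global appearance. The L-system synthesis and the rest of FreeCOS are omitted; see the linked repository for the authors' implementation.

```python
# Standard low-frequency amplitude swap (Fourier Domain Adaptation); illustrative only.
import numpy as np

def fourier_domain_adaptation(synthetic, real, beta=0.05):
    """synthetic, real: 2-D grayscale arrays of the same shape; beta: band size."""
    fft_syn = np.fft.fftshift(np.fft.fft2(synthetic))
    fft_real = np.fft.fftshift(np.fft.fft2(real))
    amp_syn, pha_syn = np.abs(fft_syn), np.angle(fft_syn)
    amp_real = np.abs(fft_real)

    h, w = synthetic.shape
    b = int(min(h, w) * beta)
    cy, cx = h // 2, w // 2
    # Swap only the centered low-frequency amplitude band.
    amp_syn[cy - b:cy + b, cx - b:cx + b] = amp_real[cy - b:cy + b, cx - b:cx + b]

    adapted = np.fft.ifft2(np.fft.ifftshift(amp_syn * np.exp(1j * pha_syn)))
    return np.real(adapted)

syn = np.random.rand(64, 64)   # stand-in for an L-system rendering on a background
real = np.random.rand(64, 64)  # stand-in for an unlabeled target image
print(fourier_domain_adaptation(syn, real).shape)  # (64, 64)
```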

* Accepted by ICCV 2023 

Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation

Jun 27, 2023
Zihao Yue, Anwen Hu, Liang Zhang, Qin Jin

Image captioning aims to describe visual content in natural language. As 'a picture is worth a thousand words', there can be many correct descriptions for an image. However, with maximum likelihood estimation as the training objective, the captioning model is penalized whenever its prediction mismatches the label. For instance, when the model predicts a word expressing richer semantics than the label, it will be penalized and optimized to prefer more concise expressions, referred to as conciseness optimization. In contrast, predictions that are more concise than the labels lead to richness optimization. Such conflicting optimization directions can eventually result in the model generating general descriptions. In this work, we introduce Semipermeable MaxImum Likelihood Estimation (SMILE), which allows richness optimization while blocking conciseness optimization, thus encouraging the model to generate longer captions with more details. Extensive experiments on two mainstream image captioning datasets, MSCOCO and Flickr30K, demonstrate that SMILE significantly enhances the descriptiveness of generated captions. We further provide in-depth investigations to facilitate a better understanding of how SMILE works.
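
As a rough illustration of selectively blocking part of the MLE signal, the sketch below computes per-token cross-entropy and zeroes out terms flagged by a caller-supplied mask, so blocked positions contribute no gradient. How SMILE actually decides which terms to block is the paper's contribution and is not reproduced here.

```python
# Generic masked MLE: blocked positions contribute no loss and no gradient.
import torch
import torch.nn.functional as F

def masked_mle_loss(logits, labels, keep_mask):
    """logits: (batch, seq, vocab); labels: (batch, seq); keep_mask: (batch, seq) in {0, 1}."""
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1), reduction="none"
    ).reshape(labels.shape)
    kept = per_token * keep_mask
    return kept.sum() / keep_mask.sum().clamp_min(1)

logits = torch.randn(2, 5, 100, requires_grad=True)
labels = torch.randint(0, 100, (2, 5))
keep = torch.ones(2, 5); keep[:, -1] = 0     # e.g., block the final term (caller's choice)
masked_mle_loss(logits, labels, keep).backward()
print(logits.grad[:, -1].abs().sum())        # ~0: blocked positions receive no gradient
```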


Movie101: A New Movie Understanding Benchmark

May 20, 2023
Zihao Yue, Qi Zhang, Anwen Hu, Liang Zhang, Ziheng Wang, Qin Jin


To help the visually impaired enjoy movies, automatic movie narrating systems are expected to narrate accurate, coherent, and role-aware plots when no actors are speaking. Existing works benchmark this challenge as a standard video captioning task via simplifications, such as removing role names and evaluating narrations with n-gram-based metrics, which makes it difficult for automatic systems to meet the needs of real application scenarios. To narrow this gap, we construct a large-scale Chinese movie benchmark, named Movie101. Closer to real scenarios, the Movie Clip Narrating (MCN) task in our benchmark asks models to generate role-aware narration paragraphs for complete movie clips where no actors are speaking. External knowledge, such as role information and movie genres, is also provided for better movie understanding. In addition, we propose a new metric called Movie Narration Score (MNScore) for movie narrating evaluation, which achieves the best correlation with human evaluation. Our benchmark also supports the Temporal Narration Grounding (TNG) task to investigate clip localization given text descriptions. For both tasks, our proposed methods effectively leverage external knowledge and outperform carefully designed baselines. The dataset and code are released at https://github.com/yuezih/Movie101.

* Accepted to ACL 2023 

InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation

May 10, 2023
Anwen Hu, Shizhe Chen, Liang Zhang, Qin Jin


Automatic image captioning evaluation is critical for benchmarking and promoting advances in image captioning research. Existing metrics only provide a single score to measure caption quality, which is less explainable and informative. In contrast, humans can easily identify the problems of a caption in detail, e.g., which words are inaccurate and which salient objects are not described, and then rate the caption quality. To support such informative feedback, we propose an Informative Metric for Reference-free Image Caption evaluation (InfoMetIC). Given an image and a caption, InfoMetIC reports incorrect words and unmentioned image regions at a fine-grained level, and also provides a text precision score, a vision recall score, and an overall quality score at a coarse-grained level. The coarse-grained score of InfoMetIC achieves significantly better correlation with human judgements than existing metrics on multiple benchmarks. We also construct a token-level evaluation dataset and demonstrate the effectiveness of InfoMetIC in fine-grained evaluation. Our code and datasets are publicly available at https://github.com/HAWLYQ/InfoMetIC.
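
As an illustration of the reported score structure only, the sketch below aggregates hypothetical fine-grained judgements into a text precision score, a vision recall score, and an overall score (taken here as their mean, which is an assumption); the actual InfoMetIC model that produces these judgements is available at the repository above.

```python
# Illustrative aggregation of fine-grained judgements into coarse scores.
def aggregate_infometic_style(word_correct, region_mentioned):
    """word_correct, region_mentioned: lists of booleans from a fine-grained judge."""
    text_precision = sum(word_correct) / max(len(word_correct), 1)
    vision_recall = sum(region_mentioned) / max(len(region_mentioned), 1)
    overall = (text_precision + vision_recall) / 2   # assumed combination rule
    return {"text_precision": text_precision,
            "vision_recall": vision_recall,
            "overall": overall}

# Toy example: 4 of 5 caption words judged correct, 2 of 4 salient regions mentioned.
print(aggregate_infometic_style([True, True, True, True, False],
                                [True, False, True, False]))
```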

* Accepted by ACL 2023 main conference 

Biomarker Investigation using Multiple Brain Measures from MRI through XAI in Alzheimer's Disease Classification

May 03, 2023
Davide Coluzzi, Valentina Bordin, Massimo Walter Rivolta, Igor Fortel, Liang Zhang, Alex Leow, Giuseppe Baselli


Alzheimer's Disease (AD) is the world's leading cause of dementia, a progressively impairing condition that leads to high hospitalization rates and mortality. To optimize the diagnostic process, numerous efforts have been directed towards developing deep learning (DL) approaches for automatic AD classification. However, their typically black-box nature has led to low trust and scarce adoption within clinical frameworks. In this work, we propose two state-of-the-art DL models, trained respectively on structural MRI (ResNet18) and brain connectivity matrices (BC-GCN-SE) derived from diffusion data. The models were initially evaluated in terms of classification accuracy. Then, the results were analyzed using an Explainable Artificial Intelligence (XAI) approach (Grad-CAM) to measure the level of interpretability of both models. The XAI assessment was conducted across 132 brain parcels, extracted from a combination of the Harvard-Oxford and AAL brain atlases, and compared to well-known pathological regions to measure adherence to domain knowledge. Results highlighted classification performance acceptable with respect to the existing literature (ResNet18: median TPR = 0.817, median TNR = 0.816; BC-GCN-SE: median TPR = 0.703, median TNR = 0.738). As evaluated through a statistical test (p < 0.05) and a ranking of the most relevant parcels (top 15%), Grad-CAM revealed the involvement of target brain areas for both the ResNet18 and BC-GCN-SE models: the medial temporal lobe and the default mode network. The obtained explanations were not without limitations. Nevertheless, the results suggest that combining different imaging modalities may increase classification performance and model reliability. This could boost the confidence placed in DL models and favor their broad adoption as diagnostic aid tools.
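
For concreteness, here is a minimal Grad-CAM sketch on a 2-D torchvision ResNet18, with an untrained network and a random input standing in for the MRI-trained models analyzed in the study; the resulting heatmap is therefore illustrative only.

```python
# Minimal Grad-CAM: weight last-block activations by pooled gradients of the class score.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None, num_classes=2).eval()   # AD vs. control head
activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["value"] = out.detach()

def bwd_hook(module, grad_in, grad_out):
    gradients["value"] = grad_out[0].detach()

model.layer4.register_forward_hook(fwd_hook)            # last convolutional block
model.layer4.register_full_backward_hook(bwd_hook)

x = torch.randn(1, 3, 224, 224)                         # stand-in for an MRI slice
score = model(x)[0, 1]                                  # score for the "AD" class
score.backward()

weights = gradients["value"].mean(dim=(2, 3), keepdim=True)    # global-average-pooled grads
cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)       # normalize to [0, 1]
print(cam.shape)                                               # torch.Size([1, 1, 224, 224])
```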

* 26 pages, 5 figures 