Bin Wu

Graph Sampling-based Meta-Learning for Molecular Property Prediction

Jun 29, 2023
Xiang Zhuang, Qiang Zhang, Bin Wu, Keyan Ding, Yin Fang, Huajun Chen

Molecular properties are usually observed for only a limited number of samples, so researchers have treated property prediction as a few-shot problem. One important fact ignored by prior works is that each molecule can be recorded with several different properties simultaneously. To effectively exploit the many-to-many correlations between molecules and properties, we propose a Graph Sampling-based Meta-learning (GS-Meta) framework for few-shot molecular property prediction. First, we construct a Molecule-Property relation Graph (MPG): molecules and properties are nodes, and property labels determine the edges. Second, to utilize the topological information of the MPG, we reformulate an episode in meta-learning as a subgraph of the MPG containing a target property node, molecule nodes, and auxiliary property nodes. Third, because episodes in the form of subgraphs are no longer independent of each other, we schedule the subgraph sampling process with a contrastive loss function that considers the consistency and discrimination of subgraphs. Extensive experiments on 5 commonly used benchmarks show that GS-Meta consistently outperforms state-of-the-art methods by 5.71%-6.93% in ROC-AUC and verify the effectiveness of each proposed module. Our code is available at https://github.com/HICAI-ZJU/GS-Meta.
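
To make the episode-as-subgraph idea concrete, below is a minimal Python sketch of building a molecule-property relation graph from a toy label table and sampling one episode around a target property. The data layout, node names, and sampling heuristic are illustrative assumptions; the actual GS-Meta scheduler additionally ranks candidate subgraphs with a contrastive loss.

```python
import random

# Toy label table: labels[m][p] is 0/1 if molecule m is recorded for property p,
# or None if that (molecule, property) pair was never measured.
labels = {
    "mol_1": {"prop_A": 1, "prop_B": 0, "prop_C": None},
    "mol_2": {"prop_A": 0, "prop_B": 1, "prop_C": 1},
    "mol_3": {"prop_A": 1, "prop_B": None, "prop_C": 0},
}

# Molecule-Property relation Graph (MPG): molecules and properties are nodes;
# an edge (m, p) exists whenever molecule m has a recorded label for property p.
edges = [(m, p, y) for m, props in labels.items()
         for p, y in props.items() if y is not None]

def sample_episode(target_prop, n_support=2, n_aux=1, seed=None):
    """Sample one episode as a subgraph around a target property node."""
    rng = random.Random(seed)
    # Molecules connected to the target property become the episode's molecule nodes.
    mols = [m for m, p, _ in edges if p == target_prop]
    support = rng.sample(mols, min(n_support, len(mols)))
    # Auxiliary property nodes: other properties sharing molecules with the target.
    aux = {p for m, p, _ in edges if m in support and p != target_prop}
    aux = rng.sample(sorted(aux), min(n_aux, len(aux)))
    return {"target": target_prop, "molecules": support, "aux_properties": aux}

print(sample_episode("prop_A", seed=0))
```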

* Accepted by IJCAI 2023 

OneShotSTL: One-Shot Seasonal-Trend Decomposition For Online Time Series Anomaly Detection And Forecasting

Apr 04, 2023
Xiao He, Ye Li, Jian Tan, Bin Wu, Feifei Li

Seasonal-trend decomposition is one of the most fundamental concepts in time series analysis that supports various downstream tasks, including time series anomaly detection and forecasting. However, existing decomposition methods rely on batch processing with a time complexity of O(W), where W is the number of data points within a time window. Therefore, they cannot always efficiently support real-time analysis that demands low processing delay. To address this challenge, we propose OneShotSTL, an efficient and accurate algorithm that can decompose time series online with an update time complexity of O(1). OneShotSTL is more than 1,000 times faster than the batch methods, with accuracy comparable to the best counterparts. Extensive experiments on real-world benchmark datasets for downstream time series anomaly detection and forecasting tasks demonstrate that OneShotSTL is from 10 to over 1,000 times faster than the state-of-the-art methods, while still providing comparable or even better accuracy.
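
The essential point is the O(1) update: each new observation adjusts the trend and the seasonal state for its phase without reprocessing the whole window. The sketch below is a simplified exponential-smoothing-style online decomposition that illustrates constant-time updates; it is not the OneShotSTL algorithm itself, and the smoothing constants are arbitrary.

```python
import math
import random

class OnlineDecomposer:
    """Toy online seasonal-trend decomposition with O(1) per-point updates.

    Illustrative only: exponential smoothing of the trend plus one seasonal
    state per phase, not the actual OneShotSTL update rules.
    """

    def __init__(self, period, alpha=0.05, gamma=0.1):
        self.period = period
        self.alpha = alpha              # trend smoothing constant
        self.gamma = gamma              # seasonal smoothing constant
        self.trend = 0.0
        self.season = [0.0] * period
        self.t = 0

    def update(self, y):
        phase = self.t % self.period
        deseason = y - self.season[phase]
        self.trend += self.alpha * (deseason - self.trend)                  # O(1) trend update
        detrend = y - self.trend
        self.season[phase] += self.gamma * (detrend - self.season[phase])   # O(1) seasonal update
        self.t += 1
        residual = y - self.trend - self.season[phase]
        return self.trend, self.season[phase], residual

# Example: decompose a noisy sine wave one point at a time.
dec = OnlineDecomposer(period=24)
for t in range(200):
    y = 0.01 * t + math.sin(2 * math.pi * t / 24) + random.gauss(0, 0.1)
    trend, season, resid = dec.update(y)
```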

* PVLDB 2023 

MDAN: Multi-level Dependent Attention Network for Visual Emotion Analysis

Mar 25, 2022
Liwen Xu, Zhengtao Wang, Bin Wu, Simon Lui

Visual Emotion Analysis (VEA) is attracting increasing attention. One of the biggest challenges of VEA is to bridge the affective gap between visual cues in a picture and the emotion expressed by the picture. As the granularity of emotions increases, the affective gap widens as well. Existing deep approaches try to bridge the gap by directly learning discrimination among emotions globally in one shot, without considering the hierarchical relationship among emotions at different affective levels or the affective level of the emotions to be classified. In this paper, we present the Multi-level Dependent Attention Network (MDAN), with two branches, to leverage the emotion hierarchy and the correlation between different affective levels and semantic levels. The bottom-up branch directly learns emotions at the highest affective level and strictly follows the emotion hierarchy while predicting emotions at lower affective levels. In contrast, the top-down branch attempts to disentangle the affective gap through a one-to-one mapping between semantic levels and affective levels, namely Affective Semantic Mapping. At each semantic level, a local classifier learns discrimination among emotions at the corresponding affective level. We then integrate global learning and local learning into a unified deep framework and optimize the network jointly. Moreover, to properly extract and leverage channel dependencies and spatial attention while disentangling the affective gap, we carefully design two attention modules: the Multi-head Cross Channel Attention module and the Level-dependent Class Activation Map module. Finally, the proposed framework achieves new state-of-the-art performance on six VEA benchmarks, outperforming existing methods by a large margin, e.g., +3.85% in 25-class classification accuracy on the WEBEmo dataset.
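
As one concrete piece, a channel-attention block can be sketched in a few lines of PyTorch. The code below is a generic squeeze-and-excitation-style gate over an (N, C, H, W) feature map, shown only to illustrate channel re-weighting; it is not the paper's Multi-head Cross Channel Attention module.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Generic channel-attention block (squeeze-and-excitation style).

    Illustrative stand-in for weighting channels by relevance; not MDAN's
    exact Multi-head Cross Channel Attention module.
    """

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # squeeze spatial dimensions
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                             # per-channel gate in (0, 1)
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
        return x * w                                  # re-weight channels

feat = torch.randn(2, 64, 14, 14)
out = ChannelAttention(64)(feat)                      # same shape as the input
```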

* Published in CVPR 2022 

Prediction of Depression Severity Based on the Prosodic and Semantic Features with Bidirectional LSTM and Time Distributed CNN

Feb 25, 2022
Kaining Mao, Wei Zhang, Deborah Baofeng Wang, Ang Li, Rongqi Jiao, Yanhui Zhu, Bin Wu, Tiansheng Zheng, Lei Qian, Wei Lyu, Minjie Ye, Jie Chen

Depression is increasingly impacting individuals both physically and psychologically worldwide. It has become a major global public health problem and attracts attention from various research fields. Traditionally, depression is diagnosed through semi-structured interviews and supplementary questionnaires, which makes the diagnosis heavily reliant on physicians' experience and subject to bias. Mental health monitoring and cloud-based remote diagnosis can be implemented through an automated depression diagnosis system. In this article, we propose an attention-based multimodal speech and text representation for depression prediction. Our model is trained to estimate the depression severity of participants using the Distress Analysis Interview Corpus-Wizard of Oz (DAIC-WOZ) dataset. For the audio modality, we use the collaborative voice analysis repository (COVAREP) features provided by the dataset and employ a Bidirectional Long Short-Term Memory network (Bi-LSTM) followed by a Time-distributed Convolutional Neural Network (T-CNN). For the text modality, we use global vectors for word representation (GloVe) to perform word embeddings, which are fed into the Bi-LSTM network. Results show that both the audio and text models perform well on the depression severity estimation task, with a best sequence-level F1 score of 0.9870 and a patient-level F1 score of 0.9074 for the audio model over five classes (healthy, mild, moderate, moderately severe, and severe), and a sequence-level F1 score of 0.9709 and a patient-level F1 score of 0.9245 for the text model over the same five classes. Results are similar for the fused multimodal model, with a highest F1 score of 0.9580 on the patient-level depression detection task over five classes. Experiments show statistically significant improvements over previous work.
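
The audio branch described above, a Bi-LSTM whose outputs are processed by a convolution along the time axis, can be sketched roughly as follows in PyTorch. Layer sizes and the classification head are placeholders; the paper's actual architecture and hyperparameters differ.

```python
import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    """Rough sketch of a Bi-LSTM followed by a time-wise CNN over severity classes.

    Placeholder sizes for illustration; not the paper's exact architecture.
    """

    def __init__(self, n_features=74, hidden=128, n_classes=5):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.conv = nn.Sequential(                    # convolution over the time axis
            nn.Conv1d(2 * hidden, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):                             # x: (batch, time, n_features)
        h, _ = self.lstm(x)                           # (batch, time, 2 * hidden)
        h = self.conv(h.transpose(1, 2)).squeeze(-1)  # (batch, 64)
        return self.head(h)                           # logits over 5 severity classes

logits = AudioBranch()(torch.randn(4, 100, 74))       # e.g. 100 frames of acoustic features
```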

* 15 pages, 7 figures, already accepted by IEEE Transactions on Affective Computing, listed in early access now 

Relation-aware Hierarchical Attention Framework for Video Question Answering

May 14, 2021
Fangtao Li, Ting Bai, Chenyu Cao, Zihe Liu, Chenghao Yan, Bin Wu

Video Question Answering (VideoQA) is a challenging video understanding task, since it requires a deep understanding of both the question and the video. Previous studies mainly focus on extracting sophisticated visual and language embeddings and fusing them with delicately hand-crafted networks. However, the relevance of different frames, objects, and modalities to the question varies over time, which most existing methods ignore. This lack of understanding of the dynamic relationships and interactions among objects poses a great challenge to the VideoQA task. To address this problem, we propose a novel Relation-aware Hierarchical Attention (RHA) framework to learn both the static and dynamic relations of the objects in videos. In particular, videos and questions are first embedded by pre-trained models to obtain visual and textual features. Then a graph-based relation encoder is used to extract the static relationships between visual objects. To capture the dynamic changes of multimodal objects across video frames, we consider the temporal, spatial, and semantic relations and fuse the multimodal features with a hierarchical attention mechanism to predict the answer. We conduct extensive experiments on a large-scale VideoQA dataset, and the experimental results demonstrate that RHA outperforms the state-of-the-art methods.
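
A minimal sketch of question-guided attention over frame features is shown below; it only illustrates the fusion idea, whereas the full RHA framework additionally models temporal, spatial, and semantic relations with a graph-based relation encoder and a hierarchy of attention layers. All dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    """Attend over frame features using the question embedding as the query.

    A minimal illustration of attention-based fusion, not the full RHA model.
    """

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, frames, question):               # frames: (B, T, D), question: (B, D)
        q = question.unsqueeze(1).expand(-1, frames.size(1), -1)
        attn = F.softmax(self.score(torch.cat([frames, q], dim=-1)), dim=1)  # (B, T, 1)
        return (attn * frames).sum(dim=1)               # question-conditioned video summary

video = torch.randn(2, 20, 256)                         # 20 frames, 256-d features
question = torch.randn(2, 256)
summary = QuestionGuidedAttention(256)(video, question)  # (2, 256)
```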

* 9 pages, This paper is accepted by ICMR 2021 

Frame Aggregation and Multi-Modal Fusion Framework for Video-Based Person Recognition

Oct 19, 2020
Fangtao Li, Wenzhe Wang, Zihe Liu, Haoran Wang, Chenghao Yan, Bin Wu

Video-based person recognition is challenging because persons are often occluded or blurred and shooting angles vary. Previous research has mostly focused on person recognition in still images, ignoring the similarity and continuity between video frames. To tackle these challenges, we propose a novel Frame Aggregation and Multi-Modal Fusion (FAMF) framework for video-based person recognition, which aggregates face features and combines them with multi-modal information to identify persons in videos. For frame aggregation, we propose a novel trainable layer based on NetVLAD (named AttentionVLAD), which takes an arbitrary number of features as input and computes a fixed-length aggregated feature based on feature quality. We show that introducing an attention mechanism into NetVLAD effectively decreases the impact of low-quality frames. For the multi-modal information in videos, we propose a Multi-Layer Multi-Modal Attention (MLMA) module to learn the correlation across modalities by adaptively updating a Gram matrix. Experimental results on the iQIYI-VID-2019 dataset show that our framework outperforms other state-of-the-art methods.
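
The aggregation idea can be illustrated with a simple attention pooling over a variable number of face features, where each feature receives a learned quality weight. This is a simplified stand-in for AttentionVLAD, not the NetVLAD-based layer from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooling(nn.Module):
    """Pool an arbitrary number of per-frame face features into one vector.

    Each feature gets a learned quality score, so low-quality frames receive
    small weights. A simplified stand-in for the attention idea in AttentionVLAD.
    """

    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, feats):                           # feats: (num_frames, dim)
        weights = F.softmax(self.scorer(feats), dim=0)  # (num_frames, 1)
        return (weights * feats).sum(dim=0)             # fixed-length aggregate (dim,)

faces = torch.randn(7, 512)                             # 7 detected faces in a video
identity_vec = AttentionPooling(512)(faces)             # single 512-d descriptor
```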

* Accepted by MMM 2021 

Alternating minimization for a single step TV-Stokes model for image denoising

Sep 29, 2020
Bin Wu, Xue-Cheng Tai, Talal Rahman

The paper presents a fully coupled TV-Stokes model and proposes an algorithm based on alternating minimization of the objective functional, whose first iteration is exactly the modified TV-Stokes model proposed earlier. The model is a generalization of the second-order Total Generalized Variation model. A convergence analysis is given.
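
Alternating minimization here simply means cycling between the two blocks of unknowns, solving one sub-problem with the other block held fixed. A generic skeleton of that loop is sketched below; the two sub-solvers are hypothetical placeholders, not the model's actual sub-problems.

```python
import numpy as np

def alternating_minimization(f, solve_field, solve_image, n_iter=50, tol=1e-6):
    """Generic two-block alternating-minimization loop.

    solve_field and solve_image are hypothetical placeholders standing in for
    the model's two sub-problems (smoothing the vector field, reconstructing
    the image); they are not the paper's actual solvers.
    """
    u = np.array(f, dtype=float)           # current image estimate
    for _ in range(n_iter):
        tau = solve_field(u, f)            # minimize over the field, image fixed
        u_new = solve_image(tau, f)        # minimize over the image, field fixed
        if np.linalg.norm(u_new - u) < tol * np.linalg.norm(u):
            u = u_new
            break
        u = u_new
    return u
```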

Multidimensional TV-Stokes for image processing

Sep 28, 2020
Bin Wu, Xue-Cheng Tai, Talal Rahman

A complete multidimensional TV-Stokes model is proposed, based on smoothing a gradient field in the first step and reconstructing the multidimensional image from the smoothed gradient field in the second step. It is the correct extension of the original two-dimensional TV-Stokes model to multiple dimensions. A numerical algorithm using Chambolle's semi-implicit dual formula is proposed. Numerical results for denoising 3D images and movies are presented; they show excellent performance in avoiding the staircase effect and preserving fine structures.
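
For reference, Chambolle's semi-implicit dual iteration for plain two-dimensional TV denoising can be written in a few lines of NumPy, as below. This classic scalar case is only a building block: extending it to the gradient fields and higher dimensions of the TV-Stokes model requires more machinery, and the step size and iteration count here are illustrative.

```python
import numpy as np

def grad(u):
    """Forward-difference gradient with Neumann boundary (2D)."""
    gx = np.zeros_like(u)
    gy = np.zeros_like(u)
    gx[:-1, :] = u[1:, :] - u[:-1, :]
    gy[:, :-1] = u[:, 1:] - u[:, :-1]
    return gx, gy

def div(px, py):
    """Divergence, the negative adjoint of grad."""
    dx = np.zeros_like(px)
    dy = np.zeros_like(py)
    dx[0, :] = px[0, :]; dx[1:-1, :] = px[1:-1, :] - px[:-2, :]; dx[-1, :] = -px[-2, :]
    dy[:, 0] = py[:, 0]; dy[:, 1:-1] = py[:, 1:-1] - py[:, :-2]; dy[:, -1] = -py[:, -2]
    return dx + dy

def chambolle_tv_denoise(f, lam=0.1, tau=0.125, n_iter=100):
    """Classic Chambolle dual iteration for 2D TV denoising: u = f - lam * div(p)."""
    px = np.zeros_like(f)
    py = np.zeros_like(f)
    for _ in range(n_iter):
        gx, gy = grad(div(px, py) - f / lam)
        norm = np.sqrt(gx ** 2 + gy ** 2)
        px = (px + tau * gx) / (1.0 + tau * norm)      # semi-implicit dual update
        py = (py + tau * gy) / (1.0 + tau * norm)
    return f - lam * div(px, py)

noisy = np.random.rand(64, 64)
clean = chambolle_tv_denoise(noisy, lam=0.2)
```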

Sparse-data based 3D surface reconstruction with vector matching

Sep 28, 2020
Bin Wu, Xue-Cheng Tai, Talal Rahman

This paper considers three-dimensional surface reconstruction from sparse two-dimensional information, namely a small number of level lines of a surface with moderately complex structure containing both structured and unstructured geometries. A new model is proposed based on normal vector matching combined with first-order and second-order total variation regularizers. A fast algorithm based on the augmented Lagrangian is also proposed. Numerical experiments show the effectiveness of the model and the algorithm in reconstructing surfaces with detailed features and complex structures, for both synthetic and real-world digital maps.
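
To illustrate the augmented Lagrangian idea in its simplest setting, the sketch below applies ADMM-style variable splitting to one-dimensional TV denoising. It is a generic illustration, not the paper's surface-reconstruction algorithm, which matches normal vectors and uses both first-order and second-order regularizers.

```python
import numpy as np

def admm_tv_1d(f, lam=1.0, rho=1.0, n_iter=200):
    """Augmented-Lagrangian (ADMM) sketch for 1D TV denoising:
    minimize 0.5 * ||u - f||^2 + lam * ||D u||_1 via the splitting z = D u.

    Generic illustration of the augmented Lagrangian idea only.
    """
    n = len(f)
    D = np.diff(np.eye(n), axis=0)                 # (n-1) x n forward-difference matrix
    A = np.eye(n) + rho * D.T @ D                  # system matrix of the u-update
    z = np.zeros(n - 1)
    y = np.zeros(n - 1)                            # scaled dual variable
    u = f.copy()
    for _ in range(n_iter):
        u = np.linalg.solve(A, f + rho * D.T @ (z - y))          # quadratic u-update
        w = D @ u + y
        z = np.sign(w) * np.maximum(np.abs(w) - lam / rho, 0.0)  # soft-thresholding
        y += D @ u - z                             # dual ascent
    return u

# Example: denoise a noisy piecewise-constant signal.
signal = np.concatenate([np.zeros(50), np.ones(50)]) + 0.1 * np.random.randn(100)
smooth = admm_tv_1d(signal, lam=0.5)
```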
