
Ming Wang


MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

Oct 13, 2023
Xiaocui Yang, Wenfang Wu, Shi Feng, Ming Wang, Daling Wang, Yang Li, Qi Sun, Yifei Zhang, Xiaoming Fu, Soujanya Poria


The popularity of multimodal large language models (MLLMs) has triggered a recent surge in research efforts dedicated to evaluating these models. Nevertheless, existing evaluation studies of MLLMs primarily focus on the comprehension and reasoning of unimodal (vision) content, neglecting performance evaluations in the domain of multimodal (vision-language) content understanding. Beyond multimodal reasoning, tasks related to multimodal content comprehension necessitate a profound understanding of multimodal contexts, achieved through multimodal interaction to arrive at a final answer. In this paper, we introduce a comprehensive assessment framework called MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions across a wide spectrum of multimodal content comprehension tasks. Consequently, our work complements research on the performance of MLLMs in multimodal comprehension tasks, achieving a more comprehensive and holistic evaluation of MLLMs. To begin, we employ the Best Performance metric to ascertain each model's performance upper bound on different datasets. Subsequently, the Mean Relative Gain metric offers an assessment of the overall performance of various models and instructions, while the Stability metric measures their sensitivity. Furthermore, previous research centers on evaluating models independently or solely assessing instructions, neglecting the adaptability between models and instructions. We propose the Adaptability metric to quantify the adaptability between models and instructions. Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions per task, and derives novel insights. Our code will be released at https://github.com/declare-lab/MM-BigBench.
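As a rough illustration of two of these metrics (the abstract does not give formulas, so this sketch assumes one plausible reading, not the paper's exact definitions): Mean Relative Gain can be read as a model's score gain relative to the per-dataset mean across models, averaged over datasets, and Stability as the spread of one model's scores across instructions.

```python
# Hypothetical sketch, assuming scores[model][dataset] holds one
# accuracy per (model, dataset) pair. Not the MM-BigBench code.

def mean_relative_gain(scores: dict, model: str) -> float:
    """Average, over datasets, of a model's gain relative to the
    per-dataset mean score across all models."""
    datasets = next(iter(scores.values())).keys()
    gains = []
    for d in datasets:
        mean_d = sum(m_scores[d] for m_scores in scores.values()) / len(scores)
        gains.append((scores[model][d] - mean_d) / mean_d)
    return sum(gains) / len(gains)

def stability(instruction_scores: list) -> float:
    """Standard deviation of one model's scores across instructions;
    a lower value indicates lower sensitivity to the instruction."""
    mean = sum(instruction_scores) / len(instruction_scores)
    var = sum((s - mean) ** 2 for s in instruction_scores) / len(instruction_scores)
    return var ** 0.5
```

Under this reading, a positive Mean Relative Gain means the model beats the field on average, and a Stability of zero means the model's score is unaffected by the choice of instruction.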

* Under review

T-COL: Generating Counterfactual Explanations for General User Preferences on Variable Machine Learning Systems

Sep 28, 2023
Ming Wang, Daling Wang, Wenfang Wu, Shi Feng, Yifei Zhang


Machine learning (ML) based systems suffer from a lack of interpretability. To address this problem, counterfactual explanations (CEs) have been proposed. CEs are unique as they provide workable suggestions to users, in addition to explaining why a certain outcome was predicted. However, the application of CEs has been hindered by two main challenges, namely general user preferences and variable ML systems. User preferences, in particular, tend to be general rather than specific feature values. Additionally, CEs need to be customized to suit the variability of ML models, while also maintaining robustness even when these validation models change. To overcome these challenges, we propose several possible general user preferences that have been validated by user research and map them to the properties of CEs. We also introduce a new method called Tree-based Conditions Optional Links (T-COL), which has two optional structures and several groups of conditions for generating CEs that can be adapted to general user preferences. Meanwhile, a specific group of conditions leads T-COL to generate more robust CEs that have higher validity when the ML model is replaced. We experimentally compared the properties of CEs generated by T-COL under different user preferences and demonstrated that T-COL is better suited for accommodating user preferences and variable ML systems compared to baseline methods, including Large Language Models.


RoCar: A Relationship Network-based Evaluation Method to Large Language Models

Jul 29, 2023
Ming Wang, Wenfang Wu, Chongyun Gao, Daling Wang, Shi Feng, Yifei Zhang


Large language models (LLMs) have received increasing attention. However, due to the complexity of their capabilities, how to rationally evaluate LLMs remains an open problem. We propose the RoCar method, which uses predefined basic schemas to randomly construct a task graph and generates natural language evaluation tasks from the task graph to evaluate the reasoning and memory abilities of LLMs, respectively. Because of the very large randomness of the task construction process, it is possible to ensure that none of the LLMs under test has directly learned the evaluation tasks, guaranteeing the fairness of the evaluation method.
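The construction idea can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: the relation names in SCHEMAS and the question template are invented for the example; the point is that answers are grounded in a freshly sampled graph, so no model can have memorized the exact task.

```python
import random

SCHEMAS = ["friend", "colleague", "neighbor"]  # assumed example relations

def build_task_graph(people, n_edges, seed=0):
    """Randomly instantiate a relationship graph from the basic schemas."""
    rng = random.Random(seed)
    edges = []
    for _ in range(n_edges):
        a, b = rng.sample(people, 2)
        edges.append((a, rng.choice(SCHEMAS), b))
    return edges

def make_question(edges, rng=None):
    """Phrase a natural language task over one sampled edge; the answer
    is determined by the graph, not by any public text."""
    rng = rng or random.Random(0)
    a, rel, b = rng.choice(edges)
    return f"Who is the {rel} of {a}?", b
```

Multi-hop variants (e.g., following a path of two relations) would probe reasoning rather than memory, matching the abstract's distinction between the two abilities.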


G-STO: Sequential Main Shopping Intention Detection via Graph-Regularized Stochastic Transformer

Jun 25, 2023
Yuchen Zhuang, Xin Shen, Yan Zhao, Chaosheng Dong, Ming Wang, Jin Li, Chao Zhang

Sequential recommendation requires understanding the dynamic patterns of users' behaviors, contexts, and preferences from their historical interactions. Most existing works focus on modeling user-item interactions only at the item level, ignoring that they are driven by latent shopping intentions (e.g., ballpoint pens, miniatures, etc.). The detection of the underlying shopping intentions of users based on their historical interactions is a crucial aspect for e-commerce platforms, such as Amazon, to enhance the convenience and efficiency of their customers' shopping experiences. Despite its significance, the area of main shopping intention detection remains under-investigated in the academic literature. To fill this gap, we propose a graph-regularized stochastic Transformer method, G-STO. By considering intentions as sets of products and user preferences as compositions of intentions, we model both of them as stochastic Gaussian embeddings in the latent representation space. Instead of training the stochastic representations from scratch, we develop a global intention relational graph as prior knowledge for regularization, allowing relevant shopping intentions to be distributionally close. Finally, we feed the newly regularized stochastic embeddings into Transformer-based models to encode sequential information from the intention transitions. We evaluate our main shopping intention identification model on three different real-world datasets, where G-STO significantly outperforms the baselines by 18.08% in Hit@1, 7.01% in Hit@10, and 6.11% in NDCG@10 on average.
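A hedged sketch of the graph regularization idea (the penalty form and all names here are assumptions, not the paper's code): model each intention as a diagonal Gaussian and pull the endpoints of each edge in the prior intention graph distributionally close, e.g., via a symmetric KL divergence.

```python
import math

def kl_diag_gauss(mu1, s1, mu2, s2):
    """KL divergence between two diagonal Gaussians given as lists of
    per-dimension means and standard deviations."""
    return 0.5 * sum(
        math.log(b * b / (a * a)) + (a * a + (m - n) ** 2) / (b * b) - 1.0
        for m, a, n, b in zip(mu1, s1, mu2, s2)
    )

def graph_regularizer(embeddings, edges):
    """Symmetric KL summed over the edges of the intention relation
    graph; related intentions incur a penalty unless their Gaussian
    embeddings are distributionally close."""
    total = 0.0
    for i, j in edges:
        (mu_i, s_i), (mu_j, s_j) = embeddings[i], embeddings[j]
        total += kl_diag_gauss(mu_i, s_i, mu_j, s_j)
        total += kl_diag_gauss(mu_j, s_j, mu_i, s_i)
    return total
```

Adding this term to the training loss keeps, say, "pens" and "ink" close in distribution while leaving unrelated intentions free to drift apart.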


Text Is All You Need: Learning Language Representations for Sequential Recommendation

May 23, 2023
Jiacheng Li, Ming Wang, Jin Li, Jinmiao Fu, Xin Shen, Jingbo Shang, Julian McAuley


Sequential recommendation aims to model dynamic user behavior from historical interactions. Existing methods rely on either explicit item IDs or general textual features for sequence modeling to understand user preferences. While promising, these approaches still struggle to model cold-start items or transfer knowledge to new datasets. In this paper, we propose to model user preferences and item features as language representations that can be generalized to new items and datasets. To this end, we present a novel framework, named Recformer, which effectively learns language representations for sequential recommendation. Specifically, we propose to formulate an item as a "sentence" (word sequence) by flattening item key-value attributes described by text so that an item sequence for a user becomes a sequence of sentences. For recommendation, Recformer is trained to understand the "sentence" sequence and retrieve the next "sentence". To encode item sequences, we design a bi-directional Transformer similar to the model Longformer but with different embedding layers for sequential recommendation. For effective representation learning, we propose novel pretraining and finetuning methods which combine language understanding and recommendation tasks. Therefore, Recformer can effectively recommend the next item based on language representations. Extensive experiments conducted on six datasets demonstrate the effectiveness of Recformer for sequential recommendation, especially in low-resource and cold-start settings.
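The "item as sentence" flattening can be sketched in a few lines (the attribute names below are illustrative, not taken from the Recformer datasets):

```python
# Minimal sketch of formulating an item as a "sentence": flatten its
# key-value attributes into one word sequence, so a user's interaction
# history becomes a sequence of such sentences.

def item_to_sentence(item: dict) -> str:
    return " ".join(f"{key} {value}" for key, value in item.items())

def history_to_sentences(items: list) -> list:
    return [item_to_sentence(it) for it in items]
```

The resulting sentence sequences are what the bi-directional Transformer would consume; because everything is text, the same representation transfers to cold-start items and new datasets.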

* accepted to KDD 2023 

Binary stochasticity enabled highly efficient neuromorphic deep learning achieves better-than-software accuracy

Apr 25, 2023
Yang Li, Wei Wang, Ming Wang, Chunmeng Dou, Zhengyu Ma, Huihui Zhou, Peng Zhang, Nicola Lepri, Xumeng Zhang, Qing Luo, Xiaoxin Xu, Guanhua Yang, Feng Zhang, Ling Li, Daniele Ielmini, Ming Liu


Deep learning needs high-precision handling of forwarding signals, backpropagating errors, and weight updates. This is inherently required by the learning algorithm since the gradient descent learning rule relies on the chain product of partial derivatives. However, such precision is challenging to achieve in hardware systems that use noisy analog memristors as artificial synapses, and it is not biologically plausible. Memristor-based implementations generally result in an excessive cost of neuronal circuits and stringent demands for idealized synaptic devices. Here, we demonstrate that the requirement for high precision is not necessary and that more efficient deep learning can be achieved when this requirement is lifted. We propose a binary stochastic learning algorithm that modifies all elementary neural network operations by introducing (i) stochastic binarization of both the forwarding signals and the activation function derivatives, (ii) signed binarization of the backpropagating errors, and (iii) step-wised weight updates. Through an extensive hybrid approach of software simulation and hardware experiments, we find that binary stochastic deep learning systems can provide better performance than the software-based benchmarks using the high-precision learning algorithm. Also, the binary stochastic algorithm strongly simplifies the neural network operations in hardware, resulting in an improvement of the energy efficiency for the multiply-and-accumulate operations by more than three orders of magnitude.
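The three modifications can be sketched as follows. This is a minimal software illustration under assumed conventions (signals normalized to [0, 1], a fixed step size), not the paper's hardware implementation:

```python
import random

def stochastic_binarize(x, rng):
    """(i) Stochastic binarization: emit 1 with probability x, for a
    forwarding signal or activation derivative x assumed in [0, 1]."""
    return 1 if rng.random() < x else 0

def sign_binarize(err):
    """(ii) Signed binarization of a backpropagating error: keep only
    its sign (-1, 0, or +1)."""
    return (err > 0) - (err < 0)

def stepwise_update(w, err, step=0.01):
    """(iii) Step-wised weight update: move by a fixed step against the
    signed error direction instead of a high-precision gradient."""
    return w - step * sign_binarize(err)
```

Because every signal is a single bit and every update is a fixed step, the multiply-and-accumulate operations reduce to counting, which is what makes the hardware mapping so efficient.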


DyGait: Exploiting Dynamic Representations for High-performance Gait Recognition

Mar 27, 2023
Ming Wang, Xianda Guo, Beibei Lin, Tian Yang, Zheng Zhu, Lincheng Li, Shunli Zhang, Xin Yu


Gait recognition is a biometric technology that recognizes the identity of humans through their walking patterns. Compared with other biometric technologies, gait recognition is more difficult to disguise and can be applied at long distances without the cooperation of subjects. Thus, it has unique potential and broad applications in crime prevention and social security. At present, most gait recognition methods directly extract features from the video frames to establish representations. However, these architectures learn representations from different features equally and do not pay enough attention to dynamic features, which represent the dynamic parts of silhouettes over time (e.g., legs). Since dynamic parts of the human body are more informative than other parts (e.g., bags) during walking, in this paper, we propose a novel and high-performance framework named DyGait. This is the first gait recognition framework designed to focus on the extraction of dynamic features. Specifically, to take full advantage of the dynamic information, we propose a Dynamic Augmentation Module (DAM), which can automatically establish spatial-temporal feature representations of the dynamic parts of the human body. The experimental results show that our DyGait network outperforms other state-of-the-art gait recognition methods. It achieves an average Rank-1 accuracy of 71.4% on the GREW dataset, 66.3% on the Gait3D dataset, 98.4% on the CASIA-B dataset and 98.3% on the OU-MVLP dataset.
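One plausible, simplified reading of "dynamic parts" (the actual DAM learns these spatial-temporal representations; this stand-in merely illustrates the intuition) is the per-pixel temporal change of a silhouette sequence: swinging legs change between frames, while a carried bag barely does.

```python
# Illustrative only: mean absolute temporal difference over a sequence
# of binary silhouette frames, each a list of equal-length rows.
# High values mark dynamic regions (legs); near-zero values mark
# static regions (torso, bags).

def dynamic_map(frames):
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        diffs.append([[abs(c - p) for c, p in zip(row_c, row_p)]
                      for row_c, row_p in zip(cur, prev)])
    n = len(diffs)
    h, w = len(frames[0]), len(frames[0][0])
    return [[sum(d[i][j] for d in diffs) / n for j in range(w)]
            for i in range(h)]
```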


Deep Baseline Network for Time Series Modeling and Anomaly Detection

Sep 10, 2022
Cheng Ge, Xi Chen, Ming Wang, Jin Wang


Deep learning has seen increasing applications to time series in recent years. In time series anomaly detection scenarios, such as finance, the Internet of Things, and data center operations, time series usually exhibit very flexible baselines depending on various external factors. Anomalies reveal themselves by lying far away from the baseline. However, detection is not always easy due to several challenges, including baseline shifting, lack of labels, noise interference, real-time detection on streaming data, and result interpretability. In this paper, we develop a novel deep architecture to properly extract the baseline from time series, namely the Deep Baseline Network (DBLN). Using this deep network, we can easily locate the baseline position and then provide reliable and interpretable anomaly detection results. Empirical evaluation on both synthetic and public real-world datasets shows that our purely unsupervised algorithm achieves superior performance compared with state-of-the-art methods and is well suited to practical applications.
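As a hedged, much simpler stand-in for the learned baseline extraction (DBLN itself is a deep network), the underlying idea can be illustrated with a moving-median baseline and a distance threshold:

```python
# Illustrative sketch: extract a baseline with a moving median, then
# flag points lying farther than a threshold from it. The window and
# threshold values are assumptions for the example.

def moving_median(series, window=5):
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(sorted(series[lo:hi])[(hi - lo) // 2])
    return out

def detect_anomalies(series, window=5, threshold=3.0):
    baseline = moving_median(series, window)
    return [i for i, (x, b) in enumerate(zip(series, baseline))
            if abs(x - b) > threshold]
```

The interpretability claim follows the same shape: each flagged point comes with the baseline value it deviated from, so the result explains itself.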


GaitGL: Learning Discriminative Global-Local Feature Representations for Gait Recognition

Aug 02, 2022
Beibei Lin, Shunli Zhang, Ming Wang, Lincheng Li, Xin Yu


Existing gait recognition methods either directly establish Global Feature Representation (GFR) from original gait sequences or generate Local Feature Representation (LFR) from several local parts. However, GFR tends to neglect local details of human postures as the receptive fields become larger in the deeper network layers. Although LFR allows the network to focus on the detailed posture information of each local region, it neglects the relations among different local parts and thus only exploits limited local information of several specific regions. To solve these issues, we propose a global-local based gait recognition network, named GaitGL, to generate more discriminative feature representations. To be specific, a novel Global and Local Convolutional Layer (GLCL) is developed to take full advantage of both global visual information and local region details in each layer. GLCL is a dual-branch structure that consists of a GFR extractor and a mask-based LFR extractor. The GFR extractor aims to extract contextual information, e.g., the relationships among various body parts, while the mask-based LFR extractor exploits the detailed posture changes of local regions. In addition, we introduce a novel mask-based strategy to improve the local feature extraction capability. Specifically, we design pairs of complementary masks to randomly occlude feature maps, and then train our mask-based LFR extractor on various occluded feature maps. In this manner, the LFR extractor learns to fully exploit local information. Extensive experiments demonstrate that GaitGL achieves better performance than state-of-the-art gait recognition methods. The average rank-1 accuracy on CASIA-B, OU-MVLP, GREW and Gait3D is 93.6%, 98.7%, 68.0% and 63.8%, respectively, significantly outperforming the competing methods. The proposed method has won the first prize in two competitions: HID 2020 and HID 2021.
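The complementary-mask strategy can be sketched as follows. This is a minimal illustration under assumed shapes (2-D maps of scalars); the real masks operate on convolutional feature maps inside the network:

```python
import random

def complementary_masks(height, width, seed=0):
    """Generate a pair of binary masks M and 1 - M: every position is
    occluded in exactly one of the two masks."""
    rng = random.Random(seed)
    m = [[rng.randint(0, 1) for _ in range(width)] for _ in range(height)]
    m_comp = [[1 - v for v in row] for row in m]
    return m, m_comp

def apply_mask(feature_map, mask):
    """Element-wise occlusion of a feature map by a binary mask."""
    return [[f * v for f, v in zip(f_row, m_row)]
            for f_row, m_row in zip(feature_map, mask)]
```

Training on both occluded copies forces the LFR extractor to recover discriminative cues from whichever regions survive, so no single local region is relied on exclusively.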


EasyNLP: A Comprehensive and Easy-to-use Toolkit for Natural Language Processing

Apr 30, 2022
Chengyu Wang, Minghui Qiu, Taolin Zhang, Tingting Liu, Lei Li, Jianing Wang, Ming Wang, Jun Huang, Wei Lin


The success of Pre-Trained Models (PTMs) has reshaped the development of Natural Language Processing (NLP). Yet, it is not easy for industrial practitioners to obtain high-performing models and deploy them online. To bridge this gap, EasyNLP is designed to make it easy to build NLP applications, supporting a comprehensive suite of NLP algorithms. It further features knowledge-enhanced pre-training, knowledge distillation and few-shot learning functionalities for large-scale PTMs, and provides a unified framework for model training, inference and deployment in real-world applications. Currently, EasyNLP has powered over ten business units within Alibaba Group and is seamlessly integrated with the Platform of AI (PAI) products on Alibaba Cloud. The source code of our EasyNLP toolkit is released at GitHub (https://github.com/alibaba/EasyNLP).

* 8 pages 