Jing Gao

Zhejiang University

Advanced Unstructured Data Processing for ESG Reports: A Methodology for Structured Transformation and Enhanced Analysis

Jan 04, 2024

Jiahui Peng, Jing Gao, Xin Tong, Jing Guo, Hang Yang, Jianchuan Qi, Ruiqiao Li, Nan Li, Ming Xu

Abstract: In the evolving field of corporate sustainability, analyzing unstructured Environmental, Social, and Governance (ESG) reports is a complex challenge due to their varied formats and intricate content. This study introduces an innovative methodology utilizing the "Unstructured Core Library", specifically tailored to address these challenges by transforming ESG reports into structured, analyzable formats. Our approach significantly advances the existing research by offering high-precision text cleaning, adept identification and extraction of text from images, and standardization of tables within these reports. Emphasizing its capability to handle diverse data types, including text, images, and tables, the method adeptly manages the nuances of differing page layouts and report styles across industries. This research marks a substantial contribution to the fields of industrial ecology and corporate sustainability assessment, paving the way for the application of advanced NLP technologies and large language models in the analysis of corporate governance and sustainability. Our code is available at https://github.com/linancn/TianGong-AI-Unstructure.git.

Towards Poisoning Fair Representations

Sep 28, 2023

Tianci Liu, Haoyu Wang, Feijie Wu, Hengtong Zhang, Pan Li, Lu Su, Jing Gao

Abstract: Fair machine learning seeks to mitigate model prediction bias against certain demographic subgroups, such as the elderly and women. Recently, fair representation learning (FRL) trained by deep neural networks has demonstrated superior performance, whereby representations containing no demographic information are inferred from the data and then used as the input to classification or other downstream tasks. Despite the development of FRL methods, their vulnerability to data poisoning attacks, a popular protocol for benchmarking model robustness in adversarial scenarios, is under-explored. Data poisoning attacks have been developed for classical fair machine learning methods, which incorporate fairness constraints into shallow-model classifiers. Nonetheless, these attacks fall short in FRL due to notably different fairness goals and model architectures. This work proposes the first data poisoning framework attacking FRL. We induce the model to output unfair representations that contain as much demographic information as possible by injecting carefully crafted poisoning samples into the training data. This attack entails a prohibitively expensive bilevel optimization, so an effective approximate solution is proposed. A theoretical analysis of the number of poisoning samples needed is derived and sheds light on defending against the attack. Experiments on benchmark fairness datasets and state-of-the-art fair representation learning models demonstrate the superiority of our attack.

Can LLMs like GPT-4 outperform traditional AI tools in dementia diagnosis? Maybe, but not today

Jun 02, 2023

Zhuo Wang, Rongzhen Li, Bowen Dong, Jie Wang, Xiuxing Li, Ning Liu, Chenhui Mao, Wei Zhang, Liling Dong, Jing Gao(+1 more)

Abstract: Recent investigations show that large language models (LLMs), specifically GPT-4, not only have remarkable capabilities in common Natural Language Processing (NLP) tasks but also exhibit human-level performance on various professional and academic benchmarks. However, whether GPT-4 can be directly used in practical applications and replace traditional artificial intelligence (AI) tools in specialized domains requires further experimental validation. In this paper, we explore the potential of LLMs such as GPT-4 to outperform traditional AI tools in dementia diagnosis. Comprehensive comparisons between GPT-4 and traditional AI tools are conducted to examine their diagnostic accuracy in a clinical setting. Experimental results on two real clinical datasets show that, although LLMs like GPT-4 demonstrate potential for future advancements in dementia diagnosis, they currently do not surpass the performance of traditional AI tools. The interpretability and faithfulness of GPT-4 are also evaluated by comparison with real doctors. We discuss the limitations of GPT-4 in its current state and propose future research directions to enhance GPT-4 in dementia diagnosis.

* 16 pages, 6 figures

Behavioral Machine Learning? Computer Predictions of Corporate Earnings also Overreact

Mar 25, 2023

Murray Z. Frank, Jing Gao, Keer Yang

Abstract: There is considerable evidence that machine learning algorithms have better predictive abilities than humans in various financial settings. However, the literature has not tested whether these algorithmic predictions are more rational than human predictions. We study the predictions of corporate earnings from several algorithms, notably linear regressions and a popular algorithm called Gradient Boosted Regression Trees (GBRT). On average, GBRT outperformed both linear regressions and human stock analysts, but it still overreacted to news and did not satisfy rational expectations as normally defined. Reducing the learning rate can minimize the magnitude of overreaction, but at the cost of poorer out-of-sample prediction accuracy. Human stock analysts who have been trained in machine learning methods overreact less than traditionally trained analysts. Additionally, stock analyst predictions reflect information not otherwise available to machine algorithms.

* stock analysts, machine learning, behavioral, overreaction

SimFair: A Unified Framework for Fairness-Aware Multi-Label Classification

Feb 22, 2023

Tianci Liu, Haoyu Wang, Yaqing Wang, Xiaoqian Wang, Lu Su, Jing Gao

Abstract: Recent years have witnessed increasing concerns about unfair decisions made by machine learning algorithms. To improve fairness in model decisions, various fairness notions have been proposed and many fairness-aware methods have been developed. However, most existing definitions and methods focus only on single-label classification. Fairness for multi-label classification, where each instance is associated with more than one label, has yet to be established. To fill this gap, we study fairness-aware multi-label classification in this paper. We start by extending Demographic Parity (DP) and Equalized Opportunity (EOp), two popular fairness notions, to multi-label classification scenarios. Through a systematic study, we show that on multi-label data, because of unevenly distributed labels, EOp usually fails to construct a reliable estimate on labels with few instances. We then propose a new framework named Similarity $s$-induced Fairness ($s_\gamma$-SimFair). This new framework utilizes data that have similar labels when estimating fairness on a particular label group for better stability, and can unify DP and EOp. Theoretical analysis and experimental results on real-world datasets together demonstrate the advantage of $s_\gamma$-SimFair over existing methods on multi-label classification tasks.

* AAAI2023
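
The point about EOp being unreliable on labels with few instances can be illustrated with a minimal sketch. It assumes a binary sensitive attribute, the standard per-label definitions of the DP and EOp gaps, and a small made-up dataset; the function names and toy data are hypothetical and are not taken from the paper, and this is not the $s_\gamma$-SimFair estimator itself.

```python
from typing import Optional, Sequence


def positive_rate(preds: Sequence[int], keep: Sequence[bool]) -> Optional[float]:
    """Fraction of positive predictions among the instances selected by `keep`."""
    selected = [p for p, k in zip(preds, keep) if k]
    return sum(selected) / len(selected) if selected else None


def dp_gap(preds: Sequence[int], groups: Sequence[int]) -> Optional[float]:
    """Demographic Parity gap for one label: |P(pred=1 | a=0) - P(pred=1 | a=1)|."""
    r0 = positive_rate(preds, [g == 0 for g in groups])
    r1 = positive_rate(preds, [g == 1 for g in groups])
    return None if r0 is None or r1 is None else abs(r0 - r1)


def eop_gap(preds: Sequence[int], labels: Sequence[int],
            groups: Sequence[int]) -> Optional[float]:
    """Equalized Opportunity gap: the same difference, restricted to true positives."""
    r0 = positive_rate(preds, [g == 0 and y == 1 for g, y in zip(groups, labels)])
    r1 = positive_rate(preds, [g == 1 and y == 1 for g, y in zip(groups, labels)])
    return None if r0 is None or r1 is None else abs(r0 - r1)


# Toy multi-label data (made up): 8 instances, one frequent label and one rare label.
groups = [0, 0, 0, 0, 1, 1, 1, 1]
y_true = {"frequent": [1, 1, 1, 0, 1, 1, 0, 1], "rare": [1, 0, 0, 0, 0, 0, 0, 1]}
y_pred = {"frequent": [1, 1, 0, 0, 1, 0, 0, 0], "rare": [1, 0, 0, 0, 0, 0, 0, 0]}

for name in y_true:
    positives = sum(y_true[name])
    print(name, "positives:", positives,
          "DP gap:", dp_gap(y_pred[name], groups),
          "EOp gap:", eop_gap(y_pred[name], y_true[name], groups))
```

On the rare label, each group contributes a single positive instance, so the EOp gap is computed from two predictions in total, while the DP gap still uses all eight instances.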

Multi-rater Prism: Learning self-calibrated medical image segmentation from multiple raters

Dec 01, 2022

Junde Wu, Huihui Fang, Yehui Yang, Yuanpei Liu, Jing Gao, Lixin Duan, Weihua Yang, Yanwu Xu

Abstract: In medical image segmentation, it is often necessary to collect opinions from multiple experts to make the final decision. This clinical routine helps to mitigate individual bias. But when data is annotated by multiple raters, standard deep learning models are often not applicable. In this paper, we propose a novel neural network framework, called Multi-Rater Prism (MrPrism), to learn medical image segmentation from multiple labels. Inspired by iterative half-quadratic optimization, MrPrism combines the multi-rater confidence assignment task and the calibrated segmentation task in a recurrent manner. In this recurrent process, MrPrism learns inter-observer variability while taking into account the image semantic properties, and finally converges to a self-calibrated segmentation result reflecting the inter-observer agreement. Specifically, we propose Converging Prism (ConP) and Diverging Prism (DivP) to process the two tasks iteratively. ConP learns calibrated segmentation based on the multi-rater confidence maps estimated by DivP. DivP generates multi-rater confidence maps based on the segmentation masks estimated by ConP. The experimental results show that by recurrently running ConP and DivP, the two tasks can achieve mutual improvement. The final converged segmentation result of MrPrism outperforms state-of-the-art (SOTA) strategies on a wide range of medical image segmentation tasks.

Towards Reliable Item Sampling for Recommendation Evaluation

Nov 28, 2022

Dong Li, Ruoming Jin, Zhenming Liu, Bin Ren, Jing Gao, Zhi Liu

Abstract: Since Rendle and Krichene argued that commonly used sampling-based evaluation metrics are "inconsistent" with respect to the global metrics (even in expectation), there have been a few studies on sampling-based recommender system evaluation. Existing methods try either to map the sampling-based metrics to their global counterparts or, more generally, to learn the empirical rank distribution to estimate the top-$K$ metrics. However, despite existing efforts, there is still a lack of rigorous theoretical understanding of the proposed metric estimators, and basic item sampling also suffers from the "blind spot" issue, i.e., the estimation error in recovering the top-$K$ metrics when $K$ is small can still be rather substantial. In this paper, we provide an in-depth investigation into these problems and make two innovative contributions. First, we propose a new item-sampling estimator that explicitly optimizes the error with respect to the ground truth, and theoretically highlight its subtle difference from prior work. Second, we propose a new adaptive sampling method which aims to deal with the "blind spot" problem, and also demonstrate that the expectation-maximization (EM) algorithm can be generalized for such a setting. Our experimental results confirm our statistical analysis and the superiority of the proposed methods. This study helps lay the theoretical foundation for adopting item sampling metrics for recommendation evaluation, and provides strong evidence towards making item sampling a powerful and reliable tool for recommendation evaluation.

* AAAI2023
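
The gap between sampled and global top-$K$ metrics can be seen in a small simulation. This is a minimal sketch assuming uniform sampling of negatives without replacement and a hit-rate metric; the function names, the catalogue size, and the toy ranks are hypothetical, and the code shows only the basic item-sampling setup, not the estimators or the adaptive sampling method proposed in the paper.

```python
import random
from typing import List, Sequence


def sampled_rank(global_rank: int, num_items: int, num_sampled: int,
                 rng: random.Random) -> int:
    """Rank of the target among itself plus `num_sampled` uniformly drawn negatives.

    Items are identified by their global rank (1 = best). A drawn negative
    outranks the target only if its global rank is smaller.
    """
    negatives = [r for r in range(1, num_items + 1) if r != global_rank]
    drawn = rng.sample(negatives, num_sampled)
    return 1 + sum(1 for r in drawn if r < global_rank)


def hit_rate_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of targets ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


NUM_ITEMS = 1000    # hypothetical catalogue size
NUM_SAMPLED = 99    # negatives sampled per target
K = 10

# Made-up global ranks of the held-out target item for six users.
global_ranks: List[int] = [1, 5, 20, 50, 300, 800]

rng = random.Random(0)
sampled_ranks = [sampled_rank(r, NUM_ITEMS, NUM_SAMPLED, rng) for r in global_ranks]

print("global  HR@10:", hit_rate_at_k(global_ranks, K))
print("sampled HR@10:", hit_rate_at_k(sampled_ranks, K))
```

Because only a subset of the negatives competes with the target, the sampled rank can never exceed the global rank, so the sampled hit rate at a fixed $K$ is at least as large as the global one. Recovering the global metric from sampled ranks is the estimation problem the abstract refers to.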

AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning

Nov 02, 2022

Yaqing Wang, Sahaj Agarwal, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao, Ahmed Hassan Awadallah, Jianfeng Gao

Abstract: Standard fine-tuning of large pre-trained language models (PLMs) for downstream tasks requires updating hundreds of millions to billions of parameters and storing a large copy of the PLM weights for every task, resulting in increased cost for storing, sharing and serving the models. To address this, parameter-efficient fine-tuning (PEFT) techniques were introduced, where small trainable components are injected into the PLM and updated during fine-tuning. We propose AdaMix as a general PEFT method that tunes a mixture of adaptation modules (given the underlying PEFT method of choice) introduced in each Transformer layer while keeping most of the PLM weights frozen. For instance, AdaMix can leverage a mixture of adapters like Houlsby or a mixture of low-rank decomposition matrices like LoRA to improve downstream task performance over the corresponding PEFT methods for fully supervised and few-shot NLU and NLG tasks. Further, we design AdaMix such that it matches the computational cost and the number of tunable parameters of the underlying PEFT method. By only tuning 0.1-0.2% of PLM parameters, we show that AdaMix outperforms SOTA parameter-efficient fine-tuning and full model fine-tuning for both NLU and NLG tasks.

* The paper is withdrawn to avoid a duplicate version of arXiv article 2205.12410. We will include new content as an updated version

Temporal Spatial Decomposition and Fusion Network for Time Series Forecasting

Oct 06, 2022

Liwang Zhou, Jing Gao

Abstract: Feature engineering is required to obtain better results in time series forecasting, and decomposition is a crucial part of it. A single decomposition approach often cannot serve numerous forecasting tasks, since standard time series decomposition lacks flexibility and robustness. Traditional feature selection relies heavily on preexisting domain knowledge, has no generic methodology, and requires a lot of labor. In addition, most time series prediction models based on deep learning suffer from interpretability issues, so their "black box" results lead to a lack of confidence. Addressing these issues is the motivation of this work. We propose TSDFNet, a neural network with a self-decomposition mechanism and an attentive feature fusion mechanism. It abandons feature engineering as a preprocessing convention and instead integrates it as an internal module of the deep model. The self-decomposition mechanism gives TSDFNet extensible and adaptive decomposition capabilities for any time series: users can choose their own basis functions to decompose the sequence into temporal and generalized spatial dimensions. The attentive feature fusion mechanism captures the importance of external variables and their causality with target variables. It automatically suppresses unimportant features while enhancing effective ones, so that users do not have to struggle with feature selection. Moreover, TSDFNet makes it easy to look into the "black box" of the deep neural network through feature visualization and to analyze the prediction results. We demonstrate performance improvements over existing widely accepted models on more than a dozen datasets, and three experiments showcase the interpretability of TSDFNet.

* 10 pages

An Efficient Person Clustering Algorithm for Open Checkout-free Groceries

Aug 05, 2022

Junde Wu, Yu Zhang, Rao Fu, Yuanpei Liu, Jing Gao

Abstract: An open checkout-free grocery is a grocery store where customers never have to wait in line to check out. Developing such a system is not trivial, since it must recognize a dynamic and massive flow of people. In particular, a clustering method that can efficiently assign each snapshot to the corresponding customer is essential for the system. To address the unique challenges of the open checkout-free grocery, we propose an efficient and effective person clustering method. Specifically, we first propose a Crowded Sub-Graph (CSG) to localize the relationships among massive and continuous data streams. CSG is constructed by the proposed Pick-Link-Weight (PLW) strategy, which picks the nodes based on time-space information, links the nodes via trajectory information, and weighs the links by the proposed von Mises-Fisher (vMF) similarity metric. Then, to ensure that the method adapts to the dynamic and unseen person flow, we propose a Graph Convolutional Network (GCN) with a simple Nearest Neighbor (NN) strategy to accurately cluster the instances of CSG. GCN is adopted to project the features into a low-dimensional separable space, and NN is able to quickly produce a result in this space under dynamic person flow. The experimental results show that the proposed method outperforms alternative algorithms in this scenario. In practice, the whole system has been implemented and deployed in several real-world open checkout-free groceries.
