"Topic": models, code, and papers

Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes

Apr 13, 2021
Jie Song, Yeye He

Complex data pipelines are increasingly common in diverse applications such as BI reporting and ML modeling. These pipelines often recur regularly (e.g., daily or weekly), as BI reports need to be refreshed and ML models need to be retrained. However, it is widely reported that in complex production pipelines, upstream data feeds can change in unexpected ways, causing downstream applications to break silently in ways that are expensive to resolve. Data validation has thus become an important topic, as evidenced by notable recent efforts from Google and Amazon, where the objective is to catch data quality issues early as they arise in the pipelines. Our experience on production data suggests, however, that on string-valued data, these existing approaches yield high false-positive rates and frequently require human intervention. In this work, we develop a corpus-driven approach to auto-validate machine-generated data by inferring suitable data-validation "patterns" that accurately describe the underlying data domain, which minimizes false positives while maximizing data quality issues caught. Evaluations using production data from real data lakes suggest that Auto-Validate is substantially more effective than existing methods. Part of this technology ships as an Auto-Tag feature in Microsoft Azure Purview.

* full version of a SIGMOD 2021 paper 
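As a rough illustration of what a data-validation "pattern" for string-valued data could look like, the sketch below generalizes historical values into coarse regular expressions and checks new values against them. This is not the Auto-Validate inference algorithm, which the abstract does not specify; the function names (`generalize`, `infer_patterns`, `is_valid`) and the generalization rule (digit runs and letter runs become character classes, everything else stays literal) are assumptions made for this example.

```python
import re

# Hypothetical tokenization rule: runs of digits, runs of ASCII letters,
# or any other single character (kept as a literal).
TOKEN = re.compile(r"(?P<digits>\d+)|(?P<letters>[A-Za-z]+)|(?P<other>.)", re.S)


def generalize(value):
    """Turn one string into a coarse regex pattern describing its shape."""
    parts = []
    for match in TOKEN.finditer(value):
        kind = match.lastgroup
        if kind == "digits":
            parts.append(r"\d+")
        elif kind == "letters":
            parts.append("[A-Za-z]+")
        else:
            parts.append(re.escape(match.group()))
    return "".join(parts)


def infer_patterns(values):
    """Collect the set of coarse patterns seen in historical values."""
    return {generalize(v) for v in values}


def is_valid(value, patterns):
    """A new value passes if it fully matches any previously seen pattern."""
    return any(re.fullmatch(p, value) for p in patterns)


history = ["2021-04-13", "2020-11-09", "2020-03-31"]
patterns = infer_patterns(history)
print(is_valid("2019-01-02", patterns))
print(is_valid("Apr 13, 2021", patterns))
```

A pattern that is too narrow raises false alarms on legitimate new values, while one that is too broad misses real problems; choosing the right level of generality is the trade-off the abstract describes.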


Human Activity Analysis and Recognition from Smartphones using Machine Learning Techniques

Mar 30, 2021
Jakaria Rabbi, Md. Tahmid Hasan Fuad, Md. Abdul Awal

Human Activity Recognition (HAR) has been considered a valuable research topic over the last few decades. Different types of machine learning models are used for this purpose, as part of analyzing human behavior through machines. Analyzing data from wearable sensors is not a trivial task, because the data are complex and high-dimensional. Nowadays, researchers mostly use smartphones or smart home sensors to capture these data. In our paper, we analyze these data using machine learning models to recognize human activities; such recognition is now widely used for many purposes, such as physical and mental health monitoring. We apply different machine learning models and compare their performance. We use Logistic Regression (LR) as the benchmark model for its simplicity and excellent performance on a dataset, and for comparison we take Decision Tree (DT), Support Vector Machine (SVM), Random Forest (RF), and Artificial Neural Network (ANN). Additionally, we select the best set of parameters for each model by grid search. We use the HAR dataset from the UCI Machine Learning Repository as a standard dataset to train and test the models. Throughout the analysis, we can see that the Support Vector Machine (average accuracy 96.33%) performed far better than the other methods. We also show that the results are statistically significant by employing statistical significance tests.

* Submitted to the 10th International Conference on Informatics, Electronics & Vision (ICIEV), 2021 


Large-Scale Training System for 100-Million Classification at Alibaba

Feb 09, 2021
Liuyihan Song, Pan Pan, Kang Zhao, Hao Yang, Yiming Chen, Yingya Zhang, Yinghui Xu, Rong Jin

In the last decades, extreme classification has become an essential topic for deep learning. It has achieved great success in many areas, especially in computer vision and natural language processing (NLP). However, it is very challenging to train a deep model with millions of classes due to the memory and computation explosion in the last output layer. In this paper, we propose a large-scale training system to address these challenges. First, we build a hybrid parallel training framework to make the training process feasible. Second, we propose a novel softmax variation named KNN softmax, which reduces both the GPU memory consumption and computation costs and improves the throughput of training. Then, to eliminate the communication overhead, we propose a new overlapping pipeline and a gradient sparsification method. Furthermore, we design a fast continuous convergence strategy to reduce total training iterations by adaptively adjusting the learning rate and updating model parameters. With the help of all the proposed methods, we gain 3.9× throughput in our training system and reduce training iterations by almost 60%. The experimental results show that, using an in-house 256-GPU cluster, we could train a classifier of 100 million classes on the Alibaba Retail Product Dataset in about five days while achieving accuracy comparable to the naive softmax training process.

* Accepted by KDD 2020. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2020) 


ExpFinder: An Ensemble Expert Finding Model Integrating $N$-gram Vector Space Model and $μ$CO-HITS

Jan 18, 2021
Yong-Bin Kang, Hung Du, Abdur Rahim Mohammad Forkan, Prem Prakash Jayaraman, Amir Aryani, Timos Sellis

Finding an expert plays a crucial role in driving successful collaborations and speeding up high-quality research development and innovations. However, the rapid growth of scientific publications and digital expertise data makes identifying the right experts a challenging problem. Existing approaches for finding experts given a topic can be categorised into information retrieval techniques based on vector space models, document language models, and graph-based models. In this paper, we propose ExpFinder, a new ensemble model for expert finding that integrates a novel $N$-gram vector space model, denoted as $n$VSM, and a graph-based model, denoted as $\mu$CO-HITS, which is a proposed variation of the CO-HITS algorithm. The key idea of $n$VSM is to exploit a recent inverse document frequency weighting method for $N$-gram words, and ExpFinder incorporates $n$VSM into $\mu$CO-HITS to achieve expert finding. We comprehensively evaluate ExpFinder on four different datasets from academic domains in comparison with six different expert finding models. The evaluation results show that ExpFinder is a highly effective model for expert finding, substantially outperforming all the compared models by 19% to 160.2%.

* 15 pages, 18 figures, "for source code on Github, see", "Submitted to IEEE Transactions on Knowledge and Data Engineering" 


Language Through a Prism: A Spectral Approach for Multiscale Language Representations

Nov 09, 2020
Alex Tamkin, Dan Jurafsky, Noah Goodman

Language exhibits structure at different scales, ranging from subwords to words, sentences, paragraphs, and documents. To what extent do deep models capture information at these scales, and can we force them to better capture structure across this hierarchy? We approach this question by focusing on individual neurons, analyzing the behavior of their activations at different timescales. We show that signal processing provides a natural framework for separating structure across scales, enabling us to 1) disentangle scale-specific information in existing embeddings and 2) train models to learn more about particular scales. Concretely, we apply spectral filters to the activations of a neuron across an input, producing filtered embeddings that perform well on part of speech tagging (word-level), dialog speech acts classification (utterance-level), or topic classification (document-level), while performing poorly on the other tasks. We also present a prism layer for training models, which uses spectral filters to constrain different neurons to model structure at different scales. Our proposed BERT + Prism model can better predict masked tokens using long-range context and produces multiscale representations that perform better at utterance- and document-level tasks. Our methods are general and readily applicable to other domains besides language, such as images, audio, and video.

* NeurIPS 2020 
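The idea of applying a spectral filter to one neuron's activations across an input can be sketched in a few lines. The example below is a minimal illustration, assuming a plain discrete Fourier transform and an ideal band-pass mask; the abstract does not say which transform or filter shape the authors use, and the names `dft`, `idft`, and `band_pass` are made up for this sketch.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft (also O(n^2))."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def band_pass(activations, low, high):
    """Keep only frequency bins whose index lies in [low, high].

    `activations` is one neuron's activation at each token position.
    Bin k and its mirror n - k are treated as the same frequency so the
    filtered signal stays real-valued.
    """
    n = len(activations)
    spectrum = dft(activations)
    kept = [c if low <= min(k, n - k) <= high else 0 for k, c in enumerate(spectrum)]
    return [value.real for value in idft(kept)]


# Toy activation sequence: a constant (slow) part plus an alternating (fast) part.
activations = [1 + (-1) ** t for t in range(8)]
slow_part = band_pass(activations, 0, 0)  # lowest-frequency component only
fast_part = band_pass(activations, 4, 4)  # highest-frequency component only
```

Low-frequency bands change slowly across tokens and so would correspond to longer-range structure (document level in the abstract's terms), while high-frequency bands vary from token to token (word level).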


Sampled Nonlocal Gradients for Stronger Adversarial Attacks

Nov 05, 2020
Leo Schwinn, Daniel Tenbrinck, An Nguyen, René Raab, Martin Burger, Bjoern Eskofier

The vulnerability of deep neural networks to small and even imperceptible perturbations has become a central topic in deep learning research. The evaluation of new defense mechanisms for these so-called adversarial attacks has proven to be challenging. Although several sophisticated defense mechanisms were introduced, most of them were later shown to be ineffective. However, a reliable evaluation of model robustness is mandatory for deployment in safety-critical real-world scenarios. We propose a simple yet effective modification to the gradient calculation of state-of-the-art first-order adversarial attacks, which increases their success rate and thus leads to more accurate robustness estimates. Normally, the gradient update of an attack is directly calculated for the given data point. In general, this approach is sensitive to noise and small local optima of the loss function. Inspired by gradient sampling techniques from non-convex optimization, we propose to calculate the gradient direction of the adversarial attack as the weighted average over multiple points in the local vicinity. We empirically show that by incorporating this additional gradient information, we are able to give a more accurate estimation of the global descent direction on noisy and non-convex loss surfaces. Additionally, we show that the proposed method achieves higher success rates than a variety of state-of-the-art attacks on the benchmark datasets MNIST, Fashion-MNIST, and CIFAR10.
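The core step, replacing the gradient at a single point with a weighted average of gradients at nearby points, can be sketched as follows. This is a minimal illustration on a toy one-dimensional loss, assuming uniform sampling in a box around the point and equal weights by default; the sampling distribution, the weights, and the names `sampled_nonlocal_gradient` and `toy_grad` are assumptions for this example, not details taken from the paper.

```python
import math
import random


def sampled_nonlocal_gradient(grad_fn, x, radius, num_samples, rng, weights=None):
    """Weighted average of grad_fn over points sampled near x.

    Each sample is drawn uniformly from the box [x_i - radius, x_i + radius]
    (an assumed sampling scheme). With radius 0 this reduces to the ordinary
    gradient at x.
    """
    if weights is None:
        weights = [1.0 / num_samples] * num_samples
    averaged = [0.0] * len(x)
    for weight in weights:
        point = [xi + rng.uniform(-radius, radius) for xi in x]
        gradient = grad_fn(point)
        for i, g in enumerate(gradient):
            averaged[i] += weight * g
    return averaged


def toy_grad(point):
    """Gradient of the toy loss x^2 + 0.1 * sin(50 x): a smooth bowl plus ripples."""
    x = point[0]
    return [2 * x + 5 * math.cos(50 * x)]


rng = random.Random(0)
local = toy_grad([1.0])
nonlocal_estimate = sampled_nonlocal_gradient(toy_grad, [1.0], 0.5, 2000, rng)
print(local, nonlocal_estimate)
```

On this toy loss the ripple term dominates the pointwise gradient, while the averaged gradient stays close to the slope of the underlying bowl (2x), which is the kind of smoothing effect the abstract attributes to the method.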


Domain Adaptive Transfer Learning on Visual Attention Aware Data Augmentation for Fine-grained Visual Categorization

Oct 06, 2020
Ashiq Imran, Vassilis Athitsos

Fine-Grained Visual Categorization (FGVC) is a challenging topic in computer vision. It is a problem characterized by large intra-class differences and subtle inter-class differences. In this paper, we tackle this problem in a weakly supervised manner, where neural network models are fed additional data using a data augmentation technique through a visual attention mechanism. We perform domain adaptive knowledge transfer via fine-tuning on our base network model. We perform our experiments on six challenging and commonly used FGVC datasets, and we show competitive improvements in accuracy by using attention-aware data augmentation techniques with features derived from the deep learning model InceptionV3, pre-trained on large-scale datasets. Our method outperforms competitor methods on multiple FGVC datasets and shows competitive results on other datasets. Experimental studies show that transfer learning from large-scale datasets can be utilized effectively with visual attention based data augmentation, which can obtain state-of-the-art results on several FGVC datasets. We present a comprehensive analysis of our experiments. Our method achieves state-of-the-art results on multiple fine-grained classification datasets, including the challenging CUB200-2011 bird, Flowers-102, and FGVC-Aircrafts datasets.

* Will be published in ISVC 2020 
* 18 pages, 12 figures, 4 tables 


Understanding the temporal evolution of COVID-19 research through machine learning and natural language processing

Jul 22, 2020
Ashkan Ebadi, Pengcheng Xi, Stéphane Tremblay, Bruce Spencer, Raman Pall, Alexander Wong

The outbreak of the novel coronavirus disease 2019 (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has been continuously affecting human lives and communities around the world in many ways, from cities under lockdown to new social experiences. Although in most cases COVID-19 results in mild illness, it has drawn global attention due to the extremely contagious nature of SARS-CoV-2. Governments and healthcare professionals, along with people and society as a whole, have taken measures to break the chain of transmission and flatten the epidemic curve. In this study, we used multiple data sources, i.e., PubMed and ArXiv, and built several machine learning models to characterize the landscape of current COVID-19 research by identifying the latent topics and analyzing the temporal evolution of the extracted research themes, publication similarity, and sentiments within the time frame of January-May 2020. Our findings confirm that the types of research available in PubMed and ArXiv differ significantly, with the former exhibiting greater diversity in terms of COVID-19 related issues and the latter focusing more on intelligent systems/tools to predict/diagnose COVID-19. The special attention of the research community to high-risk groups and people with complications was also confirmed.


Informative Scene Decomposition for Crowd Analysis, Comparison and Simulation Guidance

Apr 29, 2020
Feixiang He, Yuanhang Xiang, Xi Zhao, He Wang

Crowd simulation is a central topic in several fields including graphics. To achieve high-fidelity simulations, data has been increasingly relied upon for analysis and simulation guidance. However, the information in real-world data is often noisy, mixed and unstructured, which makes effective analysis difficult; as a result, such data has not been fully utilized. With the fast-growing volume of crowd data, such a bottleneck needs to be addressed. In this paper, we propose a new framework which comprehensively tackles this problem. It centers on an unsupervised method for analysis. The method takes as input raw and noisy data with highly mixed multi-dimensional (space, time and dynamics) information, and automatically structures it by learning the correlations among these dimensions. The dimensions together with their correlations fully describe the scene semantics, which consist of recurring activity patterns in a scene, manifested as space flows with temporal and dynamics profiles. The effectiveness and robustness of the analysis have been tested on datasets with great variations in volume, duration, environment and crowd dynamics. Based on the analysis, new methods for data visualization, simulation evaluation and simulation guidance are also proposed. Together, our framework establishes a highly automated pipeline from raw data to crowd analysis, comparison and simulation guidance. Extensive experiments and evaluations have been conducted to show the flexibility, versatility and intuitiveness of our framework.

* accepted in SIGGRAPH 2020 


Financial Market Trend Forecasting and Performance Analysis Using LSTM

Mar 31, 2020
Jonghyeon Min

Financial market trend forecasting is emerging as a hot topic in financial markets today. Many challenges remain, and a great deal of related research has been actively conducted. In particular, recent research on neural network-based financial market trend prediction has attracted much attention. However, previous studies do not deal with financial market forecasting methods based on LSTM, which performs well on time series data. There is also a lack of comparative analysis of the performance of neural network-based prediction techniques and traditional prediction techniques. In this paper, we propose a financial market trend forecasting method using LSTM and compare its performance with existing financial market trend forecasting methods through experiments. This method prepares the input data set through data preprocessing so as to reflect all the fundamental data, technical data and qualitative data used in financial data analysis, and performs comprehensive financial market analysis through LSTM. We experiment with and compare the performance of existing financial market trend forecasting models, as well as their performance under different financial market environments. In addition, we implement the proposed method using open-source tools and platforms and forecast financial market trends using various financial data indicators.
