Wei Ding

Scalable and Accurate Online Feature Selection for Big Data

Jul 28, 2016
Kui Yu, Xindong Wu, Wei Ding, Jian Pei

Feature selection is important in many big data applications. Two critical challenges are closely associated with big data. First, in many big data applications, the dimensionality is extremely high, in the millions, and keeps growing. Second, big data applications call for highly scalable feature selection algorithms that work in an online manner, such that each feature can be processed in a sequential scan. In this paper, we present SAOLA, a Scalable and Accurate OnLine Approach for feature selection. With a theoretical analysis of bounds on the pairwise correlations between features, SAOLA employs novel pairwise comparison techniques and maintains a parsimonious model over time in an online manner. Furthermore, to deal with upcoming features that arrive in groups, we extend the SAOLA algorithm and propose a new group-SAOLA algorithm for online group feature selection. The group-SAOLA algorithm can maintain online a set of feature groups that is sparse at the level of both groups and individual features simultaneously. An empirical study using a series of benchmark real data sets shows that our two algorithms, SAOLA and group-SAOLA, are scalable on data sets of extremely high dimensionality and outperform state-of-the-art feature selection methods.

* This paper has been accepted by ACM Transactions on Knowledge Discovery from Data (TKDD) and will be available soon
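
To make the idea of processing features one at a time with pairwise comparisons concrete, here is a minimal sketch of a streaming selector. It is not the SAOLA algorithm from the paper: the use of absolute Pearson correlation, the `min_relevance` threshold, the specific redundancy rule, and all names (`abs_pearson`, `online_select`) are assumptions chosen for illustration only.

```python
import math

def abs_pearson(x, y):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)

def online_select(feature_stream, target, min_relevance=0.3):
    """Scan features once, keeping a small set of relevant, non-redundant ones.

    Illustrative rule (an assumption, not the paper's): of two features that
    correlate with each other at least as strongly as the weaker one correlates
    with the target, only the more relevant one is kept.
    """
    selected = {}  # feature name -> (values, relevance to target)
    for name, values in feature_stream:
        rel = abs_pearson(values, target)
        if rel < min_relevance:
            continue  # weakly relevant feature: discard immediately
        keep_new = True
        for other, (ovals, orel) in list(selected.items()):
            pair = abs_pearson(values, ovals)
            if orel >= rel and pair >= rel:
                keep_new = False  # new feature is redundant given `other`
                break
            if rel > orel and pair >= orel:
                del selected[other]  # `other` is redundant given the new feature
        if keep_new:
            selected[name] = (values, rel)
    return list(selected)

# Toy data: the target depends on x1 and x2; x3 is a noisy copy of x1.
x1 = [1, -1, 1, -1]
x2 = [1, 1, -1, -1]
target = [2 * a + b for a, b in zip(x1, x2)]
stream = [("x3", [1, -1, 2, -1]), ("noise", [1, -1, -1, 1]), ("x1", x1), ("x2", x2)]
print(online_select(stream, target))
```

In this toy stream, `x3` is accepted first and later evicted when the more relevant `x1` arrives, which illustrates how the selected set can stay small as features keep arriving.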

Scale Normalization

Apr 26, 2016
Henry Z. Lo, Kevin Amaral, Wei Ding

One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding or vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant normalization and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and that maintaining it leads to faster learning.

* Preliminary version submitted to ICLR workshop 2016
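
As a rough illustration of what keeping a layer scale-preserving can mean, the sketch below rescales a weight matrix so that its average gain `||W x|| / ||x||` over a batch of sampled inputs equals 1. This is a stand-in chosen for simplicity, not the determinant or scale normalization procedure from the paper; the function names and the sampling scheme are assumptions.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def norm(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in x))

def mean_gain(W, samples):
    """Average ratio ||W x|| / ||x|| over the sample inputs."""
    return sum(norm(matvec(W, x)) / norm(x) for x in samples) / len(samples)

def scale_normalize(W, samples):
    """Return W divided by its estimated gain, so the mean gain on `samples` is 1."""
    g = mean_gain(W, samples)
    return [[w / g for w in row] for row in W]

rng = random.Random(0)
dim = 8
W = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
samples = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(32)]

print("gain before:", mean_gain(W, samples))
print("gain after: ", mean_gain(scale_normalize(W, samples), samples))
```

Because the gain is estimated from samples, this corresponds loosely to a stochastic approach; an exact variant would need a property of `W` itself rather than a sampled estimate.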

Rapid building detection using machine learning

Mar 14, 2016
Joseph Paul Cohen, Wei Ding, Caitlin Kuhlman, Aijun Chen, Liping Di

This work describes algorithms for performing discrete object detection, specifically in the case of buildings, where usually only low-quality RGB-only geospatial reflective imagery is available. We utilize new candidate search and feature extraction techniques to reduce the problem to a machine learning (ML) classification task. Here we can harness the complex patterns of contrast features contained in training data to establish a model of buildings. We avoid costly sliding windows to generate candidates; instead, we stitch together well-known image processing techniques to produce candidates for building detection that cover 80-85% of buildings. Reducing the number of possible candidates is important due to the scale of the problem. Each candidate is subjected to classification which, although linear, costs time and prohibits large-scale evaluation. We propose a candidate alignment algorithm that boosts classification performance to 80-90% precision in linear time, and show that it has negligible cost. We also propose a new concept called a Permutable Haar Mesh (PHM), which we use to form and traverse a search space to recover candidate buildings that were lost in the initial preprocessing phase.

* Accepted to be published in Applied Intelligence 2016

LOFS: Library of Online Streaming Feature Selection

Mar 02, 2016
Kui Yu, Wei Ding, Xindong Wu

As an emerging research direction, online streaming feature selection deals with sequentially added dimensions in a feature space while the number of data instances is fixed. Online streaming feature selection provides a new, complementary algorithmic methodology to enrich online feature selection, especially targeting high dimensionality in big data analytics. This paper introduces the first comprehensive open-source library for use in MATLAB that implements the state-of-the-art algorithms of online streaming feature selection. The library is designed to facilitate the development of new algorithms in this exciting research direction and to enable comparisons between new methods and existing ones.

* Knowledge-based Systems, 113(2016):1-3

Crater Detection via Convolutional Neural Networks

Jan 05, 2016
Joseph Paul Cohen, Henry Z. Lo, Tingting Lu, Wei Ding

Craters are among the most studied geomorphic features in the Solar System because they yield important information about past and present geological processes and about the relative ages of observed geologic formations. We present a method for automatic crater detection using advanced machine learning to deal with the large amount of satellite imagery collected. The challenge of automatically detecting craters comes from their complex surface, because their shape erodes over time to blend into the surrounding terrain. Bandeira provided a seminal dataset that embodies this challenge, which remains an unsolved pattern recognition problem to this day. There has been work to solve this challenge based on extracting shape and contrast features and then applying classification models to those features. The limiting factor in this existing work is the use of hand-crafted filters on the image, such as Gabor or Sobel filters or Haar features. Constructing these hand-crafted methods relies on domain knowledge. We would like to learn the optimal filters and features from training examples. To learn filters and features dynamically, we look to Convolutional Neural Networks (CNNs), which have shown their dominance in computer vision. The power of CNNs is that they can learn image filters that generate features for high-accuracy classification.

* 2 Pages. Submitted to 47th Lunar and Planetary Science Conference (LPSC 2016)
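
For readers unfamiliar with the hand-crafted filters mentioned above, the sketch below applies a fixed Sobel kernel to a tiny image using the sliding-window operation that a CNN layer also computes (cross-correlation). The difference in a CNN is that the kernel values are learned from training examples rather than fixed by hand. The helper name `correlate2d` and the toy image are assumptions for illustration, not code from the paper.

```python
def correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation of a 2D list `image` with `kernel`."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Hand-crafted Sobel kernel that responds to horizontal intensity changes.
SOBEL_X = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

# Toy 4x5 image with a vertical edge between the second and third columns.
image = [[0, 0, 1, 1, 1] for _ in range(4)]

response = correlate2d(image, SOBEL_X)
print(response)  # [[4, 4, 0], [4, 4, 0]]
```

The response is large where the window straddles the edge and zero over the flat region, which is the kind of contrast feature that downstream classifiers consume.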

A Common-Factor Approach for Multivariate Data Cleaning with an Application to Mars Phoenix Mission Data

Oct 07, 2015
Dongping Fang, Elizabeth Oberlin, Wei Ding, Samuel P. Kounaves

Data quality is fundamentally important to ensure the reliability of data for stakeholders to make decisions. In real-world applications, such as scientific exploration of extreme environments, it is unrealistic to require raw data collected to be perfect. As data miners, when it is infeasible to physically know the why and the how in order to clean up the data, we propose to seek the intrinsic structure of the signal to identify the common factors of multivariate data. Using our new data-driven learning method, the common-factor data cleaning approach, we address an interdisciplinary challenge in multivariate data cleaning when complex external impacts appear to interfere with multiple data measurements. Existing data analyses typically process one signal measurement at a time without considering the associations among all signals. We analyze all signal measurements simultaneously to find the hidden common factors that drive all measurements to vary together, but not as a result of the true data measurements. We use common factors to reduce the variations in the data without changing the base mean level of the data, to avoid altering the physical meaning.

* 12 pages, 10 figures, 1 table
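
The idea of removing shared variation while leaving each signal's mean untouched can be sketched as follows. This is a deliberately simplified stand-in, not the paper's method: the common factor is assumed here to be the cross-signal average of the mean-centered signals, and each signal's projection onto that factor is subtracted. The function names are hypothetical.

```python
def variance(xs):
    """Population variance of a sequence."""
    m = sum(xs) / len(xs)
    return sum((v - m) ** 2 for v in xs) / len(xs)

def remove_common_factor(signals):
    """Subtract each signal's projection onto a shared factor, preserving means.

    Assumption for illustration: the common factor is the average of the
    mean-centered signals at each time step.
    """
    n = len(signals[0])
    means = [sum(s) / n for s in signals]
    centered = [[v - m for v in s] for s, m in zip(signals, means)]
    factor = [sum(c[t] for c in centered) / len(centered) for t in range(n)]
    ff = sum(f * f for f in factor)
    cleaned = []
    for c, m in zip(centered, means):
        # Least-squares loading of this signal on the common factor.
        loading = sum(a * f for a, f in zip(c, factor)) / ff if ff else 0.0
        cleaned.append([m + a - loading * f for a, f in zip(c, factor)])
    return cleaned

# Two measurements that share an external disturbance `d` at different strengths.
d = [1, -1, 1, -1]
s1 = [10 + 2 * a + e for a, e in zip(d, [0.1, 0.1, -0.1, -0.1])]
s2 = [5 + a + e for a, e in zip(d, [-0.1, 0.1, 0.1, -0.1])]

for raw, clean in zip([s1, s2], remove_common_factor([s1, s2])):
    print(sum(raw) / len(raw), sum(clean) / len(clean), variance(raw), variance(clean))
```

Because both the centered signals and the factor have zero mean, the subtraction changes only the variation around each signal's mean, not the mean itself.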
