9/28/2016

Applied Predictive Modeling 3

Today, I read Chapter 3 of the book “Applied Predictive Modeling”, which covers the different ways to deal with predictors during data pre-processing.

Summary
Missing data should not be confused with censored data where the exact value is missing but something is known about its value.
One popular technique for imputation is a K-nearest neighbor model. A new sample is imputed by finding the samples in the training set ‘closest’ to it and averaging these nearby points to fill in the missing value.
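
To make the idea concrete, here is a minimal sketch of K-nearest neighbor imputation in plain Python. It is only an illustration under some assumptions of mine: the training rows are complete, the predictors are already on comparable scales, and distance is Euclidean over the columns that are observed in the new sample. The function name knn_impute and the toy data are hypothetical.

```python
import math

def knn_impute(sample, training, k=2):
    """Fill None entries in `sample` with the column mean of the k training
    rows closest to it (Euclidean distance on the observed columns only)."""
    observed = [i for i, v in enumerate(sample) if v is not None]

    def distance(row):
        return math.sqrt(sum((row[i] - sample[i]) ** 2 for i in observed))

    # The k nearest training rows, closest first.
    neighbors = sorted(training, key=distance)[:k]
    return [
        v if v is not None else sum(row[i] for row in neighbors) / len(neighbors)
        for i, v in enumerate(sample)
    ]

# Toy training set: two clusters of complete rows (made-up numbers).
training = [
    [1.0, 2.0, 10.0],
    [1.2, 1.9, 12.0],
    [8.0, 9.0, 50.0],
    [7.5, 9.5, 54.0],
]
imputed = knn_impute([1.1, 2.1, None], training, k=2)
print(imputed)
```

The missing third value is filled with the average of the third column over the two nearest rows. In practice the predictors would usually be centered and scaled first so that no single column dominates the distance.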

Removing predictors
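A tree-based model is impervious to an uninformative predictor, such as one with a single unique value (zero variance), since it would never be used in a split. However, a model such as linear regression would find these data problematic, and they are likely to cause an error in the computations.
Near-zero variance predictors, which take only a handful of unique values with most of them occurring very rarely, can be detected with two criteria:
1.     The fraction of unique values over the sample size is low (say 10%),
2.     The ratio of the frequency of the most prevalent value to that of the second most prevalent value is large (say around 20).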
If both of these criteria are true and the model in question is susceptible to this type of predictor, it may be advantageous to remove the variable from the model.
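
The two criteria can be checked with a few lines of Python. This is a rough sketch, not the book's own code: the function name is_near_zero_variance is hypothetical, and the cut-offs of 10% and 20 are used as defaults.

```python
from collections import Counter

def is_near_zero_variance(values, unique_cut=0.10, freq_cut=20.0):
    """Return True when a predictor has zero variance or meets both criteria:
    few unique values relative to n, and one value dominating the runner-up."""
    counts = Counter(values).most_common()
    if len(counts) == 1:
        return True  # a single unique value: zero variance
    unique_fraction = len(counts) / len(values)
    freq_ratio = counts[0][1] / counts[1][1]
    return unique_fraction < unique_cut and freq_ratio > freq_cut

sparse = [0] * 97 + [1] * 3      # unique fraction 0.02, frequency ratio about 32
balanced = [0] * 50 + [1] * 50   # unique fraction 0.02, frequency ratio 1

print(is_near_zero_variance(sparse))
print(is_near_zero_variance(balanced))
```

Only the first predictor is flagged, because the second one fails the frequency-ratio criterion even though it has few unique values.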
Between-Predictor Correlations
If the first principal component accounts for a large percentage of the variance, this implies that there is at least one group of predictors that represent the same information.
Algorithm for removing the minimum number of predictors so that all pairwise correlations fall below a chosen threshold:
1.     Calculate the correlation matrix of the predictors.
2.     Determine the two predictors associated with the largest absolute pairwise correlation (call them predictors A and B).
3.     Determine the average correlation between A and the other variables. Do the same for predictor B.
4.     If A has a larger average correlation, remove it; otherwise, remove predictor B.
5.     Repeat Steps 2-4 until no absolute correlations are above the threshold.
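
The steps above can be sketched in plain Python as follows. This is my own reading of the algorithm, with a few assumptions: correlations are Pearson correlations taken in absolute value, the average in Step 3 is computed over the predictors still remaining, and the threshold of 0.75 is just an example value. The names pearson and filter_correlated and the toy data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def filter_correlated(columns, threshold=0.75):
    """columns maps predictor name -> list of values; returns the names kept."""
    names = list(columns)
    # Step 1: absolute correlation for every ordered pair of predictors.
    corr = {(a, b): abs(pearson(columns[a], columns[b]))
            for a in names for b in names if a != b}
    kept = list(names)

    def avg_corr(name):
        others = [o for o in kept if o != name]
        return sum(corr[(name, o)] for o in others) / len(others)

    while len(kept) > 1:
        # Step 2: the pair (A, B) with the largest absolute correlation.
        a, b = max(((p, q) for i, p in enumerate(kept) for q in kept[i + 1:]),
                   key=lambda pair: corr[pair])
        if corr[(a, b)] <= threshold:
            break  # Step 5: nothing left above the threshold
        # Steps 3-4: drop whichever of A and B is more correlated on average.
        kept.remove(a if avg_corr(a) > avg_corr(b) else b)
    return kept

columns = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 4, 6, 8, 10, 12.5],   # nearly a multiple of x1
    "x3": [1, -1, 1, -1, 1, -1],    # weakly related to both
}
kept = filter_correlated(columns, threshold=0.75)
print(kept)
```

With this toy data, x1 and x2 are almost perfectly correlated, so one of them is removed and x3 is kept.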

Adding Predictors
When a predictor is categorical, it is common to decompose the predictor into a set of more specific variables, for example one binary dummy variable per category.
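
As a small illustration, a categorical predictor can be decomposed into binary indicator columns like this. The helper name dummy_encode and the "is_" column prefix are hypothetical choices of mine.

```python
def dummy_encode(values):
    """Turn a list of category labels into one 0/1 indicator per level."""
    levels = sorted(set(values))
    return [{f"is_{level}": int(v == level) for level in levels} for v in values]

colors = ["red", "green", "blue", "green"]
encoded = dummy_encode(colors)
for row in encoded:
    print(row)
```

This produces the full set of indicators. For a model with an intercept, such as linear regression, one of the indicator columns is usually dropped because the full set always sums to one.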

Binning Predictors

One common approach to simplifying a data set is to take a numeric predictor and pre-categorize or ‘bin’ it into two or more groups prior to data analysis.
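
A quick sketch of what binning looks like in code, using the standard library only. The cut points, labels, and the helper name bin_value are made up for illustration.

```python
from bisect import bisect_right

def bin_value(x, cut_points, labels):
    """Map a number to a label; cut_points must be sorted, and labels
    needs exactly one more entry than cut_points."""
    return labels[bisect_right(cut_points, x)]

ages = [15, 34, 70]
binned = [bin_value(a, [18, 65], ["minor", "adult", "senior"]) for a in ages]
print(binned)
```

A value equal to a cut point falls into the upper group here, so an age of exactly 18 is labeled as adult.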

Tomorrow, I will start trying out the computing examples.