Today, I read the book “Applied Predictive Modeling”. The part I read is Chapter 3, which covers how to handle predictors in different ways during the data pre-processing procedure.
Summary
Missing data should not be confused with censored data, where the exact value is missing but something is known about its value (for example, a measurement below a detection limit is only known to be smaller than that limit).
One popular technique for imputation is a K-nearest neighbor model. A new sample is imputed by finding the samples in the training set ‘closest’ to it and averaging these nearby points to fill in the value.
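A minimal sketch of this idea using scikit-learn's KNNImputer (the toy matrix and the choice of two neighbors are just for illustration):

import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with one missing entry (np.nan) in the second column.
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 4.0],
              [4.0, 5.0]])

# For the missing entry, find the n_neighbors closest complete samples
# (by distance on the observed columns) and average their values.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))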
Removing Predictors
A tree-based model is impervious to an uninformative predictor, since it would never be used in a split. However, a model such as linear regression would find these data problematic, and they are likely to cause an error in the computations. A predictor can be flagged as having near-zero variance when:
1. The fraction of unique values over the sample size
is low (10%),
2. The ratio of the frequency of the most prevalent
value to that of the second most prevalent value is large (20).
If both of these criteria are true and the model in
question is susceptible to this type of predictor, it may be advantageous to
remove the variable from the model.
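A minimal sketch of these two rules of thumb with pandas (the 10% and 20 cutoffs follow the text; the example series is made up):

import pandas as pd

def near_zero_variance(x, freq_cut=20.0, unique_cut=0.10):
    """Flag a predictor as near-zero variance using the two criteria above."""
    counts = x.value_counts()
    # Ratio of the most prevalent value's frequency to the second most prevalent.
    freq_ratio = counts.iloc[0] / counts.iloc[1] if len(counts) > 1 else float("inf")
    # Fraction of unique values relative to the sample size.
    unique_frac = x.nunique() / len(x)
    return freq_ratio > freq_cut and unique_frac < unique_cut

x = pd.Series([0] * 98 + [1] * 2)   # highly imbalanced, very few unique values
print(near_zero_variance(x))        # True -> candidate for removal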
Between-Predictor Correlations
If the first principal component accounts for a large percentage of the variance, this implies that there is at least one group of predictors that represents the same information.
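One way to check this is to look at how much variance the first principal component explains; a minimal sketch with scikit-learn, assuming a hypothetical predictor matrix in which three columns are noisy copies of the same signal:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
# Three near-duplicate predictors plus one independent predictor.
X = np.hstack([base + 0.05 * rng.normal(size=(100, 1)) for _ in range(3)]
              + [rng.normal(size=(100, 1))])

pca = PCA().fit(StandardScaler().fit_transform(X))
print(pca.explained_variance_ratio_[0])  # a large share suggests a correlated group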
Algorithm:
1. Calculate the correlation matrix of the predictors.
2. Determine the two predictors associated with the largest absolute pairwise correlation (call them predictors A and B).
3. Determine the average correlation between A and the
other variables. Do the same for predictor B.
4. If A has a larger average correlation, remove it;
otherwise, remove predictor B.
5. Repeat Steps 2-4 until no absolute correlations are
above the threshold.
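A minimal sketch of this filter with pandas (the 0.75 threshold is an illustrative choice, not from the text):

import pandas as pd

def correlation_filter(df, threshold=0.75):
    """Greedily drop predictors following steps 1-5 above."""
    keep = df.copy()
    while True:
        corr = keep.corr().abs()                  # step 1: correlation matrix
        for c in corr.columns:
            corr.loc[c, c] = 0.0                  # ignore self-correlations
        if corr.values.max() < threshold:         # step 5: stopping condition
            break
        # Step 2: predictors A and B with the largest absolute correlation.
        a = corr.max().idxmax()
        b = corr[a].idxmax()
        # Steps 3-4: drop whichever has the larger average correlation with the rest.
        keep = keep.drop(columns=[a if corr[a].mean() >= corr[b].mean() else b])
    return keep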
Adding Predictors
When a predictor is categorical, it is common to
decompose the predictor into a set of more specific variables.
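For instance, a categorical predictor can be decomposed into dummy (indicator) variables, one per level; a minimal sketch with pandas, using a made-up column and levels:

import pandas as pd

df = pd.DataFrame({"credit_band": ["good", "bad", "good", "fair"]})
# Each level becomes its own 0/1 indicator column.
print(pd.get_dummies(df["credit_band"], prefix="credit_band"))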
Binning Predictors
One common approach to simplifying a data set is to
take a numeric predictor and pre-categorize or ‘bin’ it into two or more groups
prior to data analysis.
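A minimal sketch of binning a numeric predictor with pandas (the cut points and labels are arbitrary, for illustration only):

import pandas as pd

age = pd.Series([23, 37, 45, 62, 71])
# Pre-categorize the numeric predictor into three groups before modeling.
print(pd.cut(age, bins=[0, 30, 60, 120], labels=["young", "middle", "older"]))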
Tomorrow, I will start trying out the computations.