9/28/2016

Data Pre-processing Computing

Today, I read the book “Applied Predictive Modeling”. I started following the code in the book.

Summary
Know the meaning of the following code:
1.
> segData=subset(segmentationOriginal, Case=="Train")   # keep only the training-set cells
> case=segData$Case                                     # save the Case field before dropping it
> segData=segData[, -(1:3)]                             # remove the identifier, Case, and Class columns (the first three)
> statusColNum=grep("Status", names(segData))           # column numbers of the binary "Status" predictors
> segData=segData[, -statusColNum]                      # drop them so only the numeric predictors remain

2.
> grep("[a-z]", letters)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
>
> txt <- c("arm","foot","lefroo", "bafoobar")
> if(length(i <- grep("foo", txt)))
+    cat("'foo' appears at least once in\n\t", txt, "\n")
'foo' appears at least once in
         arm foot lefroo bafoobar
> i # 2 and 4
[1] 2 4
> txt[i]

Transformations
3.
> library(e1071)
> skewness(segData$AngleCh1)                 # skewness of a single predictor
> skewValues=apply(segData, 2, skewness)     # skewness of every predictor column
> head(skewValues)

4.
> library(caret)
> Ch1AreaTrans=BoxCoxTrans(segData$AreaCh1)
> Ch1AreaTrans
Box-Cox Transformation

1009 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  150.0   194.0   256.0   325.1   376.0  2186.0

Largest/Smallest: 14.6
Sample Skewness: 3.53

Estimated Lambda: -0.9
> head(segData$AreaCh1)
[1] 819 431 298 256 258 358
> predict(Ch1AreaTrans, head(segData$AreaCh1))
[1] 1.108458 1.106383 1.104520 1.103554 1.103607 1.105523
> (819^(-0.9)-1)/(-0.9)   # apply the Box-Cox formula (x^lambda-1)/lambda by hand to the first value
[1] 1.108458

5.
> pcaobject=prcomp(segData, center=TRUE, scale.=TRUE)
> # percentage of total variance explained by each component
> percentvariance=pcaobject$sdev^2/sum(pcaobject$sdev^2)*100
> percentvariance[1:3]
> head(pcaobject$x[, 1:5])          # scores: the transformed values for the first five components
> head(pcaobject$rotation[, 1:5])   # loadings of the predictors on the first five components

The caret package contains the spatialSign function for the spatial sign transformation.
The impute.knn function in the impute package uses K-nearest neighbors to estimate missing data.
The preProcess function in caret applies imputation methods based on K-nearest neighbors or bagged trees.
The order in which the possible transformations are applied is transformation, centering, scaling, imputation, feature extraction, and then spatial sign.
BoxCoxTrans finds the appropriate transformation and stores the information about the data, including the estimated lambda, so the same transformation can be applied to new data with predict; two short sketches follow below.
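
As a minimal sketch (assuming the segData object from item 1; the names trans and transformed are my own placeholders), a single preProcess call can chain several of these steps, and caret applies them in the order given above no matter how the methods are listed. "knnImpute" or "bagImpute" could be added to the method vector when the data contain missing values.

> library(caret)
> # Box-Cox transform, center, and scale the predictors, then extract principal components
> trans=preProcess(segData, method=c("BoxCox", "center", "scale", "pca"))
> transformed=predict(trans, segData)   # apply the stored transformations to the data
> head(transformed[, 1:5])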
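
The spatial sign transformation can also be applied on its own (again a sketch with placeholder names; the predictors should be centered and scaled first):

> # center and scale, then project each sample onto the unit sphere
> segscaled=scale(segData)
> segss=spatialSign(segscaled)
> head(segss[, 1:5])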

Filtering
1.
> nearZeroVar(segData)   # column indices of predictors with near-zero variance

2.
> correlations=cor(segData)
> dim(correlations)
> correlations[1:5, 1:5]

3.
> library(corrplot)
> corrplot(correlations, order="hclust")

4.
> highcorr=findCorrelation(correlations, cutoff=0.75)   # predictors recommended for removal
> length(highcorr)
> filteredsegdata=segData[, -highcorr]                  # drop the highly correlated predictors


Tomorrow, I will continue to learn computing.
