Today I read the book “Applied Predictive Modeling” and started working
through the code examples in it.
Summary
Understand what each of the following code snippets does:
1.
> library(AppliedPredictiveModeling)
> data(segmentationOriginal)
> segData=subset(segmentationOriginal, Case=="Train")
> case=segData$Case
> segData=segData[, -(1:3)]
> statusColNum=grep("Status", names(segData))
> segData=segData[, -statusColNum]
2.
> grep("[a-z]", letters)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
> txt <- c("arm","foot","lefroo", "bafoobar")
> if(length(i <- grep("foo", txt)))
+    cat("'foo' appears at least once in\n\t", txt, "\n")
'foo' appears at least once in
         arm foot lefroo bafoobar
> i # 2 and 4
[1] 2 4
> txt[i]
Transformations
3.
> library(e1071)
> skewness(segData$AngleCh1)
> skewValues=apply(segData, 2, skewness)
> head(skewValues)
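To see what the skewness numbers mean, here is a minimal base-R sketch of the moment-based formula. The helper name sample_skewness and the two toy vectors are my own and not from the book. The skewness function in e1071 supports several variants of the estimator, so its values can differ slightly from this one on small samples.

```r
# Moment-based sample skewness: the third central moment divided by
# the second central moment raised to the power 1.5.
# sample_skewness is a hypothetical helper, not part of e1071.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m3 / m2^1.5
}

symmetric <- c(1, 2, 3, 4, 5)      # symmetric around 3
right_skewed <- c(1, 1, 1, 2, 10)  # long right tail

print(sample_skewness(symmetric))     # zero for symmetric data
print(sample_skewness(right_skewed))  # positive for a right tail
```

Values far from zero, like the 3.53 reported for AreaCh1, suggest that a transformation is worth trying.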
4.
> library(caret)
> Ch1AreaTrans=BoxCoxTrans(segData$AreaCh1)
> Ch1AreaTrans
Box-Cox Transformation
1009 data points used to estimate Lambda
Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  150.0   194.0   256.0   325.1   376.0  2186.0
Largest/Smallest: 14.6
Sample Skewness: 3.53
Estimated Lambda: -0.9
> head(segData$AreaCh1)
[1] 819 431 298 256 258 358
> predict(Ch1AreaTrans, head(segData$AreaCh1))
[1] 1.108458 1.106383 1.104520 1.103554 1.103607 1.105523
> (819^(-0.9)-1)/(-0.9)
[1] 1.108458
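The manual check above is the Box-Cox formula (x^lambda - 1)/lambda applied to one value. A small base-R sketch of the whole transformation follows. The helper box_cox is my own name, not a caret function, and it assumes strictly positive input. With lambda = -0.9 it should reproduce the predict output above for the first six AreaCh1 values.

```r
# Box-Cox transform for strictly positive x:
#   (x^lambda - 1) / lambda   when lambda is not 0
#   log(x)                    when lambda is 0
# box_cox is a hypothetical helper, not a caret function.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

area <- c(819, 431, 298, 256, 258, 358)  # head(segData$AreaCh1) from above
transformed <- box_cox(area, lambda = -0.9)
print(round(transformed, 6))
```

Because lambda is close to -1 here, the transformation behaves much like an inverse, which pulls in the long right tail of the original values.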
5.
> pcaobject=prcomp(segData, center=TRUE, scale.=TRUE)
> percentvariance=pcaobject$sdev^2/sum(pcaobject$sdev^2)*100
> percentvariance[1:3]
> head(pcaobject$x[, 1:5])
> head(pcaobject$rotation[, 1:5])
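To check how the pieces of a prcomp object fit together, here is a small sketch on the built-in USArrests data, which stands in for segData so the code runs without the book's package. The variable names are my own.

```r
# PCA on a built-in data set (USArrests), standing in for segData.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# The variance of each component is its standard deviation squared.
percent_variance <- pca$sdev^2 / sum(pca$sdev^2) * 100
cumulative <- cumsum(percent_variance)

print(round(percent_variance, 1))
print(round(cumulative, 1))

# The scores in pca$x are the centered, scaled data multiplied by
# the loadings in pca$rotation.
scores_by_hand <- scale(USArrests) %*% pca$rotation
print(all.equal(unname(scores_by_hand), unname(pca$x)))
```

The cumulative percentages are a simple way to decide how many components to keep.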
spatialSign(…) in caret applies the spatial sign transformation.
impute.knn (impute package) uses K-nearest neighbors to estimate missing
data.
The preProcess function in caret can apply imputation methods based on
K-nearest neighbors or bagged trees.
preProcess applies the possible transformations in this order:
transformation, centering, scaling, imputation, feature extraction, and
then spatial sign.
BoxCoxTrans estimates the appropriate transformation from the data,
including Lambda, and predict then applies it to new data using that
stored information.
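The spatial sign transformation is easy to sketch in base R: each sample is divided by its Euclidean norm, so every row ends up on the unit sphere. The helper spatial_sign_sketch below is my own name and is not caret's spatialSign. It centers and scales the columns itself first, and USArrests is only a stand-in data set.

```r
# Spatial sign: project each row onto the unit sphere by dividing it
# by its Euclidean norm. Columns are centered and scaled first.
# spatial_sign_sketch is a hypothetical helper, not caret's spatialSign.
spatial_sign_sketch <- function(x) {
  x <- scale(as.matrix(x))     # center and scale each column
  norms <- sqrt(rowSums(x^2))  # Euclidean length of each row
  x / norms                    # divide every row by its own length
}

signed <- spatial_sign_sketch(USArrests)
print(head(round(signed, 3)))

# Every row should now have length 1.
print(range(sqrt(rowSums(signed^2))))
```

Since each row is rescaled using all of its predictors together, outlying samples are pulled toward the rest of the data.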
Filtering
1.
> nearZeroVar(segData)
2.
> correlations=cor(segData)
> dim(correlations)
> correlations[1:5, 1:5]
3.
> library(corrplot)
> corrplot(correlations, order="hclust")
4.
> highcorr=findCorrelation(correlations, cutoff=0.75)
> length(highcorr)
> filteredsegdata=segData[, -highcorr]
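To understand the idea behind the correlation filter, I wrote a rough base-R sketch of a greedy approach: find the most correlated pair, drop the member that is on average more correlated with the other predictors, and repeat until no pair exceeds the cutoff. The helper drop_correlated is my own name, mtcars is only a stand-in data set, and the result may not match the columns that findCorrelation would pick.

```r
# Greedy filter for highly correlated predictors.
# drop_correlated is a hypothetical helper; it may not select exactly
# the same columns as caret's findCorrelation.
drop_correlated <- function(x, cutoff = 0.75) {
  keep <- colnames(x)
  repeat {
    if (length(keep) < 2) break
    cors <- abs(cor(x[, keep, drop = FALSE]))
    diag(cors) <- 0
    if (max(cors) <= cutoff) break
    # Position of the pair with the largest absolute correlation
    pair <- which(cors == max(cors), arr.ind = TRUE)[1, ]
    # Drop whichever of the two is, on average, more correlated
    # with the other predictors
    avg <- rowMeans(cors[pair, , drop = FALSE])
    keep <- keep[-pair[which.max(avg)]]
  }
  keep
}

kept <- drop_correlated(mtcars, cutoff = 0.75)
print(kept)

# Largest remaining absolute pairwise correlation
remaining <- abs(cor(mtcars[, kept, drop = FALSE]))
diag(remaining) <- 0
print(max(remaining))
```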
Tomorrow, I will continue to learn computing.