9/28/2016

Data Pre-processing Computing

Today, I read the book “Applied Predictive Modeling”. I started following the code in the book.

Summary
Know the meaning of the following code:
1.
> segData=subset(segmentationOriginal, Case=="Train")   # keep only the training-set cells
> case=segData$Case                                     # save the Case field before dropping it
> segData=segData[, -(1:3)]                             # remove the identifier, Case, and Class columns (the first three)
> statusColNum=grep("Status", names(segData))           # column numbers of the binary "Status" predictors
> segData=segData[, -statusColNum]                      # drop them so only the numeric predictors remain

2.
> grep("[a-z]", letters)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
>
> txt <- c("arm","foot","lefroo", "bafoobar")
> if(length(i <- grep("foo", txt)))
+    cat("'foo' appears at least once in\n\t", txt, "\n")
'foo' appears at least once in
         arm foot lefroo bafoobar
> i # 2 and 4
[1] 2 4
> txt[i]

Transformations
3.
> library(e1071)
> skewness(segData$AngleCh1)                 # skewness of a single predictor
> skewValues=apply(segData, 2, skewness)     # skewness of every predictor column
> head(skewValues)

4.
> library(caret)
> Ch1AreaTrans=BoxCoxTrans(segData$AreaCh1)
> Ch1AreaTrans
Box-Cox Transformation

1009 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  150.0   194.0   256.0   325.1   376.0  2186.0

Largest/Smallest: 14.6
Sample Skewness: 3.53

Estimated Lambda: -0.9
> head(segData$AreaCh1)
[1] 819 431 298 256 258 358
> predict(Ch1AreaTrans, head(segData$AreaCh1))
[1] 1.108458 1.106383 1.104520 1.103554 1.103607 1.105523
> (819^(-0.9)-1)/(-0.9)   # apply the Box-Cox formula (x^lambda-1)/lambda by hand to the first value
[1] 1.108458

5.
> pcaobject=prcomp(segData, center=TRUE, scale.=TRUE)
> # percentage of total variance explained by each component
> percentvariance=pcaobject$sdev^2/sum(pcaobject$sdev^2)*100
> percentvariance[1:3]
> head(pcaobject$x[, 1:5])          # scores: the transformed values for the first five components
> head(pcaobject$rotation[, 1:5])   # loadings of the predictors on the first five components

The caret package contains the spatialSign function for the spatial sign transformation.
The impute.knn function in the impute package uses K-nearest neighbors to estimate missing data.
The preProcess function in caret applies imputation methods based on K-nearest neighbors or bagged trees.
The order in which the possible transformations are applied is transformation, centering, scaling, imputation, feature extraction, and then spatial sign.
BoxCoxTrans finds the appropriate transformation and stores the information about the data, including the estimated lambda, so the same transformation can be applied to new data with predict; two short sketches follow below.
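
As a minimal sketch (assuming the segData object from item 1; the names trans and transformed are my own placeholders), a single preProcess call can chain several of these steps, and caret applies them in the order given above no matter how the methods are listed. "knnImpute" or "bagImpute" could be added to the method vector when the data contain missing values.

> library(caret)
> # Box-Cox transform, center, and scale the predictors, then extract principal components
> trans=preProcess(segData, method=c("BoxCox", "center", "scale", "pca"))
> transformed=predict(trans, segData)   # apply the stored transformations to the data
> head(transformed[, 1:5])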
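
The spatial sign transformation can also be applied on its own (again a sketch with placeholder names; the predictors should be centered and scaled first):

> # center and scale, then project each sample onto the unit sphere
> segscaled=scale(segData)
> segss=spatialSign(segscaled)
> head(segss[, 1:5])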

Filtering
1.
> nearZeroVar(segData)   # column indices of predictors with near-zero variance

2.
> correlations=cor(segData)
> dim(correlations)
> correlations[1:5, 1:5]

3.
> library(corrplot)
> corrplot(correlations, order="hclust")

4.
> highcorr=findCorrelation(correlations, cutoff=0.75)   # predictors recommended for removal
> length(highcorr)
> filteredsegdata=segData[, -highcorr]                  # drop the highly correlated predictors


Tomorrow, I will continue to learn computing.
