11/30/2016

Chapters 5 & 6 (6.1, 6.2) application

Today, I read Chapters 5 & 6 (6.1, 6.2) to find something useful for my research.

Summary


Chapter 5 Application
Quantitative measures of performance
RMSE: the square root of MSE so that it is in the same units as the original data
R²: the proportion of the information in the data that is explained by the model. It is a measure of correlation, not accuracy.
The variance-bias trade-off: there is a tension between high variance and high bias; increasing bias can reduce model variance.
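
A minimal sketch of computing RMSE and R² by hand and with caret's postResample (the observed/predicted vectors here are made up, only to illustrate the two measures):

# hypothetical observed and predicted values, only to illustrate the measures
observed=c(3.1, 2.4, 5.6, 4.8, 6.0)
predicted=c(2.9, 2.8, 5.1, 5.0, 5.7)
# RMSE: square root of the mean squared error, in the units of the response
sqrt(mean((observed - predicted)^2))
# R^2: squared correlation between observed and predicted (correlation, not accuracy)
cor(observed, predicted)^2
# caret computes both at once
library(caret)
postResample(pred=predicted, obs=observed)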

Chapter 6 Application
Ordinary linear regression finds parameter estimates that have minimum bias, whereas ridge regression, the lasso, and the elastic net find estimates that have lower variance.
Linear regression:

SSE: sum-of-squared errors.
When there are highly correlated predictors, we should remove one of the offending predictors or find models which can tolerate collinearity.
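
A small sketch of both options, assuming a numeric predictor matrix predictors and a response y (neither is defined here); ridge regression via enet stands in for the penalized models mentioned above:

library(caret)
library(elasticnet)
# Option 1: drop one of each pair of predictors whose pairwise correlation exceeds 0.9
toohigh=findCorrelation(cor(predictors), cutoff=0.9)
filtered=predictors[, -toohigh]
# Option 2: keep all predictors and use a model that tolerates collinearity,
# e.g. ridge regression (enet with lambda > 0 and the full solution, fraction = 1)
ridgemodel=enet(x=as.matrix(predictors), y=y, lambda=0.01)
ridgepred=predict(ridgemodel, newx=as.matrix(predictors), s=1, mode="fraction", type="fit")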

Tomorrow, I will read more of the book to frame questions.

11/29/2016

Chapter 4 application

Today, I read Chapter 4 to find something which could be used in my research.

Summary
Over-fitting
Model tuning: search for an appropriate value of each tuning parameter
Data splitting: maximum dissimilarity sampling

Common steps in model building:
1.      Pre-processing the predictor data
2.      Estimating model parameters
3.      Selecting predictors for the model
4.      Evaluating model performance
5.      Fine tuning class prediction rules (via ROC curves, etc)
Resampling techniques:
1.      K-fold cross validation
2.      Generalized cross validation (approximate the leave-one-out error rate)

For example, in the last leave-one-out cross validation, df=1 and n=12.
3.      Repeated training/testing splits
4.      Bootstrap
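
A hedged sketch of how three of these schemes (k-fold CV, repeated training/test splits, and the bootstrap) can be requested through caret's trainControl; the object names are illustrative, and x/y stand for predictors and response:

library(caret)
# 10-fold cross-validation, repeated 5 times
cvcontrol=trainControl(method="repeatedcv", number=10, repeats=5)
# repeated training/test splits (leave-group-out CV, 75% of the data used for training)
lgocvcontrol=trainControl(method="LGOCV", p=0.75, number=25)
# the bootstrap
bootcontrol=trainControl(method="boot", number=50)
# any of these objects can be passed to train(), e.g.
# knnmodel=train(x, y, method="knn", trControl=cvcontrol)
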
Choose the final tuning parameters by weighing various factors
Choosing between models:
1. Start with several models that are the least interpretable and most flexible.
2. Investigate simpler models that are less opaque.
3. Consider using the simplest model that reasonably approximates the performance of the more complex methods.

11/28/2016

Chapter 3 application 2

Today, I finished reading Chapter 3 and found something which could be applied to my research. In addition, I looked into IP to try to understand what all the data mean, and I now have a basic understanding of some of them.

Summary

Tomorrow, I will read chapter 4 to find more information which may be useful for my research.

11/22/2016

Chapter 3 application

Today, I browsed the rest of the paper and started to look into the data. Chapter 3 describes many ways of data pre-processing, and they can be applied to the log data.

Summary

paper:


Chapter 3:
1. Individual predictors:
Centering: the predictor has a zero mean
Scaling: each value of the predictor variable is divided by its standard deviation. Scaling the data coerces the values to have a common standard deviation of one.
Removing skewness: Box-Cox transformation
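
A minimal sketch, assuming a numeric data frame called predictors, of requesting these three transformations together through caret's preProcess:

library(caret)
# Box-Cox to reduce skewness, then center (zero mean) and scale (unit standard deviation)
pp=preProcess(predictors, method=c("BoxCox", "center", "scale"))
transformed=predict(pp, predictors)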

After the holiday, I will start to frame questions for each chapter and try to apply them to the data.



11/21/2016

Preparing for using data

Today, I started to look into the data in IP and tried to become familiar with the IP software. Also, Yifu gave me a paper written by people from Hess Corporation. I read a little of it.

Summary
SPWLA paper
TOC: total organic carbon

According to the paper, the lithology is very complicated. I think that if I want to predict NMR permeability with these R methods, the lithology of the training data and the lithology of the location where I need to predict should be similar.


Tomorrow, I will continue to read the paper and look into the data.

11/18/2016

Chapter 10 (Computing finished)

Today, I fixed some problems in the Chapter 10 computing code.

Codes:
pp2=preProcess(age28data[, 1:7], "pca")
pca1=predict(pp2, age28data[, 1:7])
head(pca1)
pca1$Data="Training Set"
pca1$Data[startpoints]="Starting Values"
pca3=predict(pp2, cbresults[, 1:7])
pca3$Data="Cubist"
head(pca3)
pca4=predict(pp2, nnetresults[, 1:7])
pca4$Data="Neural Network"
head(pca4)
pcadata=rbind(pca1, pca3, pca4)
dim(pcadata)
pcadata$Data=factor(pcadata$Data,
                    levels = c("Training Set","Starting Values",
                    "Cubist","Neural Network"))
lim=extendrange(pcadata[, 1:2])

# map each Data label to an integer index so that colors can be assigned manually
m=with(pcadata, levels(factor(Data)))
m
colm=c()
for(i in 1:length(pcadata$Data))
{
  # position of this row's label within the set of levels
  colm[i]=which(pcadata$Data[i]==m)
}
colm

xyplot(PC2 ~ PC1, data = pcadata, groups = Data,
       auto.key = list(columns = 2),
       xlim = lim, ylim = lim,
       col=colm,
       type = c("g", "p")
      )

xyplot(PC2 ~ PC1, data = pcadata, groups = Data,
       auto.key = list(columns = 2),
       xlim = lim, ylim = lim,
       col=c("blue", "pink", "red", "green"),
       type = c("g", "p")
)

Next week, I will start to look into the NMR data.

11/17/2016

Chapter 10 computing

Today, I did computing of Chapter 10.

Codes:
modelPrediction=function(x, mod)
{
  if(x[1] < 0 | x[1] > 1) return(10^38)
  if(x[2] < 0 | x[2] > 1) return(10^38)
  if(x[3] < 0 | x[3] > 1) return(10^38)
  if(x[4] < 0 | x[4] > 1) return(10^38)
  if(x[5] < 0 | x[5] > 1) return(10^38)
  if(x[6] < 0 | x[6] > 1) return(10^38)
  x=c(x, 1 - sum(x))
  if(x[7] < 0.05) return(10^38)
  tmp <- as.data.frame(t(x))
  names(tmp)=c('Cement','BlastFurnaceSlag','FlyAsh',
               'Superplasticizer','CoarseAggregate',
               'FineAggregate', 'Water')
  tmp$Age=28
  -predict(mod, tmp)
}

# Cubist model
cbresults=startingvalues
cbresults$Water=NA
cbresults$Prediction=NA
cbresults
for(i in 1:nrow(cbresults))
{
  results=optim(unlist(cbresults[i,1:6]),
                     modelPrediction,
                     method = "Nelder-Mead",
                    control=list(maxit=5000),
                      mod = cbmodel)
  cbresults$Prediction[i]=-results$value
  # save the final mixture values
  cbresults[i,1:6]=results$par
}
# reminder: apply(x, 2, sum) gives column sums, apply(x, 1, sum) gives row sums
cbresults$Water=1 - apply(cbresults[,1:6], 1, sum)
# Keep the top three mixtures, -: decreasing order
cbresults=cbresults[order(-cbresults$Prediction),][1:3,]
cbresults$Model="Cubist"
cbresults

       Cement BlastFurnaceSlag      FlyAsh Superplasticizer CoarseAggregate FineAggregate      Water Prediction  Model
523 0.1747206       0.17969922 0.012880890     0.0027363548       0.3200175     0.2532112 0.05673427  102.44370 Cubist
495 0.2098331       0.05295034 0.006751067     0.0003692619       0.3538669     0.3238973 0.05233208   86.01767 Cubist
477 0.2110388       0.04514574 0.062017524     0.0023323792       0.3178446     0.3077021 0.05391885   85.51025 Cubist

# neural network model
nnetresults=startingvalues
nnetresults$Water=NA
nnetresults$Prediction=NA
nnetresults
for(i in 1:nrow(nnetresults))
{
  results=optim(unlist(nnetresults[i,1:6]),
                modelPrediction,
                method = "Nelder-Mead",
                control=list(maxit=5000),
                mod = nnetmodel)
  nnetresults$Prediction[i]=-results$value
  nnetresults[i,1:6]=results$par
}
nnetresults$Water=1 - apply(nnetresults[,1:6], 1, sum)
nnetresults=nnetresults[order(-nnetresults$Prediction),][1:3,]
nnetresults$Model="NNet"
nnetresults

       Cement BlastFurnaceSlag       FlyAsh Superplasticizer CoarseAggregate FineAggregate      Water Prediction Model
69  0.1617514       0.13291706 5.509921e-10      0.007644831       0.3905959     0.2564540 0.05063673   81.92797  NNet
357 0.1366392       0.04989215 2.395829e-08      0.002012726       0.4513948     0.3012054 0.05885566   80.16362  NNet
477 0.2145572       0.04306283 4.614330e-02      0.001330241       0.2993219     0.3299018 0.06568276   79.59500  NNet
(I ran into some problems drawing the graphs. Tomorrow, I will try to fix them.)

Tomorrow, I will fix these problems in the code.

11/16/2016

Chapter 10 computing

Today, I continued to do computing of Chapter 10.

Codes:
# plot the RMSE values
parallelplot(allresamples)
# using R2
parallelplot(allresamples, metric="Rsquared")


nnetpredictions=predict(nnetmodel, testset)
gbmpredictions=predict(gbmmodel, testset)
cbpredictions=predict(cbmodel, testset)

age28data=subset(trainingset, Age==28)
dim(age28data)
# remove the age and compressive strength columns and
# then center and scale the predictor columns
pp1=preProcess(age28data[,-(8:9)], c("center", "scale"))
scaledtrain=predict(pp1, age28data[, 1:7])
dim(scaledtrain)
# a single random mixture is selected to initialize the maximum dissimilarity sampling process
set.seed(91)
startmixture=sample(1:nrow(age28data), 1)
starters=scaledtrain[startmixture, 1:7]
#select 14 more mixtures to complete a diverse set of starting points for the search algorithms
library(proxy)
maxdis=maxDissim(starters, scaledtrain, 14)
maxdis
startpoints=c(startmixture, maxdis)
starters=age28data[startpoints, 1:7]
starters
# all seven mixture proportions should add to one
# the water proportion will be determined by the sum of the other six ingredient proportions
# remove water
startingvalues=starters[, -4]

Tomorrow, I will continue to do computing of Chapter 10.

11/15/2016

Chapter 10 computing

Today, I did computing of Chapter 10.

Codes:
library(AppliedPredictiveModeling)
data(concrete)
str(concrete)
str(mixtures)
library(Hmisc)
library(caret)
# g: grid; p: points; smooth: smoother; between: add space between panels
featurePlot(x=concrete[,-9], y=concrete$CompressiveStrength,
            between=list(x=1, y=1),
            type=c("g", "p", "smooth"))
# ? averaging the replicated mixtures and splitting the data into training and test sets
library(plyr)
averaged=ddply(mixtures, .(Cement, BlastFurnaceSlag, FlyAsh, Water, Superplasticizer, CoarseAggregate,
                           FineAggregate, Age), function(x) c(CompressiveStrength=mean(x$CompressiveStrength)))
str(averaged)
set.seed(975)
fortraining=createDataPartition(averaged$CompressiveStrength, p=3/4)[[1]]
trainingset=averaged[fortraining,]
testset=averaged[-fortraining,]
dim(trainingset)
dim(testset)
# The dot in the formula below is shorthand for all predictors and
# (.)^2 expands into a model with all the linear terms and all two-factor interactions.
modformula=paste("CompressiveStrength ~ (.)^2 + I(Cement^2) + ",
                 "I(BlastFurnaceSlag^2)+I(FlyAsh^2)+I(Water^2)+",
                 "I(Superplasticizer^2)+I(CoarseAggregate^2)+",
                 "I(FineAggregate^2)+I(Age^2)")
modformula=as.formula(modformula)
modformula
# the number of predictors in modformula differs from the book, so I tested a simpler alternative, modformula2:
modformula2=paste("CompressiveStrength ~ (.)^2")
modformula2=as.formula(modformula2)
modformula2
# modformula and modformula2 give different linear regression results even though they have the same number of predictors,
# so I guess that modformula is the correct one
controlobject=trainControl(method="repeatedcv", repeats=5, number=10)
set.seed(669)
linearreg=train(modformula, data=trainingset, method="lm", trControl=controlobject)
linearreg

# the other two linear models were created
set.seed(669)
plsmodel=train(modformula, data=trainingset, method="pls", preProcess = c("center", "scale"),
               tuneLength = 15, trControl = controlobject)
plsmodel
# centered (44) and scaled (44): 44 is right.
library(elasticnet)
enetgrid2=expand.grid(.lambda=c(0, 0.001, 0.01, 0.1),
                      .fraction=seq(0.05, 1, length=20))
set.seed(669)
enetmodel2=train(modformula, data=trainingset, method="enet", preProcess = c("center", "scale"),
                 tuneGrid=enetgrid2, trControl=controlobject)
enetmodel2

# MARS
library(earth)
set.seed(669)
earthmodel=train(CompressiveStrength~., data=trainingset, method="earth",
                 tuneGrid=expand.grid(.degree=1, .nprune=2:25), trControl=controlobject)
earthmodel

#SVMs
library(kernlab)
set.seed(669)
svmmodel=train(CompressiveStrength~., data=trainingset, method="svmRadial",
               tuneLength=15, preProcess=c("center", "scale"), trControl=controlobject)
svmmodel

# Neural Networks
library(nnet)
nnetgrid=expand.grid(.decay=c(0.001, 0.01, 0.1), .size=seq(1, 27, by=2), .bag=FALSE)
set.seed(669)
nnetmodel=train(CompressiveStrength~., data=trainingset, method="avNNet",
                tuneGrid=nnetgrid, preProcess=c("center", "scale"),
                linout=TRUE, trace=FALSE, maxit=1000, trControl=controlobject)
nnetmodel

# regression and model trees
library(rpart)
set.seed(669)
rpartmodel=train(CompressiveStrength~., data=trainingset, method="rpart",
                 tuneLength=30, trControl=controlobject)
rpartmodel

set.seed(669)
library(party)
ctreemodel=train(CompressiveStrength~., data=trainingset, method="ctree",
                 tuneLength=10, trControl=controlobject)
ctreemodel

set.seed(669)
library(RWeka)
mtmodel=train(CompressiveStrength~., data=trainingset, method="M5",
              trControl=controlobject)
mtmodel

# remaining model objects
library(ipred)
library(plyr)
library(e1071)
set.seed(669)
treebagmodel=train(CompressiveStrength~., data=trainingset, method="treebag",
                   trControl=controlobject)
treebagmodel

library(randomForest)
set.seed(669)
rfmodel=train(CompressiveStrength~., data=trainingset, method="rf",
              tuneLength=7, ntree=1000, importance=TRUE,
              trControl=controlobject)
rfmodel

gbmgrid=expand.grid(.interaction.depth = seq(1, 7, by = 2),
                    .n.trees = seq(100, 1000, by = 50),
                    .shrinkage = c(0.01, 0.1), .n.minobsinnode=10)
library(gbm)
set.seed(669)
gbmmodel=train(CompressiveStrength~., data=trainingset, method="gbm",
               tuneGrid=gbmgrid, verbose=FALSE, trControl=controlobject)
gbmmodel

cubistgrid=expand.grid(.committees = c(1, 5, 10, 50, 75, 100),
                       .neighbors = c(0, 1, 3, 5, 7, 9))
library(Cubist)
set.seed(669)
cbmodel=train(CompressiveStrength~., data=trainingset, method="cubist",
              tuneGrid=cubistgrid, trControl=controlobject)
cbmodel
allresamples=resamples(list("Linear Reg"=linearreg, "PLS"=plsmodel,
                            "Elastic Net"=enetmodel2, "MARS"=earthmodel,
                            "SVM"=svmmodel,
                            # "Neural Networks"=nnetmodel,
                            "CART"=rpartmodel, "Cond Inf Tree" = ctreemodel,
                            "Bagged Tree" = treebagmodel, "Boosted Tree" = gbmmodel,
                            "Random Forest" = rfmodel, "Cubist" = cbmodel))
(There should be a figure comparing these models. However, nnetmodel takes so long to train that it has not finished yet. I will let it run overnight and it should produce the result tomorrow.)

Tomorrow, I will continue to do computing of Chapter 10.

11/14/2016

Chapter 10 (10.1 10.2 10.3)

Today, I read Chapter 10 (10.1 10.2 10.3) and started to do computing of Chapter 10.

Summary
Case Study: Compressive Strength of Concrete Mixtures
A balanced design is one where no one experimental factor (i.e., the predictors) has more focus than the others. In most cases, this means that each predictor has the same number of possible levels and that the frequencies of the levels are equivalent for each factor.
Sequential experimentation
1.     Screen a large number of possible experimental factors to determine important factors
2.     More focused experiments are created with subset of important factors
3.     The nature of relationship between the important factors can be further elucidated
4.     Fine-tune a small number of important factors
In this chapter, models will be created to help find potential recipes to maximize compressive strength.
Model Building Strategy
A data splitting approach will be taken for this case study. A random holdout set of 25% (n = 247) will be used as a test set, and five repeats of 10-fold cross-validation will be used to tune the various models.
Model Performance
The top performing models were tree ensembles (random forest and boosting), rule ensembles (Cubist), and neural networks.
Optimizing Compressive Strength
Many rely on determining the gradient (i.e., first derivative) of the prediction equation. Several of the models have smooth prediction equations (e.g., neural networks and SVMs).
Neural networks: smooth model

Cubist: non-smooth model

Computing
library(AppliedPredictiveModeling)
data(concrete)
str(concrete)
str(mixtures)
library(Hmisc)
library(caret)
# g: grid; p: points; smooth: smoother; between: add space between panels
featurePlot(x=concrete[,-9], y=concrete$CompressiveStrength,
            between=list(x=1, y=1),
            type=c("g", "p", "smooth"))
#averaging the replicated mixtures and splitting the data into training and test sets
library(plyr)
averaged=ddply(mixtures, .(Cement, BlastFurnaceSlag, FlyAsh, Water, Superplasticizer, CoarseAggregate,
                           FineAggregate, Age), function(x) c(CompressiveStrength=mean(x$CompressiveStrength)))
str(averaged)
set.seed(975)
fortraining=createDataPartition(averaged$CompressiveStrength, p=3/4)[[1]]
trainingset=averaged[fortraining,]
testset=averaged[-fortraining,]

Tomorrow, I will continue to do computing of Chapter 10.

11/11/2016

Chapter 9

Today, I read Chapter 9.

Summary
A Summary of Solubility Models
With the exception of poorly performing models, there is a fairly high correlation between the results derived from resampling and the test set (0.9 for the RMSE and 0.88 for R2).
There was a “pack” of models that showed better results, including model trees, linear regression, penalized linear models, MARS, and neural networks.
The group of high-performance models include support vector machines (SVMs), boosted trees, random forests, and Cubist.


There are very few statistically significant differences among high-performance models. Given this, any of these models would be a reasonable choice.


Next week, I will read Chapter 10, which is the last chapter of Part 2 of the book. After finishing Chapter 10, I think it will be time to apply these methods to the real data.

11/10/2016

Chapter 8 Exercises

Today, I finished exercises of Chapter 8.

# 8.5
library(caret)
data(tecator)
absorp
# moisture fat protein
endpoints
absorpchange=absorp
dim(absorpchange)
absorpchange2=cbind(absorpchange, moisture=endpoints[,1])
absorpchange2=data.frame(absorpchange2)
library(RWeka)
treemodel1=M5P(moisture~., data=absorpchange2)
plot(treemodel1)
set.seed(100)
treemodel2=train(absorpchange2[,1:100], absorpchange2$moisture, method="M5", trControl = trainControl(method = "cv"),
             control= Weka_control(M=10))
treemodel2 # RMSE min = 6.5
plot(treemodel2)
toohigh=findCorrelation(cor(absorpchange2[,1:100]), cutoff = 0.999)
toohigh
trainabsorp=absorpchange2[,-toohigh]
treemodel3=M5P(moisture~., data=trainabsorp)
plot(treemodel3)
set.seed(100)
treemodel4=train(trainabsorp[,1:4], trainabsorp$moisture, method="M5", trControl = trainControl(method = "cv"),
                 control= Weka_control(M=10))
treemodel4 # RMSE min = 6.0
plot(treemodel4)
 

# 8.6
library(AppliedPredictiveModeling)
data(permeability)
dim(fingerprints)
permeability
library(caret)
lowfrequencies=nearZeroVar(fingerprints, freqCut = 20, uniqueCut = 10)
fingerfiltered=fingerprints[, -lowfrequencies]
dim(fingerfiltered)
library(randomForest)
rfmodel=randomForest(permeability~., data=fingerfiltered)
plot(rfmodel)
rfmodel2=randomForest(fingerfiltered, permeability, importance = TRUE, ntree=1000)
plot(rfmodel2)
imp=importance(rfmodel2)
# %IncMSE: the higher the number, the more important the predictor
# IncNodePurity: More useful variables achieve higher increases in node purities



Tomorrow, I will read Chapter 9.

11/09/2016

Chapter 8 (Computing & Exercises)

Today, I finished Computing and did exercises of Chapter 8.

Computing
# 5. Boosted Trees (gbm: gradient boosting machines)
library(gbm)
gbmmodel=gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian")
gbmmodel=gbm(y~., data=trainData, distribution = "gaussian")
# The furthest you can go is to split each node until there is only 1 observation in each terminal node.
# This would correspond to n.minobsinnode=1.
gbmgrid=expand.grid(.interaction.depth=seq(1, 7, by=2), .n.minobsinnode=10,
                    .n.trees=seq(100, 1000, by=50), .shrinkage=c(0.01, 0.1))
set.seed(100)
gbmtune=train(solTrainXtrans, solTrainY, method="gbm", tuneGrid = gbmgrid, verbose=FALSE)
gbmtune
system.time(gbmtune)

# 6. Cubist
library(Cubist)
# an argument, committees, fits multiple models
cubistmodel=cubist(solTrainXtrans, solTrainY, committees = 5)
cubistmodel
# an argument, neighbors,can take on a single integer value (0-9) to
# adjust the rule-based predictions from the training set
cubistpred=predict(cubistmodel, solTestXtrans)
summary(cubistpred)
head(cubistpred)
# the train function in the caret package can tune the model over values of
# committees and neighbors through resampling
cubisttuned=train(solTrainXtrans, solTrainY, method="cubist")
cubisttuned

Exercises
library(mlbench)
set.seed(200)
simulated=mlbench.friedman1(200, sd=1)
simulated=cbind(simulated$x, simulated$y)
simulated=as.data.frame(simulated)
colnames(simulated)[ncol(simulated)]="y"
head(simulated)
library(randomForest)
library(caret)
model1=randomForest(y~., data=simulated, importance=TRUE, ntree=1000)
rfimp1=varImp(model1, scale=FALSE)
rfimp1
simulated$duplicate1=simulated$V1+rnorm(200)*0.1
cor(simulated$duplicate1, simulated$V1)
model2=randomForest(y~., data=simulated, importance=TRUE, ntree=1000)
rfimp2=varImp(model2, scale=FALSE)
rfimp2
library(party)
model3ctr=cforest_control(mtry=ncol(simulated)-1)
model3tree=cforest(y~., data=simulated, controls = model3ctr)
model3tree
cfimp=varimp(model3tree)
cfimp

Tomorrow, I will continue to do exercises of Chapter 8.

11/08/2016

Chapter 8 Computing

Today, I did Computing of Chapter 8.

Summary
# the R packages used in this section are caret, Cubist, gbm, ipred, party, partykit,
# randomForest, rpart, RWeka

# 1. Single Trees

# formula method
library(rpart)
rparttree=rpart(y~., data=trainData)
library(party)
ctreetree=ctree(y~., data=trainData)

library(caret)
set.seed(100)
rparttune1=train(solTrainXtrans, solTrainY, method="rpart2", tuneLength=12, trControl=trainControl(method="cv"))
rparttune1
set.seed(100)
rparttune2=train(solTrainXtrans, solTrainY, method="rpart", tuneLength=10, trControl=trainControl(method="cv"))
rparttune2
?rpart.control
?ctree_control
plot(rparttune1)
plot(rparttune2)

library(partykit)
# convert the rpart object to a party object
rparttree2=as.party(rparttree)
plot(rparttree2)

# 2. Model Trees
library(RWeka)

# formula method
m5tree=M5P(y~., data=trainData)
m5rules=M5Rules(y~., data=trainData)

set.seed(100)
m5tune=train(solTrainXtrans, solTrainY, method="M5", trControl = trainControl(method = "cv"),
# M=10 sets the minimum number of samples needed to further split the data to 10
             control= Weka_control(M=10))
plot(m5tune)

# 3. Bagged Trees
library(ipred)
# bagging uses the formula interface and ipredbagg has the non-formula interface
baggedtree=ipredbagg(solTrainY, solTrainXtrans)
baggedtree=bagging(y~., data = trainData)
baggedtree2=as.party(baggedtree)

# mtry is equal to the number of predictors
bagctr1=cforest_control(mtry=ncol(trainData)-1)
baggedtree=cforest(y~., data=trainData, controls = bagctr1)

# 4. Random Forest
library(randomForest)
rfmodel=randomForest(solTrainXtrans, solTrainY)
rfmodel=randomForest(y~., data=trainData)
plot(rfmodel)
# the default for mtry in regression is the number of predictors divided by 3
rfmodel2=randomForest(solTrainXtrans, solTrainY, importance = TRUE, ntree=1000)
plot(rfmodel2)
importance(rfmodel2)


Tomorrow, I will continue to do computing of Chapter 8.

11/07/2016

Chapter 8 (8.5 8.6 8.7)

Today, I read Chapter 8 (8.5 8.6 8.7) of the book.

Summary
8.5 Random Forests
If we start with a sufficiently large number of original samples and a relationship between predictors and response that can be adequately modeled by a tree, then trees from different bootstrap samples may have similar structures to each other (especially at the top of the trees) due to the underlying relationship.
From a statistical perspective, reducing correlation among predictors can be done by adding randomness to the tree construction process.
After carefully evaluating these generalizations to the original bagging algorithm, Breiman constructed a unified algorithm called random forests.
Every model in the ensemble is then used to generate a prediction for a new sample and these m predictions are averaged to give the forest's prediction.
Random forests’ tuning parameter is the number of randomly selected predictors, k, to choose from at each split. The practitioner must also specify the number of trees for the forest.
Compared to bagging, random forests is more computationally efficient on a tree-by-tree basis since the tree building process only needs to evaluate a fraction of the original predictors at each split, although more trees are usually required by random forests.
The ensemble nature of random forests makes it impossible to gain an understanding of the relationship between the predictors and the response.
Strobl et al. (2007) developed an alternative approach for calculating importance in random forest models that takes between-predictor correlations into account.
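
The varimp call I used in the Chapter 8 exercises (11/09 entry) relies on the default importance; a small sketch of the conditional version described by Strobl et al., assuming a fitted cforest object such as model3tree from that code:

library(party)
# conditional = TRUE uses the permutation scheme of Strobl et al. that
# accounts for correlations between predictors; it is much slower than the default
condimp=varimp(model3tree, conditional=TRUE)
sort(condimp, decreasing=TRUE)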

8.6 Boosting
AdaBoost algorithm

Gradient boosting machines
The basic principles of gradient boosting are as follows: given a loss function (e.g., squared error for regression) and a weak learner (e.g., regression trees), the algorithm seeks to find an additive model that minimizes the loss function.
When regression trees are used as the base learner, simple gradient boosting for regression has two tuning parameters: tree depth and number of iterations.
In random forests, all trees are created independently, each tree is created to have maximum depth, and each tree contributes equally to the final model. The trees in boosting, however, are dependent on past trees, have minimum depth, and contribute unequally to the final model.
A regularization strategy can be injected into the final line of the loop: instead of adding the predicted value for a sample to the previous iteration's predicted value, only a fraction of the current predicted value is added. This fraction is commonly referred to as the learning rate and is parameterized by the symbol λ. It can take values between 0 and 1 and becomes another tuning parameter for the model.
Friedman updated the boosting machine algorithm with a random sampling scheme and termed the new procedure stochastic gradient boosting.
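
A hedged sketch of where the learning rate and the random sampling appear in the gbm interface, reusing the solubility data objects from the computing sections; shrinkage plays the role of λ and bag.fraction < 1 gives stochastic gradient boosting:

library(gbm)
# shrinkage is the learning rate (lambda); bag.fraction = 0.5 fits each tree to a
# random half of the training set, i.e. stochastic gradient boosting
gbmsgb=gbm.fit(solTrainXtrans, solTrainY,
               distribution="gaussian",
               n.trees=500,
               interaction.depth=3,
               shrinkage=0.01,
               bag.fraction=0.5)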

8.7 Cubist
Some specific differences between Cubist and the previously described approaches for model trees and their rule-based variants are:
• The specific techniques used for linear model smoothing, creating rules, and pruning are different
• An optional boosting-like procedure called committees
• The predictions generated by the model rules can be adjusted using nearby points from the training set data
\hat{y}_{par} = a \times \hat{y}_k + (1 - a) \times \hat{y}_p
where \hat{y}_k is the prediction from the current model and \hat{y}_p is from the parent model above it in the tree.
If the covariance is large, this implies that the residuals generally have the same sign and relative magnitude, while a value near 0 would indicate no (linear) relationship between the errors of the two models.
Let e_k be the collection of residuals of the child model and e_p be similar values for the parent model.
The smoothing coefficient used by Cubist is:
a = \frac{Var[e_p] - Cov[e_k, e_p]}{Var[e_p - e_k]}
Var[e_p] is proportional to the parent model's RMSE.
The mth committee model uses an adjusted response
y^*_{(m)} = y - (\hat{y}_{(m-1)} - y)
To tune this model, different numbers of committees and neighbors were assessed.
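
The Chapter 8 computing code (11/09 entry) tunes committees and neighbors through train; a small sketch of adjusting an already fitted Cubist model with nearby training points at prediction time, reusing cubistmodel and solTestXtrans from that code:

library(Cubist)
# neighbors (0-9) adjusts the rule-based predictions using the closest
# training set points; 0 means no instance-based correction
prednoadjust=predict(cubistmodel, solTestXtrans, neighbors=0)
predadjusted=predict(cubistmodel, solTestXtrans, neighbors=5)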

Tomorrow, I will start computing on Chapter 8.

11/05/2016

Chapter 8 (8.3 8.4)

Today, I read Chapter 8 (8.3 8.4).

Summary
8.3 Rule-Based Models
The complexity of the model tree can be further reduced by either removing entire rules or removing some of the conditions that define the rule.
In Figure 8.12, pruning has a large effect on the model, while smoothing mainly has an impact on the unpruned models.
The number of terms in the linear models decreases as more rules are created. This makes sense because there are fewer data points to construct deep trees.

8.4 Bagged Trees
Bagging, short for bootstrap aggregation, is a general approach that uses bootstrapping in conjunction with any regression model to construct an ensemble.
Advantages:
1.     Reduce the variance and be more stable (average)
2.     Provide their own internal estimate of predictive performance that correlates well with either cross-validation estimates or test set estimates (out-of-bag samples)
Most of the improvement in predictive performance is obtained by aggregating across roughly ten bootstrap replications.
Caveats:
1.     Computational costs and memory requirements increase as the number of bootstrap samples increases. (parallel computing)

2.     A bagged model is less interpretable than a model that is not bagged. (variable importance)
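
A hedged sketch of the internal out-of-bag estimate mentioned in advantage 2 above, assuming the generic trainData object used in the computing sections; I believe coob = TRUE is the ipred argument that requests it:

library(ipred)
# coob = TRUE computes the out-of-bag error estimate from the samples
# left out of each bootstrap replicate
baggedtree=bagging(y~., data=trainData, nbagg=25, coob=TRUE)
baggedtree$err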

Next week, I will continue to read Chapter 8.

11/02/2016

Chapter 8 (8.1 8.2)

Today, I read Chapter 8 (8.1 8.2).

Summary
Once the tree has been finalized, we begin to assess the relative importance of the predictors to the outcome.
If SSE is the optimization criteria, then the reduction in the SSE for the training set is aggregated for each predictor.
The model tends to rely more on continuous predictors than the binary ones.
Unbiased Regression Tree Techniques:
GUIDE: generalized, unbiased, interaction detection and estimation algorithm (decouple the process of selecting the split variable and the split value)
Conditional Inference Trees: statistical hypothesis tests are used to do an exhaustive search across the predictors and their possible split points.
For a candidate split, a statistical test is used to evaluate the difference between the means of the two groups created by the split and a p-value can be computed for the test.
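
A small sketch, again assuming the generic trainData object, of how the p-value threshold for these hypothesis tests is exposed in party's ctree (mincriterion is one minus the required p-value):

library(party)
# require p < 0.01 for a split; larger mincriterion values make splitting
# harder and therefore give smaller trees
ctreestrict=ctree(y~., data=trainData,
                  controls=ctree_control(mincriterion=0.99))
plot(ctreestrict)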

Regression Model Trees
M5
The splitting criterion is different.
The terminal nodes predict the outcome using a linear model (as opposed to the single average).
When a sample is predicted, it is often a combination of the predictions from different models along the same path through the tree.
The main implementation of this technique is a “rational reconstruction” of this model called M5.
Split criterion:
reduction = SD(S) - \sum_{i=1}^{P} \frac{n_i}{n} \times SD(S_i)

n_i is the number of samples in partition i.
The split that is associated with the largest reduction in error is chosen and a linear model is created within the partitions using the split variable in the model.
Once the complete set of linear models has been created, each undergoes a simplification procedure to potentially drop some of the terms.
Adjusted Error Rate = \frac{n^* + p}{n^* - p} \sum_{i=1}^{n^*} \left| y_i - \hat{y}_i \right|

n^* is the number of training set data points that were used to build the model and p is the number of parameters.
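
A tiny R transcription of this adjusted error rate (my own helper, not code from the book):

# adjusted error rate for a candidate linear model in M5:
# observed and predicted are training set values, p is the number of parameters
adjustedErrorRate=function(observed, predicted, p)
{
  nstar=length(observed)
  (nstar + p) / (nstar - p) * sum(abs(observed - predicted))
}
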
Model trees also incorporate a type of smoothing to decrease the potential for over-fitting. The technique is based on the “recursive shrinking” methodology.
The two predictions are combined using
\hat{y}_{(p)} = \frac{n_{(k)} \hat{y}_{(k)} + c \, \hat{y}_{(p)}}{n_{(k)} + c}
(the equation on page 185 may be wrong)

\hat{y}_{(k)} is the prediction from the child node, n_{(k)} is the number of training set data points in the child node, \hat{y}_{(p)} is the prediction from the parent node, and c is a constant with a default value of 15.
Pruning & Smoothing
Smoothing the models has the effect of minimizing the collinearity issues.
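
A hedged sketch of turning pruning and smoothing off in RWeka's M5P to see their effect, assuming the generic trainData object; I believe N and U are the Weka flags for an unpruned tree and unsmoothed predictions (WOW("M5P") lists the options):

library(RWeka)
# default M5P: pruned tree with smoothed predictions
m5default=M5P(y~., data=trainData)
# N = TRUE builds an unpruned tree, U = TRUE turns off smoothing
m5raw=M5P(y~., data=trainData, control=Weka_control(N=TRUE, U=TRUE))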

Tomorrow, I will continue to read Chapter 8.

11/01/2016

Chapter 7 exercises

Today, I finished exercises in Chapter 7 and started reading Chapter 8.

Exercises in Chapter 7
#7.1
set.seed(100)
x=runif(100, min=2, max=10)
y=sin(x)+rnorm(length(x))*0.25
sindata=data.frame(x=x, y=y)
plot(x, y)
datagrid=data.frame(x=seq(2, 10, length=100))
library(kernlab)
rbfsvm=ksvm(x=x, y=y, data=sindata, kernel="rbfdot", kpar="automatic", C=1, epsilon=0.1)
#C: cost of constraints violation; epsilon: epsilon in the insensitive-loss function
#kpar=list(sigma=n), when n gets smaller, the line gets flatter. It can be kpar="automatic".
modelprediction=predict(rbfsvm, newdata=datagrid)
points(x=datagrid$x, y=modelprediction[,1], type="l", col="blue")
#type="p, l, b, c, o, s, h, n"

#7.2
library(mlbench)
set.seed(200)
#inputs are 10 independent variables uniformly distributed on the interval [0,1],
#only 5 out of 10 are actually used in the formula
trainingdata2=mlbench.friedman1(200, sd=1)
#convert 'x' data from a matrix to a data frame
trainingdata2$x=data.frame(trainingdata2$x)
featurePlot(trainingdata2$x, trainingdata2$y)
library(caret)
knnmodel=train(trainingdata2$x, trainingdata2$y, method = "knn", preProcess = c("center", "scale"), tuneLength = 10)
knnmodel
knnpred=predict(knnmodel, newdata=trainingdata2$x)
postResample(pred = knnpred, obs = trainingdata2$y)

#7.3
# build SVM, neural network, MARS and KNN models

# 7.4
# if nonlinear models outperform the optimal linear model, this suggests a nonlinear relationship between
# the predictors and the outcome; if there is no significant improvement, the linear model is recommended

#7.5
# the top important predictors of the optimal linear model and the optimal nonlinear model are different;
# top predictors that are unique to the optimal nonlinear model most likely have a nonlinear
# relationship with the outcome

Chapter 8
Regression Trees and Rule-Based Models
Weaknesses: 1. Model instability; 2. Less-than-optimal predictive performance
If the relationship between predictors and the response cannot be adequately defined by rectangular subspaces of the predictors, then tree-based or rule-based models will have larger prediction error than other kinds of models.

8.1
Basic Regression Trees
The predictor to split on and value of the split
The depth or complexity of the tree
The prediction equation in the terminal nodes
CART: classification and regression tree methodology
The SSE criterion is used to find the optimal split point for every predictor:
SSE = \sum_{i \in S_1} (y_i - \bar{y}_1)^2 + \sum_{i \in S_2} (y_i - \bar{y}_2)^2
Cost-complexity tuning
SSE_{c_p} = SSE + c_p \times (\text{\# terminal nodes})
c_p is the complexity parameter.
Smaller penalties tend to produce more complex models, which result in larger trees.
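
A small sketch of this in rpart, assuming the generic trainData object from the computing sections; the cp argument is the complexity penalty, so a smaller cp allows a larger tree:

library(rpart)
# a small cp penalty allows a deeper, more complex tree;
# a larger cp prunes more aggressively
deeptree=rpart(y~., data=trainData, control=rpart.control(cp=0.001))
shallowtree=rpart(y~., data=trainData, control=rpart.control(cp=0.05))
# printcp shows the cross-validated error for each value of cp
printcp(deeptree)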

Tomorrow, I will continue to read Chapter 8.