10/31/2016

Chapter 7 Computing

Today, I finished reading the first four parts of Chapter 7 and completed the computing part of it.

Computing:
library(caret)
library(earth)
library(kernlab)
library(nnet)
#R has a number of packages and functions for creating neural networks such as nnet, neural and RSNNS.
#1. neural networks
nnetfit=nnet(predictors, outcome, size=5, decay=0.01, linout = TRUE, trace = FALSE,
             maxit = 500, MaxNWts = 5*(ncol(predictors)+1)+5+1)
#trace=FALSE: reduce the amount of printed output
#maxit: maximum number of iterations used to find the parameter estimates
#MaxNWts: maximum allowable number of weights (parameters)
#this creates a single model with 5 hidden units assuming that the data in predictors have been
#standardized to be on the same scale
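#avNNet (from the caret package) fits the same network several times from different random starting values and averages the resulting predictions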
nnetavg=avNNet(predictors, outcome, size=5, decay=0.01, repeats = 5, linout = TRUE,
               trace = FALSE, maxit=500, MaxNWts = 5*(ncol(predictors)+1)+5+1)
predict(nnetfit, newdata)
predict(nnetavg, newdata)
toohigh=findCorrelation(cor(solTrainXtrans), cutoff = 0.75)
trainxnnet=solTrainXtrans[,-toohigh]
testxnnet=solTestXtrans[,-toohigh]
#define the candidate models to test
nnetgrid=expand.grid(.decay=c(0, 0.01, 0.1), .size=c(1:10), .bag=FALSE)
set.seed(100)
#tune over weight decay and the number of hidden units, using the filtered predictor set created above
nnettune=train(trainxnnet, solTrainY, method="avNNet", tuneGrid = nnetgrid, trControl = ctr1,
               preProcess = c("center", "scale"), linout = TRUE, trace = FALSE,
               MaxNWts = 10*(ncol(trainxnnet)+1)+10+1, maxit=500)
#2. multivariate adaptive regression splines
marsfit=earth(solTrainXtrans, solTrainY)
marsfit
summary(marsfit)
marsgrid=expand.grid(.degree=1:2, .nprune=2:38)
set.seed(100)
marstuned=train(solTrainXtrans, solTrainY, method="earth", tuneGrid = marsgrid,
                trControl = trainControl(method = "cv"))
marstuned
marspredict=predict(marstuned, solTestXtrans)
head(marspredict)
#estimate the importance of each predictor in the MARS model
varImp(marstuned)
#3. Support Vector Machines
#in the kernlab package, the ksvm function can fit regression models with a large number of kernel functions
#the radial basis function is the default kernel function
library(kernlab)
#if appropriate values of the cost and kernel parameters are known, the model can be fit as
#(kpar="automatic" estimates the kernel parameter, so the radial basis kernel is used here):
svmfit=ksvm(x=solTrainXtrans, y=solTrainY, kernel="rbfdot", kpar="automatic", C=1, epsilon=0.1)
#if the values are unknown, they can be estimated through resampling.
#in train, "svmRadial", "svmLinear" or "svmPoly" fit different kernels
svmrtuned=train(solTrainXtrans, solTrainY, method = "svmRadial", preProcess = c("center", "scale"),
                tuneLength = 14, trControl = trainControl(method = "cv"))
svmrtuned
svmrtuned$finalModel
#4. k-nearest neighbors
knndescr=solTrainXtrans[, -nearZeroVar(solTrainXtrans)]
set.seed(100)
knntune=train(knndescr, solTrainY, method = "knn", preProcess = c("center", "scale"),
              tuneGrid = data.frame(.k=1:20), trControl = trainControl(method = "cv"))
knntune

Tomorrow, I will do exercises on Chapter 7.

10/29/2016

Finding the code

Today, I finally found the code for extracting the rows that remain in a matrix after removing a sampled submatrix.

#biological predictors can be used to assess the quality of the raw material before processing.
#manufacturing process predictors can be changed in the manufacturing process.
library(AppliedPredictiveModeling)
library(caret)   #createDataPartition comes from caret
data(ChemicalManufacturingProcess)
ChemicalManufacturingProcess
#12 biological predictors, 45 process predictors, 176 manufacturing runs
head(ChemicalManufacturingProcess)
set.seed(1)
trainrows=createDataPartition(ChemicalManufacturingProcess[ ,1], p=0.8, list = FALSE)
trainrows
trainpredictors=ChemicalManufacturingProcess[trainrows, ]
dim(trainpredictors)
testpredictors=ChemicalManufacturingProcess[-trainrows, ]
dim(testpredictors)

Indexing with -trainrows removes the sampled rows from the full matrix and returns the remaining rows as a new matrix, so both the training data matrix and the testing data matrix can be obtained from the original data matrix.

Next week, I will continue to read Chapter 7.

10/27/2016

Chapter 7 (7.1 7.2)

Today, I read Chapter 7 (7.1 7.2).

Summary
As the regularization value (λ) increases, the fitted model becomes smoother and less likely to over-fit the training set. Reasonable values of λ range between 0 and 0.1. Since the regression coefficients are being summed, they should be on the same scale; hence the predictors should be centered and scaled prior to modeling.
There are many other kinds of neural networks, such as models with more than one layer of hidden units (i.e., a layer of hidden units that models the other hidden units). Other architectures have loops going in both directions between layers.
A model similar to neural networks is the self-organizing map. This model can be used as an unsupervised, exploratory technique or in a supervised fashion for prediction.
The resulting parameter estimates are unlikely to be the globally optimal estimates. As an alternative, several models can be created using different starting values, and the results of these models can be averaged to produce a more stable prediction.

Multivariate Adaptive Regression Splines (MARS)
Once the full set of features has been created, the algorithm sequentially removes individual features that do not contribute significantly to the model equation.
GCV: generalized cross-validation
There are two tuning parameters associated with the MARS model: the degree of the features that are added to the model and the number of retained terms. The latter parameter can be automatically determined using the default pruning procedure (using GCV), set by the user or determined using an external resampling technique.
Since the GCV estimate does not reflect the uncertainty from feature selection, it suffers from selection bias.
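For reference, a common form of the GCV statistic (written from my notes, so the penalty may be stated slightly differently in the book) is

GCV = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{f}(x_i)\right)^2}{\left(1 - M/n\right)^2},

where M is the effective number of parameters in the MARS model.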
Two advantages of MARS:

1. The model automatically conducts feature selection; 2. The model equation is interpretable.

Tomorrow, I will continue to read Chapter 7.

10/26/2016

Chapter 7

Today, I finished exercises of Chapter 6 and read Chapter 7.

exercise 6.3
#biological predictors can be used to assess the quality of the raw material before processing.
#manufacturing process predictors can be changed in the manufacturing process.
library(AppliedPredictiveModeling)
library(caret)   #createDataPartition comes from caret
data(ChemicalManufacturingProcess)
ChemicalManufacturingProcess
#12 biological predictors, 45 process predictors, 176 manufacturing runs
head(ChemicalManufacturingProcess)
set.seed(1)
trainrows=createDataPartition(ChemicalManufacturingProcess[ ,1], p=0.8, list = FALSE)
trainrows
trainpredictors=ChemicalManufacturingProcess[trainrows, ]
dim(trainpredictors)
testpredictors=ChemicalManufacturingProcess[-trainrows, ]
dim(testpredictors)

Summary of Chapter 7
Nonlinear Regression Models
As in PLS, the outcome is modeled by an intermediary set of unobserved variables (hidden variables or hidden units). These hidden units are linear combinations of the original predictors, but unlike PLS models, they are not estimated in a hierarchical fashion.
The linear combination is typically transformed by a nonlinear function g, such as the logistic (i.e., sigmoidal) function:
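From my notes, each hidden unit is a nonlinear transformation of a linear combination of the predictors, and the prediction is a linear combination of the H hidden units (P is the number of predictors):

h_k(x) = g\!\left(\beta_{0k} + \sum_{j=1}^{P} x_j \beta_{jk}\right), \qquad g(u) = \frac{1}{1 + e^{-u}}

f(x) = \gamma_0 + \sum_{k=1}^{H} \gamma_k h_k(x)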
The total number of parameters is H(P+1)+H+1.
The back-propagation algorithm is a highly efficient methodology that works with derivatives to find the optimal parameters. However, the solution found is often not the global optimum.
Weight decay (a penalization approach that moderates over-fitting) minimizes an alternative version of the sum of the squared errors:
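From my notes, the penalized objective should look roughly like this, where \lambda is the weight decay value:

\sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2 + \lambda \sum_{k=1}^{H}\sum_{j=0}^{P} \beta_{jk}^2 + \lambda \sum_{k=0}^{H} \gamma_k^2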


Tomorrow, I will continue to read Chapter 7.

10/25/2016

Exercise 6 (6.1 6.2)

Today, I did exercises of Chapter 6 (6.1 6.2).

6.1
#the theory of IR spectroscopy holds that unique molecular structures absorb IR frequencies differently.
library(caret)
data(tecator)
absorp
#absorbance values 215*100
endpoints
#percent of moisture, fat, and protein 215*3
#first way
corthresh=0.99
toohigh=findCorrelation(cor(absorp), corthresh)
corrpred=colnames(absorp)[toohigh]
absorpeffective=absorp[, -toohigh]
absorpeffective
#second way
correlations=cor(absorp)
dim(correlations)
library(corrplot)
corrplot(correlations, order="hclust")
highcorr=findCorrelation(correlations, cutoff=0.8)
length(highcorr)
absorpeffective=absorp[, -highcorr]
absorpeffective

6.2
library(AppliedPredictiveModeling)
data(permeability)
library(caret)
lowfrequencies=nearZeroVar(fingerprints, freqCut = 20, uniqueCut = 10)
fingerfiltered=fingerprints[, -lowfrequencies]
fingerfiltered
ncol(fingerfiltered)
colnames(fingerfiltered)
#the number of binary predictors decreases from 1107 to 413
trainfinger=fingerfiltered[sample(1:165, 132, replace=FALSE), ]
dim(trainfinger)
trainfinger
#X1=fingerfiltered[ ,1]
#X1
#trainX1=sample(X1, 132, replace = FALSE)
#trainX1
#trainfinger=fingerfiltered[trainX1, ]
#trainfinger
a=c(121, 149, 114, 40, 104, 45, 153, 26, 66, 162, 15, 128, 81, 102, 62, 127, 110, 52, 140, 95, 6, 86, 60, 11, 75,
    135, 99, 77, 34, 106, 89, 112, 87, 64, 65, 50, 59, 105, 118, 19, 94, 165, 120, 43, 48, 115, 13, 111, 41, 151, 
    116, 82, 46, 18, 38, 69, 7, 92, 23, 58, 155, 137, 132, 27, 131, 49, 93, 85, 36, 31, 79, 122, 30, 9, 145, 63, 
    143, 80, 152, 159, 134, 67, 126, 3, 70, 55, 141, 53, 101, 28, 139, 125, 109, 14, 42, 78, 147, 158, 54, 4, 117, 
    107, 150, 57, 119, 103, 156, 160, 146, 47, 29, 76, 164, 83, 73, 154, 72, 37, 124, 51, 24, 133, 22, 17, 98, 1, 
    97, 157, 148, 123, 21, 130)
testfinger=fingerfiltered[-a, ]
testfinger

(I do not know how to delete a randomly sampled submatrix from a matrix and obtain the remaining rows as another matrix. I spent hours searching online but could not find a solution. I will keep trying to solve it because it is important to my research.)

Tomorrow, I will try to figure out the problem and go on doing exercises of Chapter 6.

10/24/2016

Computing part of Chapter 6

Today, I continued working on the Computing part of Chapter 6.

Code (ordinary linear regression):
corthresh=0.9
toohigh=findCorrelation(cor(solTrainXtrans), corthresh)
corrpred=names(solTrainXtrans)[toohigh]
trainxfiltered=solTrainXtrans[, -toohigh]
testxfiltered=solTestXtrans[, -toohigh]
set.seed(100)
lmfiltered=train(x=trainxfiltered, y=solTrainY, method = "lm", trControl = ctr1)
lmfiltered
#solTrainY: 951*1; solTrainXtrans: 951*228; solTestXtrans: 316*228;
#trainxfiltered: 951*190; testxfiltered: 316*190
#the code on p. 132 of Applied Predictive Modeling may be wrong
set.seed(100)
rlmpca=train(solTrainXtrans, solTrainY, method = "rlm", preProcess = "pca", trControl = ctr1)
#this step may take a while to run
rlmpca

Code (partial least squares):
library(pls)
plsfit=plsr(solubility~., data=trainingdata)
predict(plsfit, solTestXtrans[1:5,], ncomp = 1:2)
set.seed(100)
plstune=train(solTrainXtrans, solTrainY, method = "pls", tuneLength = 20, trControl = ctr1, preProcess = c("center", "scale"))
plstune

Code (penalized regression models):
library(elasticnet)
#create ridge-regression models
#the lambda argument specifies the ridge-regression penalty; as lambda decreases to 0, we approach the ordinary least squares solution
ridgemodel=enet(x=as.matrix(solTrainXtrans), y=solTrainY, lambda = 0.001)
ridgemodel
ridgepred=predict(ridgemodel, newx = as.matrix(solTestXtrans), s=1, mode="fraction", type = "fit")
ridgepred
head(ridgepred$fit)
#define the candidate set of values
ridgegrid=data.frame(.lambda=seq(0, 0.1, length=15))
set.seed(100)
ridgeregfit=train(solTrainXtrans, solTrainY, method="ridge", tuneGrid=ridgegrid, trControl=ctr1, preProcess=c("center", "scale"))
ridgeregfit
enetmodel=enet(x=as.matrix(solTrainXtrans), y=solTrainY, lambda = 0.01, normalize = TRUE)
#normalize is for centering and scaling predictors prior to modeling
#lambda controls the ridge-regression penalty; setting this value to 0 fits the lasso model
enetpred=predict(enetmodel, newx = as.matrix(solTestXtrans), s=0.1, mode="fraction", type="fit")
names(enetpred)
head(enetpred$fit)
enetcoef=predict(enetmodel, newx = as.matrix(solTestXtrans), s=0.1, mode="fraction", type="coefficients")
tail(enetcoef$coefficients)
#other packages: biglars (for large data sets), FLLat (for the fused lasso), grplasso (the group lasso), penalized, relaxo (the relaxed lasso)...
enetgrid=expand.grid(.lambda=c(0, 0.01, 0.1), .fraction=seq(0.05, 1, length=20))
set.seed(100)
enetTune=train(solTrainXtrans, solTrainY, method = "enet", tuneGrid = enetgrid, trControl = ctr1, preProcess = c("center", "scale"))
enetTune
plot(enetTune)

Tomorrow, I will do exercises of Chapter 6.


10/22/2016

Chapter 6 Computing

Today, I did computing of Chapter 6.

library(AppliedPredictiveModeling)
data(solubility)
ls(pattern = "^solT")
set.seed(2)
sample(names(solTrainX), 8)
# 1. ordinary linear regression
trainingdata=solTrainXtrans
#add the solubility outcome
trainingdata$solubility=solTrainY
lmfitallpredictors=lm(solubility~., data=trainingdata)
summary(lmfitallpredictors)
lmpred1=predict(lmfitallpredictors, solTestXtrans)
head(lmpred1)
nrow(solTestXtrans)
lmvalues1=data.frame(obs=solTestY, pred=lmpred1)
library(caret)
defaultSummary(lmvalues1)
library(MASS)
rlmfitallpredictors=rlm(solubility~., data=trainingdata)
summary(rlmfitallpredictors)
rlmpred1=predict(rlmfitallpredictors, solTestXtrans)
head(rlmpred1)
rlmvalues1=data.frame(obs=solTestY, pred=rlmpred1)
defaultSummary(rlmvalues1)
#generate a resampling estimate of performance
ctr1=trainControl(method = "cv", number = 10)
set.seed(100)
lmfit1=train(x=solTrainXtrans, y=solTrainY, method = "lm", trControl = ctr1)
xyplot(solTrainY~predict(lmfit1),
       type=c("p","g"), xlab = "Predicted", ylab = "Observed")
xyplot(resid(lmfit1)~predict(lmfit1),
       type=c("p","g"), xlab = "Predicted", ylab = "Residuals")
#resid: generate the model residuals; predict: return the predicted values



I will try to finish the computing section over the weekend.

10/20/2016

Chapter 6 (6.3 6.4)

Today, I finished reading 6.3 and 6.4 of Chapter 6.

Summary

PCR and PLS have similar predictive ability, but PLS achieves it with far fewer components.
The NIPALS algorithm works fairly efficiently for data sets of small-to-moderate size (< 2500 samples and < 30 predictors). When the number of samples and predictors climbs, the algorithm becomes inefficient.
Kernel approach: improve the speed of the algorithm
SIMPLS: deflates the covariance matrix between the predictors and the response (covariance is the quantity that the PLS components are chosen to maximize).
GIFI approach: split each predictor into two or more bins for those predictors that are thought to have a nonlinear relationship with the response. Cut points for the bins are selected by the user and are based on either prior knowledge or characteristics of the data.
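A minimal sketch of this kind of binning in R, using base cut() on one of the solubility predictors; the column name and the cut points are only assumptions for illustration:

library(AppliedPredictiveModeling)
data(solubility)
#split one (assumed) predictor into three bins at user-chosen cut points
molwt=solTrainX$MolWeight
molwtbinned=cut(molwt, breaks = c(-Inf, 100, 200, Inf), labels = c("low", "mid", "high"))
table(molwtbinned)
#turn the binned factor into indicator columns that could replace the original predictor
head(model.matrix(~molwtbinned-1))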
Penalized models:
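From my notes, the ridge and lasso penalties add a term to the usual SSE:

SSE_{L_2} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{P} \beta_j^2

SSE_{L_1} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{P} \left|\beta_j\right|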

A generalization of the lasso model is the elastic net:
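From my notes, this combines both types of penalties (the notation may differ slightly from the book):

SSE_{Enet} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda_1 \sum_{j=1}^{P} \beta_j^2 + \lambda_2 \sum_{j=1}^{P} \left|\beta_j\right|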

This model will more effectively deal with groups of highly correlated predictors.


Tomorrow, I will begin computing of Chapter 6.

10/17/2016

Chapter 6 (6.1 6.2 6.3)

Today, I read some parts of Chapter 6.

Summary
QSAR: quantitative structure-activity relationship modeling
NIPALS: nonlinear iterative partial least squares algorithm

The objective of ordinary least squares linear regression is to find the plane that minimizes the sum-of-squared errors (SSE) between the observed and predicted response:
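That is, with observed values y_i and predicted values \hat{y}_i,

SSE = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2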
On many occasions, relationships among predictors can be complex and involve many predictors. In these cases, manual removal of specific predictors may not be possible and models that can tolerate collinearity may be more useful.


Box-Cox transformation can be applied to the continuous predictors in order to remove skewness.
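A minimal sketch of how this could be done with caret's preProcess, using the solubility predictors from the Chapter 6 computing entry:

library(caret)
library(AppliedPredictiveModeling)
data(solubility)
#estimate Box-Cox transformations (plus centering and scaling) from the training predictors
pp=preProcess(solTrainX, method = c("BoxCox", "center", "scale"))
#apply the same transformations to the training and test sets
trainxbc=predict(pp, solTrainX)
testxbc=predict(pp, solTestX)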
If the correlation among predictors is high, then the ordinary least squares solution for multiple linear regression will have high variability and will become unstable.
If the number of predictors is greater than the number of observations, ordinary least squares in its usual form will be unable to find a unique set of regression coefficients that minimize the SSE.
Pre-processing predictors via PCA (dimension reduction) prior to performing regression is known as principal component regression (PCR). It has been widely applied in the context of problems with inherently highly correlated predictors or problems with more predictors than observations.
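A minimal sketch of PCR itself, assuming the pcr function in the pls package and the trainingdata data frame built in the Chapter 6 computing entry:

library(pls)
#principal component regression with built-in cross-validation over the number of components
pcrfit=pcr(solubility~., data=trainingdata, scale=TRUE, validation="CV")
validationplot(pcrfit, val.type="RMSEP")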
The author recommends using PLS when there are correlated predictors and a linear regression-type solution is desired.
PCA vs. PLS:
While PCA linear combinations are chosen to maximally summarize predictor space variability, the PLS linear combinations of predictors are chosen to maximally summarize covariance with the response.
Prior to performing PLS, the predictors should be centered and scaled, especially if the predictors are on scales of differing magnitude.
PLS has one tuning parameter: the number of components to retain. Resampling techniques can be used to determine the optimal number of components.
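A minimal sketch of choosing the number of components with the pls package directly (an alternative to the caret train call in the later computing entry), again assuming trainingdata:

library(pls)
plscv=plsr(solubility~., data=trainingdata, scale=TRUE, validation="CV")
#cross-validated RMSEP by number of components; pick the smallest model near the minimum
RMSEP(plscv)
validationplot(plscv, val.type="RMSEP")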

Tomorrow, I will continue to read Chapter 6.

10/14/2016

Chapter 5 Computing

Today, I finished the Computing section of Chapter 5.

observed=c(0.22, 0.83, -0.12, 0.89, -0.23, -1.30, -0.15, -1.4, 0.62, 0.99, -0.18, 0.32, 0.34,
           -0.3, 0.04, -0.87, 0.55, -1.3, -1.15, 0.2)
predicted=c(0.24, 0.78, -0.66, 0.53, 0.7, -0.75, -0.41, -0.43, 0.49, 0.79, -1.19, 0.06, 0.75,
           -0.07, 0.43, -0.42, -0.25, -0.64, -1.26, -0.07)
#in practice, the vector of predictions would be produced by the model function
residualvalues=observed-predicted
residualvalues
summary(residualvalues)
axisrange=extendrange(c(observed, predicted))
plot(observed, predicted, ylim=axisrange, xlim=axisrange)
abline(0, 1, col="darkgrey", lty=2)
plot(predicted, residualvalues, ylab = "residual")
abline(h=0, col="darkgrey", lty=2)
library(caret)
R2(predicted, observed)
RMSE(predicted, observed)
cor(predicted, observed)
cor(predicted, observed, method = "spearman")


Next week, I will read on Chapter 6.