Summary
Model building:
Pre-processing the predictor data
Estimating model parameters
Selecting predictors for the model
Evaluating model performance
Fine tuning class prediction rules
Resampling techniques often produce performance estimates superior to those from a single test set.
Nonrandom approaches
Random approaches
Simple random sampling
Stratified random sampling
Maximum dissimilarity sampling
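As a concrete illustration of one of the random approaches above, here is a minimal sketch of stratified random sampling in plain Python. The function name `stratified_split` and its parameters are my own invention for illustration; the idea is simply that each class contributes the same proportion of samples to the training set.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac, seed=0):
    """Stratified random sampling: split sample indices so each class
    keeps roughly the same proportion in the training set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(round(train_frac * len(idxs)))
        train.extend(idxs[:cut])
        test.extend(idxs[cut:])
    return sorted(train), sorted(test)

labels = ["a"] * 8 + ["b"] * 2
train, test = stratified_split(labels, train_frac=0.5)
# half of each class lands in the training set: 4 "a" samples and 1 "b" sample
```

With a simple random split of this small, imbalanced set, the minority class could easily end up entirely in one partition; stratification guards against that.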
Resampling techniques
k-fold cross-validation
The samples are randomly partitioned into k sets (folds) of roughly equal size. A model is fit using all samples except those in the first fold, and the held-out fold is used to estimate performance; the process is repeated with each fold held out in turn, and the k estimates are averaged.
As k gets larger, the difference in size between the training set and the resampling subsets gets smaller, so the bias of the estimate decreases.
Repeating k-fold cross-validation can effectively increase the precision of the estimates while still maintaining a small bias.
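The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation; the `fit` and `score` callables are hypothetical placeholders for whatever model and metric are in use.

```python
import random

def kfold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k folds of roughly equal size."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, fit, score, k=10):
    """Fit on k-1 folds, evaluate on the held-out fold, average the k scores."""
    folds = kfold_indices(len(X), k)
    scores = []
    for held_out in folds:
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(score(model, [X[i] for i in held_out],
                            [y[i] for i in held_out]))
    return sum(scores) / k
```

Repeated k-fold cross-validation would simply call `cross_validate` several times with different shuffle seeds and average the results.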
Generalized cross-validation
GCV = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i-\hat{y}_i}{1-df/n}\right)^2
df is the model's degrees of freedom: an accounting of how many parameters are estimated by the model (a measure of complexity for linear regression models).
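The statistic is straightforward to compute from fitted values. A minimal sketch, assuming the fitted values and degrees of freedom are already available (the function name `gcv` is my own):

```python
def gcv(y, y_hat, df):
    """Generalized cross-validation: an approximation to leave-one-out
    error computed in a single pass, penalizing complexity via df."""
    n = len(y)
    return sum(((yi - fi) / (1 - df / n)) ** 2
               for yi, fi in zip(y, y_hat)) / n
```

For fixed residuals, a larger `df` shrinks the denominator and inflates GCV, so more complex models must earn their extra parameters with smaller residuals.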
Repeated Training/Test Splits
The precision of the estimates is a function of the proportion of samples randomly allocated to the prediction set; the larger that proportion, the more repetitions are needed to reduce the uncertainty in the performance estimates.
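This scheme (sometimes called Monte Carlo cross-validation) can be sketched as repeated random shuffles; the function name and parameters here are illustrative, not from the book.

```python
import random

def repeated_splits(n, test_frac, repeats, seed=0):
    """Repeated training/test splits: each repetition randomly allocates
    test_frac of the samples to the prediction (test) set."""
    rng = random.Random(seed)
    splits = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        cut = int(round(test_frac * n))
        splits.append((sorted(idx[cut:]), sorted(idx[:cut])))  # (train, test)
    return splits
```

Unlike k-fold cross-validation, the same sample can appear in the prediction set of many repetitions, and the number of repetitions is not tied to the split proportion.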
The Bootstrap
The bias will decrease as the training set sample size becomes larger.
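A minimal sketch of bootstrap resampling: each training set is drawn with replacement and has the same size as the original data, and the samples left out ("out-of-bag") form the test set. The function name is my own for illustration.

```python
import random

def bootstrap_splits(n, repeats, seed=0):
    """The bootstrap: draw n samples with replacement for training;
    the out-of-bag samples (those never drawn) form the test set."""
    rng = random.Random(seed)
    splits = []
    for _ in range(repeats):
        train = [rng.randrange(n) for _ in range(n)]
        oob = sorted(set(range(n)) - set(train))
        splits.append((train, oob))
    return splits
```

On average roughly 63.2% of the distinct samples appear in each bootstrap training set, which is why the resulting performance estimates carry bias similar to roughly 2-fold cross-validation on small data sets.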
Tomorrow, I will continue Chapter 4.