9/29/2016

Exercises

Today, I read the book “Applied Predictive Modeling” and tried to work through all of the exercises in Chapter 3. There are still some problems: I cannot install some of the packages used in the book, so I could not finish the following tasks.

Summary

Creating Dummy Variables
1.
head(carSubset)
levels(carSubset$Type)
# dummy variables for Mileage and Type; the formula needs a ~ on the left of the predictors
simpleMod = dummyVars(~Mileage + Type, data = carSubset, levelsOnly = TRUE)
simpleMod
predict(simpleMod, head(carSubset))

Exercises
1.
> library(mlbench)
> data(Glass)
> Glass
> library(e1071)
> type=grep("Type", names(Glass))
> Glassvalue=Glass[, -type]
> Glassvalue
> skewvalues=apply(Glassvalue, 2, skewness)
> skewvalues
        RI         Na         Mg         Al         Si          K         Ca 
 1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
        Ba         Fe 
 3.3686800  1.7298107
> correlations=cor(Glassvalue)
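If the missing packages eventually install, one possible way to continue the exercise would be to look at the correlation structure and try a transformation that tolerates the zero values in predictors such as Ba and Fe (the 0.75 cutoff and the Yeo-Johnson choice are just my own guesses, not the book's solution):
> library(corrplot)
> corrplot(correlations, order="hclust")                # visualize the between-predictor correlations
> library(caret)
> highcorr=findCorrelation(correlations, cutoff=0.75)   # candidates for removal
> names(Glassvalue)[highcorr]
> yjtrans=preProcess(Glassvalue, method="YeoJohnson")   # Box-Cox needs positive data; Yeo-Johnson tolerates zeros
> apply(predict(yjtrans, Glassvalue), 2, skewness)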





Tomorrow, I will try to figure the problems out.

9/28/2016

Data Pre-processing Computing

Today, I read the book “Applied Predictive Modeling” and started following the code in the book.

Summary
Understand the meaning of the following code:
1.
> library(AppliedPredictiveModeling)   # provides the segmentationOriginal data
> data(segmentationOriginal)
> segData=subset(segmentationOriginal, Case=="Train")   # keep only the training samples
> case=segData$Case
> segData=segData[, -(1:3)]                             # drop the identifier/outcome columns
> statusColNum=grep("Status", names(segData))
> segData=segData[, -statusColNum]                      # drop the binary "Status" versions of the predictors

2.
> grep("[a-z]", letters)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
>
> txt <- c("arm","foot","lefroo", "bafoobar")
> if(length(i <- grep("foo", txt)))
+    cat("'foo' appears at least once in\n\t", txt, "\n")
'foo' appears at least once in
         arm foot lefroo bafoobar
> i # 2 and 4
[1] 2 4
> txt[i]

Transformations
3.
> library(e1071)
> skewness(segData$AngleCh1)
> skewValues=apply(segData, 2, skewness)
> head(skewValues)

4.
> library(caret)
> Ch1AreaTrans=BoxCoxTrans(segData$AreaCh1)
> Ch1AreaTrans
Box-Cox Transformation

1009 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  150.0   194.0   256.0   325.1   376.0  2186.0

Largest/Smallest: 14.6
Sample Skewness: 3.53

Estimated Lambda: -0.9
> head(segData$AreaCh1)
[1] 819 431 298 256 258 358
> predict(Ch1AreaTrans, head(segData$AreaCh1))
[1] 1.108458 1.106383 1.104520 1.103554 1.103607 1.105523
> (819^(-0.9)-1)/(-0.9)   # applying the Box-Cox formula by hand with the estimated lambda of -0.9
[1] 1.108458

5.
> pcaobject=prcomp(segData, center=TRUE, scale.=TRUE)
> # percentage of the total variance accounted for by each component
> percentvariance=pcaobject$sdev^2/sum(pcaobject$sdev^2)*100
> percentvariance[1:3]
> head(pcaobject$x[, 1:5])          # transformed values (scores) for the first five components
> head(pcaobject$rotation[, 1:5])   # variable loadings for the first five components

The spatialSign function in caret applies the spatial sign transformation.
impute.knn (in the impute package) uses K-nearest neighbors to estimate missing data.
The preProcess function can apply imputation methods based on K-nearest neighbors or bagged trees.
The order in which the possible transformations are applied is: transformation, centering, scaling, imputation, feature extraction, and then spatial sign.
BoxCoxTrans estimates the appropriate transformation from a data set (including the estimated lambda) and can then apply it to new data with predict.
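A small sketch of how preProcess chains several of these steps together (the exact choice of methods here is just an illustration):
> trans=preProcess(segData, method=c("BoxCox", "center", "scale", "pca"))
> trans
> transformed=predict(trans, segData)
> head(transformed[, 1:5])   # Box-Cox transformed, centered, scaled, then projected onto PCA components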

Filtering
1.
> nearZeroVar(segData)

2.
> correlations=cor(segData)
> dim(correlations)
> correlations[1:5, 1:5]

3.
> library(corrplot)
> corrplot(correlations, order="hclust")

4.
> highcorr=findCorrelation(correlations, cutoff=0.75)
> length(highcorr)
> filteredsegdata=segData[, -highcorr]


Tomorrow, I will continue to learn computing.

Applied Predictive Modeling 3

Today, I read the book “Applied Predictive Modeling”. The reading part is Chapter 3, on the different ways of handling predictors during the data pre-processing procedure.

Summary
Missing data should not be confused with censored data, where the exact value is missing but something is known about its value.
One popular technique for imputation is a K-nearest neighbor model: a new sample is imputed by finding the samples in the training set ‘closest’ to it and averaging these nearby points to fill in the value.

Removing predictors
A tree-based model is impervious to an uninformative predictor, since that predictor would never be used in a split. However, a model such as linear regression would find these data problematic, and they are likely to cause errors in the computations. A rule of thumb for detecting such near-zero variance predictors:
1.     The fraction of unique values over the sample size is low (say, 10%), and
2.     The ratio of the frequency of the most prevalent value to that of the second most prevalent value is large (say, around 20).
If both of these criteria are true and the model in question is susceptible to this type of predictor, it may be advantageous to remove the variable from the model.
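caret's nearZeroVar reports exactly these two quantities (freqRatio and percentUnique) when saveMetrics=TRUE; a quick sketch, assuming the segData object from the computing notes:
> library(caret)
> nzv=nearZeroVar(segData, saveMetrics=TRUE)
> head(nzv)   # freqRatio, percentUnique, zeroVar and nzv flags for each predictor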
Between-Predictor Correlations
If the first principal component accounts for a large percentage of the variance, this implies that there is at least one group of predictors that represent the same information.
Algorithm:
1.     Calculate the correlation matrix of the predictors.
2.     Determine the two predictors associated with the largest absolute pairwise correlation (call them predictors A and B).
3.     Determine the average correlation between A and the other variables. Do the same for predictor B.
4.     If A has a larger average correlation, remove it; otherwise, remove predictor B.
5.     Repeat Steps 2-4 until no absolute correlations are above the threshold.
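caret's findCorrelation implements essentially this procedure; a minimal sketch (the 0.75 cutoff is only an example, the same value used in the computing notes above):
> library(caret)
> correlations=cor(segData)
> highcorr=findCorrelation(correlations, cutoff=0.75)   # predictors recommended for removal
> filteredsegdata=segData[, -highcorr]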

Adding Predictors
When a predictor is categorical, it is common to decompose the predictor into a set of more specific variables.

Binning Predictors

One common approach to simplifying a data set is to take a numeric predictor and pre-categorize or ‘bin’ it into two or more groups prior to data analysis.

Tomorrow, I will start trying out the computing.

9/26/2016

Applied Predictive Modeling 2&3

Today, I read the book “Applied Predictive Modeling”. The reading part is Chapter 2 and some of Chapter 3, covering the data pre-processing material.

Summary
1.4 data sets
Part 1 General Strategies
An alternative approach for quantifying how well the model operates is to use resampling, instead of simply using the same data used to build the model.
Resampling techniques are discussed in Chapter 4.
Feature selection is discussed in Chapter 19.
A Multivariate Adaptive Regression Spline (MARS) model and a quadratic regression model can both be appropriate for a single-predictor prediction problem.
Data Splitting, Predictor Data, Estimating Performance, Evaluating Several Models, Model Selection
Which feature engineering and data pre-processing methods are appropriate depends on the model being used and on the true relationship with the outcome.

Data Transformations of Individual Predictors
Centering and Scaling
Center: predictors have a zero mean
Scale: predictors are divided by standard deviation
The only real downside to these transformations is a loss of interpretability of the individual values, since the data are no longer in the original units.
Resolve Skewness
A right-skewed distribution has a larger number of points on the left side of the distribution (smaller values) than on the right side (larger values).
The formula for the sample skewness statistic is
skewness = \frac{\sum (x_i - \bar{x})^3}{(n-1)\, v^{3/2}}, \quad \text{where } v = \frac{\sum (x_i - \bar{x})^2}{n-1}
The empirical (Box-Cox) transformation is x^{*} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(x) & \text{if } \lambda = 0 \end{cases}
Maximum likelihood estimation (details are in Box and Cox's paper ‘An Analysis of Transformations’) is used to determine the transformation parameter λ.
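A small numerical check of these formulas in R (my own toy data; note that e1071's skewness uses a slightly different moment-based estimator by default, so the two skewness values will not match exactly):
> x=c(1, 2, 3, 4, 50)                            # toy right-skewed sample
> v=sum((x-mean(x))^2)/(length(x)-1)
> sum((x-mean(x))^3)/((length(x)-1)*v^(3/2))     # sample skewness as defined above
> library(e1071)
> skewness(x)                                    # moment-based version, for comparison
> boxcox=function(x, lambda) if (lambda != 0) (x^lambda-1)/lambda else log(x)
> boxcox(x, lambda=0.5)                          # the empirical transformation for one illustrative lambda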

Data Transformations for Multiple Predictors
Tree-based classification models create splits of the training data so the outlier does not usually have an exceptional influence on the model.
Spatial Sign data transformation:
x_{ij}^{*} = \frac{x_{ij}}{\sqrt{\sum_{j=1}^{P} x_{ij}^{2}}}
It is important to center and scale the predictor data prior to using this transformation.
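A short sketch with caret (centering and scaling first, as noted above; segData is the object from the computing notes):
> library(caret)
> segPP=preProcess(segData, method=c("center", "scale"))
> segTrans=predict(segPP, segData)
> ssData=spatialSign(segTrans)   # each row projected onto the unit sphere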

Principal component analysis (PCA) is a commonly used data reduction technique.
PC_j = (\alpha_{j1} \times \text{Predictor 1}) + (\alpha_{j2} \times \text{Predictor 2}) + \dots + (\alpha_{jP} \times \text{Predictor } P)

Tomorrow, I will read more of the book.

9/23/2016

Applied Predictive Modeling (Appendix B)

Today, I read the book “Applied Predictive Modeling”. The reading part is Appendix B, namely An Introduction to R. Generally, I know the basic knowledge of R.

An Introduction to R
apropos("term"): search the loaded packages for local R functions whose names match the term
RSiteSearch("term"): search the online R documentation for a function
library(package): load an installed package
install.packages("package"): install a package

The caret package (classification and regression training): quickly evaluate many different models to find the most appropriate tool for the data, with a unified interface to functions for model building and prediction.
Pre-processing and resampling
The foreach package allows R code to be run either sequentially or in parallel using several different technologies, such as the multicore or Rmpi packages. There are several R packages that work with foreach to implement these techniques, such as doMC (for multicore) or doMPI (for Rmpi).
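A minimal sketch of the foreach pattern (doMC is only one possible backend, and the core count here is arbitrary):
> library(doMC)                               # parallel backend based on multicore
> registerDoMC(cores=2)
> library(foreach)
> foreach(i=1:3, .combine=c) %dopar% sqrt(i)  # runs the iterations in parallel and combines results with c()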
Sweave: automatic generation of reports


I will continue to learn R next week.

R

Today, I searched for software that could be used for my research. However, when I checked my email, Dr. Siddharth Misra said that I should use R.

Anyway, I have looked into three tools: Torch, TensorFlow, and Python. They are all open source. I will learn R first because R is also a capable tool; in addition, Dr. Siddharth Misra sent me the book, so I can learn it faster than the other options.


I have downloaded R and will start learning it tomorrow.

9/21/2016

Reservoir properties from well logs using neural networks 3

Today, I finished reading Chapter 2 of the dissertation ‘Reservoir Properties from Well Logs Using Neural Networks’.

Summary
Neural networks can be used to perform the two basic tasks of pattern recognition and function approximation.

Pattern recognition
An unsupervised network followed by a supervised network:
1.     An unsupervised network for feature extraction, namely transforming the input patterns into an intermediate, smaller-dimensional space.
2.     A supervised network for classification: it maps the intermediate patterns into one of the classes in an r-dimensional space, where r is the number of classes to be distinguished.
A supervised network
It performs both feature extraction and classification, based on the information it has extracted from the training data.

Function approximation
Functional relationship: d = f(x), where x is the input vector and d the desired response.
Given a set of examples {(x_i, d_i)} that samples this function:
Approximate the unknown f(x) by a mapping F(x) such that ||F(x) - f(x)|| is small for all x.

Bias variance dilemma

Overtraining:
A small bias but a large variance
Cross-validation approach:
Early stopping method of training
In this study, the overtraining approach is used for predicting porosity and water saturation, the cross-validation approach for predicting permeability, and a soft overtraining approach for lithofacies identification.

An MLP network:
Advantages: It does not require any assumptions about the form of the underlying relationship. It exhibits a great degree of robustness or fault tolerance because of built-in redundancy. It has a strong capability for function approximation. Previous knowledge of the relationship between input and output is not necessary. The MLP can adapt its synaptic weights to changes in the surrounding environment by adjusting the weights to minimize the error.
Disadvantages: The LM (Levenberg-Marquardt) algorithm is used to adjust the weights. On the mean square error surface of a multilayer network, training may get stuck in a local minimum instead of converging to the global minimum.

Multiple network system: 
An actual multiple network system could consist of a mixture of ensemble and modular combinations at different levels. The architectures of the networks for predicting porosity and water saturation are ensemble combinations, while the architectures of the networks for predicting permeability and lithofacies are modular and ensemble combinations.

Ensemble combination:
1.    The bias of the ensemble-averaged function F̄(x) pertaining to the committee machine (CM) is the same as that of the function F(x) pertaining to a single neural network.
2.    The variance of F̄(x) is less than that of F(x).
3.     The individual experts should be purposely over-trained, reducing the bias at the cost of the variance.
Training parameters: the initial weights, the training data, the topology of the networks and the training algorithm.
The problem with this method is that it requires large amounts of data. Ways to work around this include:
1.     adaptively resampling the training sets,
2.     picking n samples from a training dataset of N samples,
3.     generating virtual samples.


Providing weights to each network by two approaches:
Unconstrained approach:
Combined output: y(x) = Σ_i w_i y_i(x), where y_i(x) is the output from the individual network i and the w_i are the weights.
The approximation error is the difference between the desired response and this combined output.
Constrained approach:
There is an additional constraint, namely that the weights sum to one: Σ_i w_i = 1.
The weights can be calculated from the training dataset, or partially on the training and test dataset. This method is called the optimal linear combination (OLC) method.

Modular combination:
The single network has a large number of adjustable parameters and hence the risk of overfitting the training dataset increases.
The training time for such a large network is likely to be longer than for all the experts trained in parallel.
1.     Avoids overfitting and saves training time
2.     Reduce model complexity, making the overall system easier to understand, modify, and extend
Class decomposition, automatic decomposition
There are four different models of combining component networks: cooperative, competitive, sequential and supervisory.
Ensembles are always involved in cooperative combination.


The input-output relationship is linear in MLR, whereas the relationship is nonlinear in a neural network. The neural network method does not force predicted values to lie near the mean values and thus preserves the natural variability in the data.

Tomorrow, I will read more of the dissertation.

9/20/2016

Reservoir properties from well logs using neural networks 2

Today, I read the dissertation ‘Reservoir Properties from Well Logs Using Neural Networks’. The reading parts are Chapter 2.

Summary

There are many types of learning algorithms that can be arranged into three main classes.
(i)             Supervised learning
The learning rule is provided with the proper inputs and the corresponding target outputs.
(ii)           Reinforcement learning
The algorithm is only given inputs and a grade.
(iii)         Unsupervised learning
The weights and biases are adjusted in response to network inputs only.

Perceptron Architecture: hard-limit transfer function
a = hardlim(Wp + b), with one decision boundary Wp + b = 0.
The error e for the kth iteration is given by e(k) = t(k) - a(k),
and the adjustments to the weights and bias are given by
W(k+1) = W(k) + e(k) p(k)^T,  b(k+1) = b(k) + e(k).
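A toy R sketch of this learning rule on AND-gate data (entirely my own illustration, not from the dissertation):
hardlim <- function(n) as.numeric(n >= 0)
p <- matrix(c(0,0, 0,1, 1,0, 1,1), nrow=2)   # one input pattern per column
tgt <- c(0, 0, 0, 1)                         # AND-gate targets
W <- matrix(0, nrow=1, ncol=2)
b <- 0
for (epoch in 1:10) {
  for (k in 1:ncol(p)) {
    a <- hardlim(W %*% p[, k] + b)   # network output for the kth pattern
    e <- tgt[k] - a                  # error for this iteration
    W <- W + e %*% t(p[, k])         # W_new = W_old + e * p^T
    b <- b + e                       # b_new = b_old + e
  }
}
W
b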

ADALINE Architecture: linear transfer function
a = purelin(Wp + b) = Wp + b, with one decision boundary Wp + b = 0.
The LMS algorithm adjusts the weights and biases of the ADALINE architecture by minimizing the mean square error between the target output and the computed output, defined by mse = E[(t - a)^2].
The adjustments for the kth iteration are given by
W(k+1) = W(k) + 2α e(k) p(k)^T,  b(k+1) = b(k) + 2α e(k),
where α is a constant known as the learning factor (learning rate).

Both the ADALINE and the perceptron have the same limitation: they can classify only linearly separable problems.
The LMS algorithm is optimal for a single neuron because the mean squared error surface for a single neuron has only one minimum point and constant curvature, providing a unique solution.
However, the LMS algorithm fails to produce a unique solution in multilayer networks.
The credit assignment problem was solved using the error back-propagation (BP) method, which is a generalization of the LMS algorithm.

Steepest descent algorithm: update the parameters in the direction of the negative gradient, x(k+1) = x(k) - α ∇F(x(k)).
LMBP algorithm:
The main drawback of the algorithm is the large memory and storage space it needs for the free parameters.
Overfitting is the result of using more hidden neurons than are actually necessary. However, if the number of hidden neurons is less than the optimum number, the network is unable to learn the correct input-output mapping. Hence it is important to determine the optimum number of hidden neurons for a given problem.

Generalization is influenced by three factors:
The size of the training set
The architecture of the neural network
The complexity of the problem at hand
This problem can be analyzed in terms of the Vapnik-Chervonenkis (VC) dimension, which is a measure of the capacity or expressive power of the family of classification functions realized by a network. It can be defined as the maximum number of training examples that the network can classify correctly under every possible binary labeling of those examples.
As a rule of thumb, an accuracy of (1 - ε) on new data requires on the order of W/ε training samples, where W is the number of free parameters (weights) in the network; for example, a network with 1,000 weights and ε = 0.1 would need roughly 10,000 samples.



Tomorrow, I will read more of the dissertation.