What Is Cross-Validation?
Cross-validation is used mainly in modeling applications such as principal component regression (PCR) and partial least squares (PLS) regression. Given a set of modeling samples, most of the samples are used to build a model and a small part is held back. The model just built then predicts the held-back samples, and the sum of squares of their prediction errors is recorded.
- When training the parameters of a model, people usually divide the full data set into three parts (the MNIST handwritten-digit data is a common example): a training set (train_set), a validation set (valid_set), and a test set (test_set). The split exists to make sure that training actually works. The test set is the easiest to understand: it takes no part in training at all and is used only to observe the final results. The training and validation sets rely on the ideas below.
- In practice, a trained model usually fits the training set quite well (and the fit can be sensitive to initial conditions), but fits data outside the training set less well. For this reason we usually do not train on all of the data. Instead we set aside a part that takes no part in training, use it to test the parameters learned from the training set, and so judge relatively objectively how well those parameters fit data outside the training set. This idea is called cross-validation [1].
- Cross-validation, sometimes called rotation estimation, is a practical statistical method that partitions a data sample into smaller subsets. The theory was proposed by Seymour Geisser.
- Given a set of modeling samples, use most of them to build a model, hold back a small part, predict the held-back samples with the model just built, and record the sum of squares of their prediction errors. Repeat this process until every sample has been predicted once and only once. The sum of the squared prediction errors over all samples is called PRESS (predicted residual error sum of squares).
- The basic idea of cross-validation is to split the original data set into groups, one part serving as the training set and the other as the validation set or test set. First the classifier is trained on the training set, and then the validation set is used to test the trained model. The result serves as the performance measure of the classifier.
- The purpose of cross-validation is to obtain a reliable and stable model. When building a PCR or PLS model, an important question is how many principal components to keep. Use cross-validation to compute the PRESS value for each candidate number of principal components, then choose the number that gives the smallest PRESS value, or the number beyond which the PRESS value no longer decreases.
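The PRESS procedure described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it does not implement PCR or PLS. Instead it compares two stand-in models (a constant mean predictor and a simple least-squares line) and keeps the one with the smaller PRESS, in the same way one would compare different numbers of principal components. The function names `fit_mean`, `fit_line`, and `press`, and the toy data, are hypothetical.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Stand-in model 1: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Stand-in model 2: least-squares line y = a + b * x (closed form)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x


def press(fit, xs, ys):
    """Leave each sample out once, predict it, and sum the squared errors."""
    total = 0.0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)  # built without sample i
        total += (ys[i] - model(xs[i])) ** 2
    return total


# Toy data (made up for illustration): roughly linear.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

scores = {"mean": press(fit_mean, xs, ys), "line": press(fit_line, xs, ys)}
best = min(scores, key=scores.get)  # candidate with the smallest PRESS
print(scores, best)
```

With a real PCR or PLS model, `fit` would be replaced by a routine that builds the model with a given number of components, and `press` would be evaluated once per candidate number.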
- Cross-validation is the most commonly used way to estimate accuracy, for example 10-fold cross-validation. The main variants are described below.
- Holdout validation
- Strictly speaking, holdout validation is not cross-validation, because the data are never swapped between the two roles. A portion of the initial sample is selected at random as validation data, and the rest is used as training data. Generally, less than one-third of the original sample is selected as validation data.
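A minimal sketch of a holdout split, assuming the samples fit in a Python list. The function name `holdout_split`, its parameters, and the 25% validation fraction are illustrative choices, not something the text above prescribes beyond "less than one-third".

```python
import random


def holdout_split(samples, valid_fraction=0.25, seed=0):
    """Randomly split samples into (train, valid) with no overlap."""
    if not 0.0 < valid_fraction < 1.0:
        raise ValueError("valid_fraction must be between 0 and 1")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_valid = int(len(samples) * valid_fraction)
    valid = [samples[i] for i in indices[:n_valid]]
    train = [samples[i] for i in indices[n_valid:]]
    return train, valid


data = list(range(12))
train, valid = holdout_split(data, valid_fraction=0.25, seed=42)
print(len(train), len(valid))
```

Each sample lands in exactly one of the two parts, and the split is made only once, which is why the estimate depends on which samples happen to be held out.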
- K-fold cross-validation
- In K-fold cross-validation, the initial sample is divided into K sub-samples. A single sub-sample is retained as the data for validating the model, and the other K-1 sub-samples are used for training. The process is repeated K times so that each sub-sample is used for validation exactly once. The K results are then averaged, or combined in some other way, to obtain a single estimate. The advantage of this method is that every sample is used for both training and validation, and each sample is validated exactly once. 10-fold cross-validation is the most commonly used [3].
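The K-fold procedure can be sketched as an index generator, followed by an average over the K fold scores. This is a sketch under assumptions: `k_fold_indices` and `mse_of_mean` are hypothetical names, the scoring model is a trivial mean predictor chosen only to keep the example short, and the data are made up.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, valid_idx) pairs; each index is validated exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # K nearly equal sub-samples
    for i, valid_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, valid_idx


# Toy targets (made up for illustration).
ys = [float(v) for v in range(20)]


def mse_of_mean(train_idx, valid_idx):
    """Score one fold: predict the training mean, measure error on the held-out fold."""
    m = sum(ys[i] for i in train_idx) / len(train_idx)
    return sum((ys[i] - m) ** 2 for i in valid_idx) / len(valid_idx)


fold_scores = [mse_of_mean(tr, va) for tr, va in k_fold_indices(len(ys), k=10)]
cv_estimate = sum(fold_scores) / len(fold_scores)  # single combined estimate
print(cv_estimate)
```

Averaging is only one way to combine the fold results. The text above notes that other combination methods are possible.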
- Leave-one-out validation
- As the name suggests, leave-one-out cross-validation (LOOCV) uses only one of the original samples as validation data and the rest as training data. This process continues until each sample has been used as validation data once. It is the same as K-fold cross-validation with K equal to the number of original samples. In some cases efficient algorithms exist, for example for kernel regression and Tikhonov regularization.
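To illustrate both the brute-force procedure and the idea of an efficient shortcut, the sketch below uses ordinary least squares on a single predictor. This is a swapped-in example, not kernel regression or Tikhonov regularization. For a least-squares line, the leave-one-out residual of sample i equals the ordinary residual divided by (1 - h_i), where h_i = 1/n + (x_i - mean(x))^2 / Sxx is the leverage, so the whole LOOCV sum can be computed from one fit. All function names and the toy data are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares line; returns (intercept, slope)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def loocv_press_brute_force(xs, ys):
    """Refit n times, each time leaving one sample out and predicting it."""
    total = 0.0
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (a + b * xs[i])) ** 2
    return total


def loocv_press_shortcut(xs, ys):
    """Single fit on all samples; rescale each residual by 1 / (1 - leverage)."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    mx = mean(xs)
    sxx = sum((x - mx) ** 2 for x in xs)
    total = 0.0
    for x, y in zip(xs, ys):
        leverage = 1.0 / n + (x - mx) ** 2 / sxx
        total += ((y - (a + b * x)) / (1.0 - leverage)) ** 2
    return total


# Toy data (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
brute = loocv_press_brute_force(xs, ys)
fast = loocv_press_shortcut(xs, ys)
print(brute, fast)
```

The two values agree up to floating-point rounding, but the shortcut fits the model once instead of n times.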