Problem description
We are writing a small ANN that is supposed to categorize 7000 products into 7 classes based on 10 input variables.
To do this we have to use k-fold cross-validation, but we are somewhat confused.
We have this excerpt from the presentation slide:
What exactly are the validation and test sets?
From what we understand, we run through the 3 training sets and adjust the weights (single epoch). Then what do we do with the validation set? As far as I understand, the test set is used to get the error of the network.
What happens next is also confusing to me. When does the crossover take place?
If it's not too much to ask, a bullet list of steps would be appreciated.
Recommended answer
You seem to be a bit confused (I remember I was too), so I am going to simplify things for you. ;)
Whenever you are given a task such as devising a neural network, you are often also given a sample dataset to use for training purposes. Let us assume you are training a simple neural network system Y = W · X, where Y is the output computed as the scalar product (·) of the weight vector W with a given sample vector X. Now, the naive way to go about this would be to use the entire dataset of, say, 1000 samples to train the neural network. Assuming that the training converges and your weights stabilise, you can reasonably expect your network to classify the training data well. But what happens to the network if it is presented with previously unseen data? Clearly the purpose of such systems is to be able to generalise and correctly classify data other than the data used for training.
In any real-world situation, however, previously unseen data is only available once your neural network is deployed in a, let's call it, production environment. But since you have not tested it adequately, you are probably going to have a bad time. :) The phenomenon by which a learning system matches its training set almost perfectly but consistently fails on unseen data is called overfitting.
This is where the validation and testing parts of the algorithm come in. Let's go back to the original dataset of 1000 samples. What you do is split it into three sets, training, validation and testing (Tr, Va and Te), using carefully selected proportions. (80-10-10)% is usually a good proportion, where:
- Tr = 80%
- Va = 10%
- Te = 10%
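As a rough illustration of that split, here is a minimal Python sketch. The function name split_dataset, the fixed seed and the list of integers standing in for the 1000 samples are all assumptions made for the example; they are not part of the original answer.

```python
import random

def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle a copy of the samples and cut it into (Tr, Va, Te)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    tr = shuffled[:n_train]
    va = shuffled[n_train:n_train + n_val]
    te = shuffled[n_train + n_val:]  # whatever is left becomes the test set
    return tr, va, te

data = list(range(1000))  # stand-in for 1000 real samples
tr, va, te = split_dataset(data)
print(len(tr), len(va), len(te))
```

Shuffling before cutting matters when the data is stored in some order (for example sorted by class); otherwise one of the three sets could end up missing whole classes.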
Now what happens is that the neural network is trained on the Tr set and its weights are updated accordingly. The validation set Va is then used to compute the classification error E = M - Y using the weights resulting from the training, where M is the expected output vector taken from the validation set and Y is the computed output resulting from the classification (Y = W · X). If the error is higher than a user-defined threshold, the whole training-validation epoch is repeated. This training phase ends when the error computed using the validation set is deemed low enough.
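The train-then-validate loop described above could look roughly like the sketch below. It assumes a noise-free linear target, a simple delta-rule weight update, a mean absolute error as the scalar summary of E = M - Y, and made-up values for the learning rate, threshold and epoch cap; none of these choices come from the original answer.

```python
import random

def dot(w, x):
    """Scalar product W . X."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_epoch(w, tr, lr=0.005):
    """One pass over Tr, nudging W after every sample (delta rule)."""
    for x, m in tr:
        err = m - dot(w, x)
        w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w

def validation_error(w, va):
    """Mean of |M - Y| over the validation set."""
    return sum(abs(m - dot(w, x)) for x, m in va) / len(va)

# Synthetic data: targets generated by a known weight vector (an assumption).
rng = random.Random(0)
true_w = [0.5, -1.0, 2.0]
samples = []
for _ in range(1000):
    x = [rng.uniform(-1, 1) for _ in true_w]
    samples.append((x, dot(true_w, x)))
tr, va = samples[:800], samples[800:900]  # the last 100 are kept aside as Te

w = [0.0] * len(true_w)
threshold, max_epochs = 1e-3, 200
for epoch in range(1, max_epochs + 1):
    w = train_epoch(w, tr)           # training step on Tr
    err = validation_error(w, va)    # validation step on Va
    if err <= threshold:             # low enough: stop training
        break
print(epoch, err)
```

The epoch cap is there because a real, noisy dataset may never reach the chosen threshold; without it the loop could run forever.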
Now, a neat trick here is to randomly re-select which samples to use for training and which for validation from the total set Tr + Va at each epoch iteration. This helps keep the network from over-fitting one particular training set.
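A minimal sketch of that per-epoch re-selection, assuming the pooled Tr + Va samples are held in a plain list and that the test set has already been set aside. The helper name resplit and the sizes are illustrative only.

```python
import random

def resplit(pool, val_size, rng):
    """Draw a fresh random Tr/Va partition from the pooled Tr + Va samples."""
    shuffled = pool[:]
    rng.shuffle(shuffled)
    return shuffled[val_size:], shuffled[:val_size]

rng = random.Random(42)
pool = list(range(900))  # Tr + Va; Te never enters this pool
for epoch in range(3):
    tr, va = resplit(pool, val_size=100, rng=rng)
    # ... train on tr, then measure the error on va ...
    assert not set(tr) & set(va)  # the two sets never share a sample
```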
The testing set Te is then used to measure the performance of the network. This data is perfect for this purpose as it was never used during the training and validation phase. It is effectively a small set of previously unseen data, which is supposed to mimic what would happen once the network is deployed in the production environment.
The performance is again measured in terms of classification error, as explained above. It can also (or maybe even should) be measured in terms of precision and recall, so as to know where and how the errors occur, but that's a topic for another Q&A.
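For readers who have not met precision and recall, here is a small sketch of how they can be computed per class from expected and predicted labels. The labels and the function name precision_recall are invented for the example.

```python
def precision_recall(expected, predicted, positive):
    """Precision and recall for one class, treated as the 'positive' class."""
    pairs = list(zip(expected, predicted))
    tp = sum(1 for e, p in pairs if p == positive and e == positive)
    fp = sum(1 for e, p in pairs if p == positive and e != positive)
    fn = sum(1 for e, p in pairs if p != positive and e == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of those predicted, how many were right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of those expected, how many were found
    return precision, recall

expected  = ["a", "a", "b", "b", "c", "a"]
predicted = ["a", "b", "b", "b", "c", "c"]
for label in ("a", "b", "c"):
    print(label, precision_recall(expected, predicted, label))
```

In this toy data class "a" has perfect precision but low recall (the network rarely says "a", yet is right when it does), which is exactly the kind of detail a single error figure hides.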
Having understood this training-validation-testing mechanism, one can further strengthen the network against over-fitting by performing K-fold cross-validation. This is somewhat of an evolution of the random re-selection trick I explained above. The technique involves performing K rounds of training-validation-testing on different, equally-proportioned Tr, Va and Te sets, where the test sets of the individual rounds do not overlap.
Given K = 10, for each of the K rounds you split your dataset into Tr+Va = 90% and Te = 10%, run the algorithm, and record the testing performance.
k = 10
for i in 1:k
    # Select unique training and testing datasets:
    # fold i is held out for testing, the other k-1 folds form Tr + Va
    KFoldTesting <-- fold i of Data
    KFoldTraining <-- all folds of Data except fold i
    # Train and record performance
    KFoldPerformance[i] <-- SmartTrain(KFoldTraining, KFoldTesting)
end
# Compute overall performance (e.g. the mean over the k rounds)
TotalPerformance <-- ComputePerformance(KFoldPerformance)
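The same loop can be sketched in runnable Python. Here k_fold_indices is a hand-rolled helper written for this example, smart_train is a hypothetical stand-in for SmartTrain that only returns a dummy score, and averaging the scores is one assumed choice for ComputePerformance.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def smart_train(train_idx, test_idx):
    """Hypothetical placeholder: a real version would train on Tr + Va
    and return the error measured on the held-out test fold."""
    return len(test_idx) / (len(train_idx) + len(test_idx))

scores = [smart_train(tr, te) for tr, te in k_fold_indices(1000, k=10)]
total_performance = sum(scores) / len(scores)  # mean over the k rounds
print(len(scores), total_performance)
```

With the dummy score each round simply reports the test fraction, so the printed mean is about 0.1; in a real run it would be the average test error across the ten rounds.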
Overfitting Shown
I am taking the world-famous plot below from Wikipedia to show how the validation set helps prevent overfitting. The training error, in blue, tends to decrease as the number of epochs increases: the network is therefore attempting to match the training set exactly. The validation error, in red, on the other hand follows a different, U-shaped profile. The minimum of the red curve is where the training should ideally be stopped, as this is the point at which the validation error is lowest; beyond it the training error keeps falling while the validation error rises, which is the signature of overfitting.
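Stopping at the minimum of the validation curve is commonly called early stopping. A minimal sketch, assuming the validation error is recorded once per epoch and that training halts after it has failed to improve for a few epochs (the patience value and the error figures below are invented for illustration):

```python
def best_epoch(val_errors, patience=3):
    """Return the epoch with the lowest validation error, giving up once
    the error has not improved for `patience` consecutive epochs."""
    best, best_err = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best, best_err = epoch, err
        elif epoch - best >= patience:
            break  # validation error has been rising for a while: stop
    return best

# Illustrative curves: training error keeps falling, validation error is U-shaped.
train_errors = [0.90, 0.60, 0.40, 0.30, 0.22, 0.17, 0.13, 0.10, 0.08]
val_errors   = [0.95, 0.70, 0.50, 0.42, 0.40, 0.43, 0.47, 0.52, 0.60]
stop_at = best_epoch(val_errors)
print(stop_at, val_errors[stop_at], train_errors[stop_at])
```

The weights one would keep are those saved at the returned epoch, not those from the last epoch that was run.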
For more references, this excellent book will give you both a sound knowledge of machine learning and several migraines. Up to you to decide if it's worth it. :)