Problem description
I have the following methods, which are part of the logic for performing stratified k-fold cross-validation.
private static IEnumerable<IEnumerable<int>> GenerateFolds(
    IClassificationProblemData problemData, int numberOfFolds)
{
    IRandom random = new MersenneTwister();
    IEnumerable<double> values = problemData.Dataset.GetDoubleValues(
        problemData.TargetVariable, problemData.TrainingIndices);
    var valuesIndices =
        problemData.TrainingIndices.Zip(values, (i, v) => new { Index = i, Value = v });
    IEnumerable<IEnumerable<IEnumerable<int>>> foldsByClass =
        valuesIndices.GroupBy(x => x.Value, x => x.Index)
                     .Select(g => GenerateFolds(g, g.Count(), numberOfFolds));
    var enumerators = foldsByClass.Select(x => x.GetEnumerator()).ToList();
    while (enumerators.All(e => e.MoveNext()))
    {
        var fold = enumerators.SelectMany(e => e.Current).OrderBy(x => random.Next());
        yield return fold.ToList();
    }
}
Fold generation:
private static IEnumerable<IEnumerable<T>> GenerateFolds<T>(
    IEnumerable<T> values, int valuesCount, int numberOfFolds)
{
    // base fold size (integer division) and the remainder to spread over the first folds
    int f = valuesCount / numberOfFolds, r = valuesCount % numberOfFolds;
    int start = 0, end = f;
    for (int i = 0; i < numberOfFolds; ++i)
    {
        // the first r folds each receive one extra element
        if (r > 0)
        {
            ++end;
            --r;
        }
        yield return values.Skip(start).Take(end - start);
        start = end;
        end += f;
    }
}
The generic GenerateFolds<T> method simply splits an IEnumerable<T> into a sequence of IEnumerable<T>s according to the specified number of folds. For example, if I had 101 training samples, it would generate one fold of size 11 and 9 folds of size 10.
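As a quick check of the 101-sample example, here is a minimal sketch that reuses the splitting logic from the generic method and reports the fold sizes. The class name FoldSizeDemo and the helper FoldSizes are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FoldSizeDemo
{
    // Same splitting logic as the generic GenerateFolds<T> in the question.
    public static IEnumerable<IEnumerable<T>> GenerateFolds<T>(
        IEnumerable<T> values, int valuesCount, int numberOfFolds)
    {
        int f = valuesCount / numberOfFolds, r = valuesCount % numberOfFolds;
        int start = 0, end = f;
        for (int i = 0; i < numberOfFolds; ++i)
        {
            if (r > 0) { ++end; --r; }
            yield return values.Skip(start).Take(end - start);
            start = end;
            end += f;
        }
    }

    // Hypothetical helper: counts each fold as soon as it is yielded.
    public static List<int> FoldSizes(int sampleCount, int numberOfFolds)
    {
        var samples = Enumerable.Range(0, sampleCount);
        return GenerateFolds(samples, sampleCount, numberOfFolds)
            .Select(fold => fold.Count())
            .ToList();
    }

    public static void Main()
    {
        // prints: 11, 10, 10, 10, 10, 10, 10, 10, 10, 10
        Console.WriteLine(string.Join(", ", FoldSizes(101, 10)));
    }
}
```

Skip and Take receive start and end as plain int arguments at the moment of the call, so each returned fold keeps its own bounds even though it is enumerated later.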
The method above it groups the samples by class value, splits each group into the specified number of folds, and then joins the per-class folds into the final folds, ensuring the same distribution of class labels in every fold.
My question concerns the line yield return fold.ToList(). As it is, the method works correctly. If I remove the ToList(), however, the results are no longer correct. In my test case I have 641 training samples and 10 folds, which means the first fold should be of size 65 and the remaining folds of size 64. But when I remove ToList(), all the folds are of size 64 and the class labels are not correctly distributed. Any ideas why? Thank you.
Recommended answer
Let's think about what the fold variable is:
var fold = enumerators.SelectMany(e => e.Current).OrderBy(x => random.Next());
It is not the result of executing a query. It is a query definition, because both SelectMany and OrderBy are operators with deferred execution. So it just stores the knowledge that the current items of all enumerators should be flattened and returned in random order. I stress the word current because it means whichever item is current at the time the query is executed, not at the time it is defined.
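To make the difference between a query definition and its result concrete, here is a small sketch. The names DeferredExecutionDemo, groups, and snapshot are invented for this example. The query is defined once, the source changes afterwards, and only the ToList() snapshot is unaffected.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionDemo
{
    // Returns the snapshot count followed by the count of the re-executed query.
    public static List<int> Run()
    {
        var groups = new List<int[]> { new[] { 3, 1 }, new[] { 2 } };

        // Only a definition: nothing is flattened or sorted yet.
        IEnumerable<int> query = groups.SelectMany(g => g).OrderBy(x => x);

        // ToList() executes the query now and stores the result: 1, 2, 3.
        List<int> snapshot = query.ToList();

        // Changing the source afterwards affects later executions of the query.
        groups.Add(new[] { 0 });

        // query.Count() runs the query again and sees the added group.
        return new List<int> { snapshot.Count, query.Count() };
    }

    public static void Main()
    {
        // prints: 3, 4
        Console.WriteLine(string.Join(", ", Run()));
    }
}
```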
Now let's think about when this query will be executed. The result of calling GenerateFolds is an IEnumerable of IEnumerable<int> queries. The following code does not execute any of the queries:
var folds = GenerateFolds(problemData, numberOfFolds);
This, again, is just a query. You can execute it by calling ToList() or by enumerating it:
var f = folds.ToList();
But even now the inner queries are not executed. They have all been returned, but not executed. That is, the while loop in GenerateFolds ran to completion while you saved the queries to the list f, and e.MoveNext() was called repeatedly until the loop exited:
while (enumerators.All(e => e.MoveNext()))
{
    var fold = enumerators.SelectMany(e => e.Current).OrderBy(x => random.Next());
    yield return fold;
}
So what does f hold? It holds a list of queries. By the time you have them all, the current item of each enumerator is its last item (remember, the while loop has run to completion at this point). Strictly speaking, Current is undefined once MoveNext has returned false, but enumerators produced by iterator methods keep returning the last yielded item in practice. And none of these queries has been executed yet! Here you execute the first of them:
f[0].Count()
You get the count of items returned by the first query (the fold query shown above). But because all the queries have already been enumerated into the list, the current item of every enumerator is its last item, so you get the count of indices in the last fold.
Now take:
folds.First().Count()
Here you don't enumerate all the queries in order to save them in a list. That is, the body of the while loop is executed only once, and the current item is the first item. That's why you get the count of indices in the first fold, and that's why the two values differ.
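The difference between the two calls can be reproduced without the problem data. The sketch below is a simplified stand-in, not the original method: ClassA and ClassB are hypothetical per-class fold sequences, and the random ordering is left out. It relies on the observed behavior that enumerators generated from iterator methods keep returning the last yielded item after MoveNext returns false.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyFoldDemo
{
    // Hypothetical class A: folds of size 3 and 2.
    static IEnumerable<IEnumerable<int>> ClassA()
    {
        yield return new[] { 0, 1, 2 };
        yield return new[] { 3, 4 };
    }

    // Hypothetical class B: two folds of size 2.
    static IEnumerable<IEnumerable<int>> ClassB()
    {
        yield return new[] { 10, 11 };
        yield return new[] { 12, 13 };
    }

    // Joins the per-class folds lazily, like the loop without ToList().
    public static IEnumerable<IEnumerable<int>> JoinLazily()
    {
        var enumerators = new[] { ClassA(), ClassB() }
            .Select(x => x.GetEnumerator()).ToList();
        while (enumerators.All(e => e.MoveNext()))
        {
            // Only a query definition: e.Current is read when the fold is enumerated.
            yield return enumerators.SelectMany(e => e.Current);
        }
    }

    // The outer loop runs to completion before the first fold is counted.
    public static int FirstCountAfterToList() => JoinLazily().ToList()[0].Count();

    // The outer loop stops after one iteration, so Current is still the first fold.
    public static int FirstCountViaFirst() => JoinLazily().First().Count();

    public static void Main()
    {
        Console.WriteLine(FirstCountAfterToList()); // 4 with iterator-generated enumerators
        Console.WriteLine(FirstCountViaFirst());    // 5
    }
}
```

The first fold should contain 3 + 2 = 5 indices. Counting it after the outer sequence has been materialized reads the last per-class folds instead and gives 2 + 2 = 4.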
Last question: why does everything work when you add ToList() inside your while loop? The answer is simple: ToList() executes each query, so you yield a list of indices instead of a query definition. Each query is executed during its own iteration, so the current item is different each time, and your code works correctly.
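The effect of materializing inside the loop can be sketched the same way. This is an illustrative reduction, not the original method: Chunks and JoinEagerly are hypothetical helpers, and the random ordering is omitted because it does not affect the fold sizes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EagerFoldDemo
{
    // Hypothetical per-class folds: consecutive index ranges of the given sizes.
    static IEnumerable<IEnumerable<int>> Chunks(int start, params int[] sizes)
    {
        foreach (int size in sizes)
        {
            yield return Enumerable.Range(start, size);
            start += size;
        }
    }

    public static IEnumerable<IEnumerable<int>> JoinEagerly(
        IEnumerable<IEnumerable<IEnumerable<int>>> foldsByClass)
    {
        var enumerators = foldsByClass.Select(x => x.GetEnumerator()).ToList();
        while (enumerators.All(e => e.MoveNext()))
        {
            // ToList() runs the query now, while each e.Current is the matching fold.
            yield return enumerators.SelectMany(e => e.Current).ToList();
        }
    }

    public static List<int> FoldSizes()
    {
        // Class A has folds of size 3 and 2, class B has folds of size 2 and 2.
        var foldsByClass = new[] { Chunks(0, 3, 2), Chunks(100, 2, 2) };

        // Materializing the outer sequence first no longer changes the result.
        return JoinEagerly(foldsByClass).ToList().Select(f => f.Count()).ToList();
    }

    public static void Main()
    {
        // prints: 5, 4
        Console.WriteLine(string.Join(", ", FoldSizes()));
    }
}
```

Because each fold is a finished list by the time it is yielded, it no longer depends on where the enumerators are positioned when the caller gets around to reading it.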