R - caret createDataPartition returns more samples than expected

Problem description

I'm trying to split the iris dataset into a training set and a test set. I used createDataPartition() like this:

library(caret)
createDataPartition(iris$Species, p=0.1)
# [1]  12  22  26  41  42  57  63  79  89  93 114 117 134 137 142

createDataPartition(iris$Sepal.Length, p=0.1)
# [1]   1  27  44  46  54  68  72  77  83  84  93  99 104 109 117 132 134

I understand the first result: a vector of 0.1*150 = 15 elements (150 is the number of samples in the dataset). I expected a vector of the same length from the second call, but I get 17 elements instead of 15.

Any ideas as to why I get these results?

Recommended answer

Sepal.Length is a numeric feature; from the online documentation:

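For numeric y, the sample is split into groups sections based on percentiles and sampling is done within these subgroups. For createDataPartition, the number of percentiles is set via the groups argument.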

groups: for numeric y, the number of breaks in the quantiles

with default value:

groups = min(5, length(y))

Here is what happens in your case:

Since you do not specify groups, it takes a value of min(5, 150) = 5 breaks; in that case, these breaks coincide with the quartiles, i.e. the minimum, the 1st quartile, the median, the 3rd quartile, and the maximum, which you can see from the summary:

> summary(iris$Sepal.Length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.300   5.100   5.800   5.843   6.400   7.900

For numeric features, the function samples a fraction p = 0.1 from each of the 4 intervals defined by the above breaks (quantiles); let's see how many samples fall in each interval:

l1 = length(which(iris$Sepal.Length >= 4.3 & iris$Sepal.Length <= 5.1)) # 41
l2 = length(which(iris$Sepal.Length > 5.1 & iris$Sepal.Length <= 5.8))  # 39
l3 = length(which(iris$Sepal.Length > 5.8 & iris$Sepal.Length <= 6.4))  # 35
l4 = length(which(iris$Sepal.Length > 6.4 & iris$Sepal.Length <= 7.9))  # 35
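
The same counts can be reproduced without hard-coding the interval limits. The following is a minimal sketch in base R that mirrors the grouping described above (quantile breaks, then binning); it assumes the breaks are taken at equally spaced probabilities and that the lowest value is included in the first interval, and it is not a copy of caret's own code. The variable names are chosen here for illustration.

```r
# Sketch of the grouping step for a numeric outcome, using base R only.
y <- iris$Sepal.Length
groups <- min(5, length(y))   # the default quoted above

# 5 equally spaced probabilities -> 5 breaks -> 4 intervals
# (assumption: breaks are plain quantiles of y)
breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups)))

# include.lowest = TRUE keeps the minimum (4.3) inside the first interval
bins <- cut(y, breaks = breaks, include.lowest = TRUE)

counts <- table(bins)
print(counts)
```

The table should show the same four interval sizes as l1 to l4 above.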

Exactly how many samples will be returned from each interval? Here is the catch: according to line # 140 of the source code, it will be the ceiling of the product of the number of samples in the interval and your p. Rounding up in each of the 4 intervals separately can add up to more than 0.1*150 = 15; let's see what this gives in your case for p = 0.1:

p = 0.1
ceiling(l1*p) + ceiling(l2*p) + ceiling(l3*p) + ceiling(l4*p)
# 17
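
To check other values of p or groups before calling createDataPartition, the arithmetic above can be wrapped in a small function. This is only a sketch: expected_partition_size is a hypothetical helper, not part of caret, and it assumes the per-interval ceiling rule described above is the only thing that determines the size (any special cases in caret's source are ignored).

```r
# Hypothetical helper (not a caret function): predicts how many indices a
# per-interval ceiling rule yields for a numeric outcome y.
expected_partition_size <- function(y, p, groups = min(5, length(y))) {
  # Quantile breaks at equally spaced probabilities (assumed, as above)
  breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups)))
  bins <- cut(y, breaks = breaks, include.lowest = TRUE)
  # Round up within each interval, then add the intervals together
  per_bin <- ceiling(as.integer(table(bins)) * p)
  sum(per_bin)
}

n_numeric <- expected_partition_size(iris$Sepal.Length, p = 0.1)
n_naive <- round(nrow(iris) * 0.1)   # what one might expect: 0.1 * 150

print(c(per_interval = n_numeric, naive = n_naive))
```

The gap between the two numbers comes entirely from rounding up once per interval instead of once for the whole dataset.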

Bingo! :)
