Problem description
The problem: I need to divide several large data frames (e.g. 50k rows each) into smaller chunks that all have the same number of rows. However, I don't want to set the chunk size manually for each dataset. Instead, I want code that:
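- Checks the length of the data frame and determines how many chunks of roughly a few thousand rows it can be broken into
- Minimizes the number of leftover rows that have to be discarded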
The answers provided here are relevant: Split a vector into chunks in R
However, I don't want to set a chunk size manually. I want the code to find the "optimal" chunk size, meaning the one that minimizes the remainder.
Example (based on Harlan's answer at the link above):
df <- rnorm(20752)
max <- 5000  # chunk size; matches the 5000-element chunks in the output below
x <- seq_along(df)
df <- split(df, ceiling(x/max))
str(df)
> List of 5
> $ 1: num [1:5000] -1.4 -0.496 -1.185 -2.071 -1.118 ...
> $ 2: num [1:5000] 0.522 1.607 -2.228 -2.044 0.997 ...
> $ 3: num [1:5000] 0.295 0.486 -1.085 0.515 0.96 ...
> $ 4: num [1:5000] 0.695 -0.58 -1.676 1.052 1.266 ...
> $ 5: num [1:752] -0.6468 0.1731 0.5788 -0.0584 0.8479 ...
If I had chosen a chunk size of 4100 rows, I would have 5 chunks with a remainder of 252 rows. That is preferable because I would discard fewer data points. As long as the chunks have at least a few thousand rows, I don't care exactly what size they are.
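As a quick check of the arithmetic above, here is a minimal sketch using R's integer division and modulo operators. The variable names are illustrative and not from the original post.

```r
n_rows <- 20752      # length of the example vector
chunk_size <- 4100   # candidate chunk size

n_chunks <- n_rows %/% chunk_size  # number of full chunks
leftover <- n_rows %% chunk_size   # rows that would be discarded

n_chunks  # 5
leftover  # 252
```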
Recommended answer
Here's a brute-force approach (but a very fast one):
# number of rows of your data.frame (from your example... )
nrows <- 20752
# acceptable range for sub-data.frame size
subSetSizes <- 4000:10000
remainders <- nrows %% subSetSizes
minIndexes <- which(remainders == min(remainders))
chunckSizesHavingMinRemainder <- subSetSizes[minIndexes]
# > chunckSizesHavingMinRemainder
# [1] 5188
# the remainder of 20752 / 5188 is indeed 0 (the only minimum)
# nrows %% 5188
# > [1] 0
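The search above only yields a chunk size. One possible way to carry it through to the actual split is sketched below. The helper names `best_chunk_size` and `split_equal` are hypothetical, not part of the original answer, and the sketch assumes that leftover rows are dropped from the end of the data frame. When several sizes tie for the smallest remainder, `which.min` picks the smallest of them, which gives the most chunks.

```r
# Hypothetical helper: the chunk size in [min_size, max_size] that leaves
# the smallest remainder (the smallest such size if there are ties).
best_chunk_size <- function(n_rows, min_size = 4000, max_size = 10000) {
  sizes <- min_size:max_size
  remainders <- n_rows %% sizes
  sizes[which.min(remainders)]
}

# Hypothetical helper: split a data frame into equal-sized chunks,
# discarding the leftover rows at the end (an assumption of this sketch).
split_equal <- function(df, min_size = 4000, max_size = 10000) {
  size <- best_chunk_size(nrow(df), min_size, max_size)
  n_keep <- (nrow(df) %/% size) * size
  kept <- df[seq_len(n_keep), , drop = FALSE]
  split(kept, ceiling(seq_len(n_keep) / size))
}

df <- data.frame(x = rnorm(20752))
chunks <- split_equal(df)
length(chunks)        # 4
sapply(chunks, nrow)  # 5188 rows in each chunk
```

If the data frame has fewer rows than `min_size`, no full chunk fits and the result is an empty list, so the range should be chosen to suit the data.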