问题描述
我正在尝试读取数十万个 JSON 文件,并最终将它们放入 dplyr 对象中.但是 JSON 文件不是简单的键值解析,它们需要大量的预处理.预处理是编码的,并且对效率有很好的作用.但我面临的挑战是将每条记录有效地加载到单个对象(data.table 或 dplyr 对象)中.
I'm attempting to read in a few hundred-thousand JSON files and eventually get them into a dplyr object. But the JSON files are not simple key-value parse and they require a lot of pre-processing. The preprocessing is coded and does fairly good for efficiency. But the challenge I am having is loading each record into a single object (data.table or dplyr object) efficiently.
这是非常稀疏的数据,我将有 2000 多个变量,这些变量大部分都将丢失.每条记录可能有一百个变量集.变量将是字符、逻辑和数字的混合体,我知道每个变量的模式.
This is very sparse data, I'll have over 2000 variables that will mostly be missing. Each record will have maybe a hundred variables set. The variables will be a mix of character, logical and numeric, I do know the mode of each variable.
我认为避免 R 为每次更新(或一次添加一行)复制对象的最佳方法是创建一个空数据框,然后在从 JSON 文件中提取特定字段后更新它们.但是在数据框中执行此操作非常慢,移动到数据表或 dplyr 对象要好得多,但仍希望将其减少到几分钟而不是几小时.请参阅下面的示例:
I thought the best way to avoid R copying the object for every update (or adding one row at a time) would be to create an empty data frame and then update the specific fields after they are pulled from the JSON file. But doing this in a data frame is extremely slow, moving to data table or dplyr object is much better but still hoping to reduce it to minutes instead of hours. See my example below:
timeMe <- function() {
set.seed(1)
names = paste0("A", seq(1:1200))
# try with a data frame
# outdf <- data.frame(matrix(NA, nrow=100, ncol=1200, dimnames=list(NULL, names)))
# try with data table
outdf <- data.table(matrix(NA, nrow=100, ncol=1200, dimnames=list(NULL, names)))
for(i in seq(100)) {
# generate 100 columns (real data is in json)
sparse.cols <- sample(1200, 100)
# Each record is coming in as a list
# Each column is either a character, logical, or numeric
sparse.val <- lapply(sparse.cols, function(i) {
if(i < 401) { # logical
sample(c(TRUE, FALSE), 1)
} else if (i < 801) { # numeric
sample(seq(10), 1)
} else { # character
sample(LETTERS, 1)
}
}) # now we have a list with values to populate
names(sparse.val) <- paste0("A", sparse.cols)
# and here is the challenge and what takes a long time.
# want to assign the ith row and the named column with each value
for(x in names(sparse.val)) {
val=sparse.val[[x]]
# this is where the bottleneck is.
# for data frame
# outdf[i, x] <- val
# for data table
outdf[i, x:=val]
}
}
outdf
}
我认为每个列的模式可能已在每次更新时设置和重置,但我也通过预先设置每个列类型来尝试这样做,但这没有帮助.
I thought the mode of each column might have been set and reset with each update, but I have also tried this by pre-setting each column type and this didn't help.
对我来说,使用 data.frame(上面已注释掉)运行此示例大约需要 22 秒,转换为 data.table 需要 5 秒.我希望有人知道幕后发生的事情,并能提供一种更快的方法来在这里填充数据表.
For me, running this example with a data.frame (commented out above) takes around 22 seconds, converting to a data.table is 5 seconds. I was hoping someone knew what was going on under the covers and could provide a faster way to populate the data table here.
推荐答案
除了你构造 sparse.val
的部分之外,我遵循你的代码.您分配列的方式存在小错误.在尝试优化时不要忘记检查答案是否正确:).
I follow your code except the part where you construct sparse.val
. There are minor errors in the way you assign columns. Don't forget to check that the answer is right in trying to optimise :).
既然您说您已经知道列的类型,那么预先生成正确的类型很重要.否则,当你这样做: DT[, LHS := RHS]
和 RHS
类型不等于 LHS
时,RHS 将被强制为LHS 的类型.在您的情况下,所有数字和字符值都将转换为逻辑值,因为所有列都是逻辑类型.这不是你想要的.
Since you say that you already know the type of the columns, it's important to generate the correct type up front. Else, when you do: DT[, LHS := RHS]
and RHS
type is not equal to LHS
, RHS will be coerced to the type of LHS. In your case, all your numeric and character values will be converted to logical, as all columns are logical type. This is not what you want.
因此,创建矩阵无济于事(所有列都属于同一类型)+ 它也很慢.相反,我会这样做:
Creating a matrix won't help therefore (all columns will be of the same type) + it's also slow. Instead, I'd do it like this:
rows = 100L
cols = 1200L
outdf <- setDT(lapply(seq_along(cols), function(i) {
if (i < 401L) rep(NA, rows)
else if (i >= 402L & i < 801L) rep(NA_real_, rows)
else rep(NA_character_, rows)
}))
现在我们设置了正确的类型.接下来,我觉得应该是 i >= 402L &我<801L代码>.否则,您将前 401 列分配为逻辑列,然后将前 801 列分配为数字,鉴于您预先知道列的类型,这没有多大意义,对吧?
Now we've the right type set. Next, I think it should be i >= 402L & i < 801L
. Otherwise, you're assigning the first 401 columns as logical and then the first 801 columns as numeric, which, given that you know the type of the columns upfront, doesn't make much sense, right?
行:
names(sparse.val) <- paste0("A", sparse.cols)
会创建一个副本,并不是真正需要的.因此我们将删除这一行.
will create a copy and is not really necessary. Therefore we'll delete this line.
for(x in names(sparse.val)) {
val=sparse.val[[x]]
outdf[i, x:=val]
}
实际上并没有做你认为它正在做的事情.它没有将 val
中的值分配给分配给 x
的名称.相反,它(每次)(过度)写入名为 x
的列.检查你的输出.
is not actually doing what you think it's doing. It's not assigning the values from val
to the name assigned to x
. Instead it's (over)writing (each time) to a column named x
. Check your output.
这不是优化的一部分.这只是为了让您知道您在这里真正想要做什么.
This is not a part of optimisation. This is just to let you know what you're actually wanting to do here.
for(x in names(sparse.val)) {
val=sparse.val[[x]]
outdf[i, (x) := val]
}
注意 x
周围的 (
.现在,它将被评估,并且 x
中包含的值将是 x
所在的列code>val 的值将被分配给.这有点微妙,我理解.但是,这是必要的,因为它允许将列 x
创建为 DT[, x := val]
您实际希望将 val
分配给 x
的位置.
Note the (
around x
. Now, it'll be evaluated and the value contained in x
will be the column to which val
's value will be assigned to. It's a bit subtle, I understand. But, this is necessary because it allows for the possibility to create column x
as DT[, x := val]
where you actually want val
to be assigned to x
.
回到优化,好消息是,您的 for 循环耗时很简单:
Coming back to the optimisation, the good news is, your time consuming for-loop is simply:
set(outdf, i=i, j=paste0("A", sparse.cols), value = sparse.val)
这就是 data.table
的 sub-assign by reference 功能派上用场的地方!
This is where data.table
's sub-assign by reference feature comes in handy!
你的最终函数如下所示:
Your final function looks like this:
timeMe2 <- function() {
set.seed(1L)
rows = 100L
cols = 1200L
outdf <- as.data.table(lapply(seq_len(cols), function(i) {
if (i < 401L) rep(NA, rows)
else if (i >= 402L & i < 801L) rep(NA_real_, rows)
else sample(rep(NA_character_, rows))
}))
setnames(outdf, paste0("A", seq(1:1200)))
for(i in seq(100)) {
sparse.cols <- sample(1200L, 100L)
sparse.val <- lapply(sparse.cols, function(i) {
if(i < 401L) sample(c(TRUE, FALSE), 1)
else if (i >= 402 & i < 801L) sample(seq(10), 1)
else sample(LETTERS, 1)
})
set(outdf, i=i, j=paste0("A", sparse.cols), value = sparse.val)
}
outdf
}
通过这样做,您的解决方案在我的系统上需要 9.84 秒,而上面的函数需要 0.34 秒,这是大约 29 倍的改进.我认为这是您正在寻找的结果.请验证.
By doing this, your solution takes 9.84 seconds on my system whereas the function above takes 0.34 seconds, which is ~29x improvement. I think this is the result you're looking for. Please verify it.
HTH
这篇关于稀疏数据帧直接更新(替换)速度慢,效率低的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!