Question
These days I've been working with a data.frame of 8M records, and I need to improve a loop that analyzes this data.
I will describe each step of the problem I am trying to solve.

First, I have to arrange the whole data.frame in ascending order by three fields: ClientID, Date and Time. (Check)

Then, using that arranged data.frame, I must compute the differences between consecutive observations, which should only be done when the ClientID is the same. For example:
ClientID|Date(YMD)|Time(HMS)
A|20120101|110000
A|20120101|111500
A|20120101|120000
B|20120202|010000
B|20120202|012030
According to the data above, the result I want to obtain is the following:
ClientID|Date(YMD)|Time(HMS)|Difference(minutes)
A|20120101|110000|0.00
A|20120101|111500|15.00
A|20120101|120000|45.00
B|20120202|010000|0
B|20120202|012030|20.30
The problem is that analyzing all of this with a data.frame of 8M observations takes about 3 days, so I would like to parallelize the process. My idea is to segment the data.frame into clusters, but in order rather than randomly, and then use the foreach package (or another library) to distribute the per-cluster analysis across the available cores. For example:
Cluster|ClientID|Date(YMD)|Time(HMS)
CORE 1|
1|A|20120101|110000
1|A|20120101|111500
1|A|20120101|120000
CORE 2|
2|B|20120202|010000
2|B|20120202|012030
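The chunking idea described above could be sketched with foreach/doParallel along these lines. This is only a hedged illustration of the asker's plan, using a toy data.frame in the question's format; the column names and the 2-worker cluster are assumptions for the example:

```r
library(foreach)
library(doParallel)

# Toy data in the question's format (Date as YYYYMMDD, Time as HHMMSS),
# already sorted by ClientID, Date, Time
df <- data.frame(ClientID = c("A", "A", "A", "B", "B"),
                 Date     = c("20120101", "20120101", "20120101", "20120202", "20120202"),
                 Time     = c("110000", "111500", "120000", "010000", "012030"),
                 stringsAsFactors = FALSE)

cl <- makeCluster(2)
registerDoParallel(cl)

# One chunk per ClientID; within each chunk the original row order is
# preserved, so differences never cross client boundaries
chunks <- split(df, df$ClientID)
result <- foreach(chunk = chunks, .combine = rbind) %dopar% {
  t <- as.POSIXct(paste(chunk$Date, chunk$Time),
                  format = "%Y%m%d %H%M%S", tz = "UTC")
  chunk$Difference <- c(0, as.numeric(diff(t), units = "mins"))
  chunk
}
stopCluster(cl)
print(result)
```

Note this yields decimal minutes (20.5 rather than the 20.30 min.sec notation used in the question). As the answer below shows, though, a vectorized grouped operation makes the parallel machinery unnecessary at this scale.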
Accepted Answer
I wouldn't recommend trying to parallelize this. Using the data.table package and working with times stored in integer format, this should take a fairly trivial amount of time.
library(data.table)
## Generate Data
RowCount <- 8e6
GroupCount <- 1e4
DT <- data.table(ClientID = paste0("Client ", sample.int(GroupCount, size = RowCount, replace = TRUE)),
                 Time = sample.int(12, size = RowCount, replace = TRUE) * 900)
DT[, Time := cumsum(Time), keyby = .(ClientID)]
DT[, Time := as.POSIXct(Time, tz = "UTC", origin = "1970-01-01 00:00:00")]
print(DT)
which gives:
ClientID Time
1: Client 1 1970-01-01 02:30:00
2: Client 1 1970-01-01 04:00:00
3: Client 1 1970-01-01 05:30:00
4: Client 1 1970-01-01 07:00:00
5: Client 1 1970-01-01 10:00:00
---
7999996: Client 9999 1970-02-20 18:15:00
7999997: Client 9999 1970-02-20 18:30:00
7999998: Client 9999 1970-02-20 21:00:00
7999999: Client 9999 1970-02-20 22:45:00
8000000: Client 9999 1970-02-21 00:30:00
Calculate time differences:
system.time({
## Create an integer column that stores time as the number of seconds since midnight on 1970-01-01
DT[,Time_Unix := as.integer(Time)]
## Order by ClientID then Time_Unix
setkey(DT, ClientID, Time_Unix)
## Calculate Elapsed Time in minutes between rows, grouped by ClientID
DT[, Elapsed_Minutes := (Time_Unix - shift(Time_Unix, n = 1L, type = "lag", fill = NA))/60L, keyby = .(ClientID)]
## Clean up the integer time
DT[,Time_Unix := NULL]
})
...
user system elapsed
0.416 0.025 0.442
Results:
print(DT)
...
ClientID Time Elapsed_Minutes
1: Client 1 1970-01-01 02:30:00 NA
2: Client 1 1970-01-01 04:00:00 90
3: Client 1 1970-01-01 05:30:00 90
4: Client 1 1970-01-01 07:00:00 90
5: Client 1 1970-01-01 10:00:00 180
---
7999996: Client 9999 1970-02-20 18:15:00 135
7999997: Client 9999 1970-02-20 18:30:00 15
7999998: Client 9999 1970-02-20 21:00:00 150
7999999: Client 9999 1970-02-20 22:45:00 105
8000000: Client 9999 1970-02-21 00:30:00 105
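As a side note, the integer conversion is a speed optimization; the same differences can be computed directly on the POSIXct column, since data.table's shift() works on POSIXct and the subtraction yields a difftime. A minimal sketch on a small sample shaped like DT above:

```r
library(data.table)

# Small sample in the same shape as the generated DT
DT <- data.table(ClientID = c("A", "A", "A"),
                 Time = as.POSIXct(c("1970-01-01 02:30:00",
                                     "1970-01-01 04:00:00",
                                     "1970-01-01 05:30:00"), tz = "UTC"))
setkey(DT, ClientID, Time)

# Lagged subtraction per ClientID; as.numeric(..., units = "mins")
# converts the resulting difftime to minutes
DT[, Elapsed_Minutes := as.numeric(Time - shift(Time), units = "mins"),
   keyby = .(ClientID)]
print(DT)
```

The first row per client is NA (no prior observation), matching the integer-based version in the answer.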