Problem description
I built a script that works great with small data sets (< 1 M rows) and performs very poorly with large data sets. I've heard of data.table being more performant than tibbles. I'm interested to know about other speed optimizations in addition to learning about data.table.
I'll share a couple of commands from the script as examples. In each of the examples, the data sets are 10 to 15 million rows and 10 to 15 columns.
- Getting the minimum date of a data frame grouped by 9 variables
dataframe %>%
  group_by(key_a, key_b, key_c,
           key_d, key_e, key_f,
           key_g, key_h, key_i) %>%
  summarize(min_date = min(date)) %>%
  ungroup()
- Doing a left join on two data frames to add extra columns
merge(dataframe,
      dataframe_two,
      by = c("key_a", "key_b", "key_c",
             "key_d", "key_e", "key_f",
             "key_g", "key_h", "key_i"),
      all.x = T) %>%
  as_tibble()
- Joining two data frames on the closest date
dataframe %>%
  left_join(dataframe_two,
            by = "key_a") %>%
  group_by(key_a, date.x) %>%
  summarise(key_z = key_z[which.min(abs(date.x - date.y))]) %>%
  arrange(date.x) %>%
  rename(day = date.x)
What best practices can I apply and, in particular, what can I do to optimize these types of functions for large data sets?
Here is the sample data set:
set.seed(1010)
library("conflicted")
library("dplyr")
library("lubridate")
conflict_prefer("days", "lubridate")
bigint <- rep(
sample(1238794320934:19082323109, 1*10^7)
)
key_a <-
rep(c("green", "blue", "orange"), 1*10^7/2)
key_b <-
rep(c("yellow", "purple", "red"), 1*10^7/2)
key_c <-
rep(c("hazel", "pink", "lilac"), 1*10^7/2)
key_d <-
rep(c("A", "B", "C"), 1*10^7/2)
key_e <-
rep(c("D", "E", "F", "G", "H", "I"), 1*10^7/5)
key_f <-
rep(c("Z", "M", "Q", "T", "X", "B"), 1*10^7/5)
key_g <-
rep(c("Z", "M", "Q", "T", "X", "B"), 1*10^7/5)
key_h <-
rep(c("tree", "plant", "animal", "forest"), 1*10^7/3)
key_i <-
rep(c("up", "up", "left", "left", "right", "right"), 1*10^7/5)
sequence <-
seq(ymd("2010-01-01"), ymd("2020-01-01"), by = "1 day")
date_sequence <-
rep(sequence, 1*10^7/(length(sequence) - 1))
dataframe <-
data.frame(
bigint,
date = date_sequence[1:(1*10^7)],
key_a = key_a[1:(1*10^7)],
key_b = key_b[1:(1*10^7)],
key_c = key_c[1:(1*10^7)],
key_d = key_d[1:(1*10^7)],
key_e = key_e[1:(1*10^7)],
key_f = key_f[1:(1*10^7)],
key_g = key_g[1:(1*10^7)],
key_h = key_h[1:(1*10^7)],
key_i = key_i[1:(1*10^7)]
)
dataframe_two <-
dataframe %>%
mutate(date_sequence = ymd(date_sequence) + days(1))
sequence_sixdays <-
seq(ymd("2010-01-01"), ymd("2020-01-01"), by = "6 days")
date_sequence <-
rep(sequence_sixdays, 3*10^6/(length(sequence_sixdays) - 1))
key_z <-
sample(1:10000000, 3*10^6)
dataframe_three <-
data.frame(
key_a = sample(key_a, 3*10^6),
date = date_sequence[1:(3*10^6)],
key_z = key_z[1:(3*10^6)]
)
Answer
Use the data.table package.
library(data.table)
d1 = as.data.table(dataframe)
d2 = as.data.table(dataframe_two)
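As a side note, as.data.table() copies its input. At 10-15 million rows you can avoid that copy with setDT(), which converts a data.frame to a data.table by reference; a minimal sketch (after this, the original data.frame objects themselves become data.tables):
# setDT() converts by reference, so no second copy of the 10M+ rows is made;
# `dataframe` itself is a data.table afterwards.
setDT(dataframe)
setDT(dataframe_two)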
1.
Grouping by many columns is something data.table is excellent at; see the bar chart at the very bottom of the second plot for a comparison against dplyr, Spark and others for exactly this kind of grouping.
by_cols = paste("key", c("a","b","c","d","e","f","g","h","i"), sep="_")
a1 = d1[, .(min_date = min(date_sequence)), by=by_cols]
Note that I changed date to date_sequence; I think you meant that as the column name.
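If later steps group or join on the same nine columns again, a sketch of keying the table once so that subsequent keyed operations can reuse the sort order instead of re-sorting (setkeyv() reorders the table by reference):
# Physically sort d1 by the nine key columns; keyed grouping and joins on
# these columns can then use binary search rather than sorting each time.
setkeyv(d1, by_cols)
key(d1)  # check which columns the table is currently keyed by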
2.
It is unclear which fields you want to merge the tables on; dataframe_two does not have the specified fields, so the query is invalid. Please clarify.
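Assuming dataframe_two really does carry the same nine key columns, as the merge() call in the question implies, a sketch of that left join in data.table syntax (by_cols is the vector of key names defined above):
# X[Y] keeps every row of Y, so d2[d1, ...] is a left join onto d1,
# adding the non-key columns of d2 where the nine keys match.
a2 = d2[d1, on = by_cols]
# equivalent using the data.table method of merge():
# a2 = merge(d1, d2, by = by_cols, all.x = TRUE)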
3.
data.table has a very useful type of join called a rolling join, which does exactly what you need.
a3 = d2[d1, on=c("key_a","date_sequence"), roll="nearest"]
# Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
#   Join results in more than 2^31 rows (internal vecseq reached physical limit).
#   Very likely misspecified join. Check for duplicate key values in i each of which
#   join to the same group in x over and over again. If that's ok, try by=.EACHI to
#   run j for each group to avoid the large allocation. Otherwise, please search for
#   this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker
#   for advice.
This results in an error, and the error is in fact very useful. On your real data it may work perfectly fine, because the reason behind the error (the cardinality of the matching rows) may be related to the process used to generate the sample data; it is very tricky to produce good dummy data for joins. If you get the same error on your real data, you may want to review the design of that query, because it causes a row explosion by doing a many-to-many join, even after considering only a single date_sequence identity (taking roll into account). Strictly speaking, the cardinalities of the join fields do not look valid for this data. You may want to introduce a data-quality-check layer into your workflow to ensure there are no duplicates on key_a and date_sequence combined, as sketched below.
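A sketch of such a check, assuming d2 is the table that should be unique per key_a and date_sequence before the rolling join:
# Count rows per (key_a, date_sequence); any group with N > 1 is a duplicated
# key combination that would multiply rows in the join.
dup_keys = d2[, .N, by = .(key_a, date_sequence)][N > 1L]
nrow(dup_keys)  # 0 means the key combination is unique
# the same information as a single logical:
uniqueN(d2, by = c("key_a", "date_sequence")) == nrow(d2)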