问题描述
我有两个 data.frames,每个都有一个时间序列.我的目标是使用 df2 的时间序列来标记 df1 中最近的时间戳.df2 中的每个时间戳只能在 df1 中标记一个时间戳!
I have two data.frames each with a time series. My goal ist to use the timeseries of df2 to mark the closest timestamp in df1. Each timestamp in df2 should only mark ONE timestamp in df1!
dates1 <- as.POSIXct(c("2015-10-26 12:00:00","2015-10-26 13:00:00","2015-10-26 14:00:00"))
values1 <- c("a","b","c")
dates2 <- as.POSIXct(c("2015-10-26 12:05:00","2015-10-26 13:55:00"))
values2 <- c("A","C")
df1 <- data.frame(dates1, values1)
df2 <- data.frame(dates2, values2)
预期结果:
dates2 values2 values1
1: 2015-10-26 12:00:00 A a
2: 2015-10-26 13:00:00 NA b
3: 2015-10-26 14:00:00 C c
为了实现这一点,我将 data.frames 转换为 data.tables 并使用最近的"滚动连接:
In order to achieve this, I'm converting the data.frames to data.tables and using the rolling joing "nearest" like so:
dt1 <- data.table(df1)
dt2 <- data.table(df2)
setkey(dt1,"dates1")
setkey(dt2,"dates2")
dt3 <- dt2[dt1,roll = "nearest"]
dates2 values2 values1
1: 2015-10-26 12:00:00 A a
2: 2015-10-26 13:00:00 A b
3: 2015-10-26 14:00:00 C c
values2 "A" 使用了两次,一次用于 12:00 时间戳,一次用于 13:00.我希望每个 value2 只使用一次,并参考 data.table 手册我希望使用选项 mult = "first" 来解决这个问题.
The values2 "A" is used twice, once for the 12:00 timestamp and once for 13:00. I want each value2 to be used only once, and consulting the data.table manual I would expect to solve this problem with using the option mult = "first".
dt3 <- dt2[dt1,roll = "nearest", mult = "first"]
这会导致相同的输出,A"被使用了两次.我的错误在哪里?
This results in the same output, "A" is used twice. Where is my mistake?
推荐答案
当运行 dt2[dt1, roll = "nearest"]
时,您基本上是在说从 dt2
根据使用键在dt1
中与每一行的最近连接.所以
When running dt2[dt1, roll = "nearest"]
you are basically saying "return the rows from dt2
according to the nearest join to each row in dt1
using the key. So
dates2
in row one indt2
离dates1
在 one 行最近dt1
中的强>dates2
in one indt2
与dates1
在 two 行最近dt1
中的强>dates2
in row two indt2
最接近dates1
行 threedt1
中的强>
dates2
in row one indt2
is the nearest todates1
in row one indt1
dates2
in row one indt2
is the nearest todates1
in row two indt1
dates2
in row two indt2
is the nearest todates1
in row three indt1
因此,
dt2[dt1, roll = "nearest"]
# dates2 values2 values1
# 1: 2015-10-26 12:00:00 A a
# 2: 2015-10-26 13:00:00 A b
# 3: 2015-10-26 14:00:00 C c
所有来自dt1
的行与来自dt2
的连接values2
.
Which are all the rows from dt1
with the joined values2
from dt2
.
相反,我们想以相反的方式加入,即根据每一行根据最近的联接从dt2
中提取values2
"strong> 在 dt2
中使用键更新 dt1
" 中匹配的行,即
Instead, we want to join the other way around, namely "extract values2
from dt2
according to the nearest join by each row in dt2
using the key and update the matched rows in dt1
", namely
dt1[dt2, roll = "nearest", values2 := i.values2]
dt1
# dates1 values1 values2
# 1: 2015-10-26 12:00:00 a A
# 2: 2015-10-26 13:00:00 b NA
# 3: 2015-10-26 14:00:00 c C
一些附加说明
Some additional notes
- 你不需要先包装到
data.frame
然后再到data.table
,你可以只做dt1 <- data.table(dates1, values1)
等 - 当你在做的时候,你已经可以使用
key
参数data.table
即时设置密钥,即dt1 <- data.table(date1, values1, key = "dates1")
等 - 或者你可以一起跳过设置键并使用
on
代替(V 1.9.6+),即dt1[dt2, roll = "nearest", values2 := i.values2, on = c(dates1 = "dates2")]
- 最后,请不要制作不必要的副本,例如,不要使用
<-
和data.table(df)
使用:=
和setDT(df)
,请参阅此处 了解更多信息
- You don't need to wrap first to
data.frame
and then todata.table
, you can just dodt1 <- data.table(dates1, values1)
and etc. - While you at it, you can already set the key on the fly using
key
parameterdata.table
, namelydt1 <- data.table(dates1, values1, key = "dates1")
and etc. - Or you can skip setting keys all together and use
on
instead (V 1.9.6+), namelydt1[dt2, roll = "nearest", values2 := i.values2, on = c(dates1 = "dates2")]
- Finally, please refrain from making unnecessary copies, e.g., instead of
<-
anddata.table(df)
use:=
andsetDT(df)
, see here for more information
这篇关于R data.table 滚动连接“mult"没有按预期工作的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!