Problem description
I have a data.table DT with the current (F0YR) and the next (F1YR) fiscal year-end (FYE) encoded as integers. Since every next FYE will eventually become a current FYE, the same integer will appear in both column F1YR and column F0YR. Also, my data contains monthly observations, so the same FYE will appear in the data set multiple times:
library(data.table)
DT <- data.table(ID = rep(c("A", "B"), each=9),
MONTH = rep(100L:108L, times=2),
F0YR = rep(c(1L, 4L, 7L), each=3, times=2),
F1YR = rep(c(4L, 7L, 9L), each=3, times=2),
value = c(rep(1:5, each=3), 6, 6, 7),
key = "ID,F0YR")
DT
ID MONTH F0YR F1YR value
[1,] A 100 1 4 1
[2,] A 101 1 4 1
[3,] A 102 1 4 1
[4,] A 103 4 7 2
[5,] A 104 4 7 2
[6,] A 105 4 7 2
[7,] A 106 7 9 3
[8,] A 107 7 9 3
[9,] A 108 7 9 3
[10,] B 100 1 4 4
[11,] B 101 1 4 4
...
What I want to do
For every ID and F1YR combination, I want to get the value for the matching ID and F0YR combination. As an example: Company A had a value of 2 for F0YR==4. Now, I want an additional column that is set to 2 for all rows with ID=="A" and F1YR==4, next to the already existing value of 1.
intDT <- DT[CJ(unique(ID), unique(F0YR)), list(ID, F0YR, valueNew = value), mult="last"]
setkey(intDT, ID, F0YR)
setkey(DT, ID, F1YR)
DT <- intDT[DT]
setnames(DT, c("F0YR.1", "F0YR"), c("F0YR", "F1YR"))
DT
ID F1YR valueNew MONTH F0YR value
[1,] A 4 2 100 1 1
[2,] A 4 2 101 1 1
[3,] A 4 2 102 1 1
[4,] A 7 3 103 4 2
[5,] A 7 3 104 4 2
[6,] A 7 3 105 4 2
[7,] A 9 NA 106 7 3
[8,] A 9 NA 107 7 3
[9,] A 9 NA 108 7 3
[10,] B 4 5 100 1 4
[11,] B 4 5 101 1 4
...
(Note that I use mult="last" in the join above.)
This looks improvable. First of all, I have to make a copy of my DT. Second, since I am basically joining the data.table to itself, the column names clash and I have to rename them. I thought that a self join would be the way forward, but I tried and tried and couldn't get a nice solution. I hope there is something easy out there which I just don't see... Does anyone have a clue? Or is my data set up in such a way that it is actually hard (maybe because I have monthly observations, but want to join only quarterly or yearly changing values)?
Recommended answer
In use cases like this, the mantra "aggregate first, then join with that" often helps. So, starting with your DT, and using v1.8.1:
> agg = DT[,last(value),by=list(ID,F0YR)]
> agg
ID F0YR V1
1: A 1 1
2: A 4 2
3: A 7 3
4: B 1 4
5: B 4 5
6: B 7 7
I called it agg because I couldn't think of a better name. In this case you wanted last, which isn't really an aggregate as such, but you know what I mean.
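Whether last matters here can be checked directly. The sketch below rebuilds the example data from the question and counts the distinct values within each (ID, F0YR) group. The column names nDistinct and lastValue are illustrative only, and uniqueN() assumes a data.table version newer than the v1.8.1 used in this answer.

```r
library(data.table)

# Same example data as in the question.
DT <- data.table(ID    = rep(c("A", "B"), each = 9),
                 MONTH = rep(100L:108L, times = 2),
                 F0YR  = rep(c(1L, 4L, 7L), each = 3, times = 2),
                 F1YR  = rep(c(4L, 7L, 9L), each = 3, times = 2),
                 value = c(rep(1:5, each = 3), 6, 6, 7))

# Per (ID, F0YR): how many distinct values are there, and which one is last?
# nDistinct and lastValue are illustrative names, not part of the original answer.
chk <- DT[, .(nDistinct = uniqueN(value), lastValue = last(value)),
          by = .(ID, F0YR)]

# Groups in which value changes within the fiscal year.
varying <- chk[nDistinct > 1]
print(varying)
```

In the example data only ID "B" with F0YR 7 has more than one value (6, 6, 7), so that is the one group where picking the last observation rather than, say, the first changes the result.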
Then update DT by reference by group. Here we're grouping by i.
setkey(DT,ID,F1YR)
DT[agg,newcol:=V1]
ID MONTH F0YR F1YR value newcol
1: A 100 1 4 1 2
2: A 101 1 4 1 2
3: A 102 1 4 1 2
4: A 103 4 7 2 3
5: A 104 4 7 2 3
6: A 105 4 7 2 3
7: A 106 7 9 3 NA
8: A 107 7 9 3 NA
9: A 108 7 9 3 NA
10: B 100 1 4 4 5
11: B 101 1 4 4 5
12: B 102 1 4 4 5
13: B 103 4 7 5 7
14: B 104 4 7 5 7
15: B 105 4 7 5 7
16: B 106 7 9 6 NA
17: B 107 7 9 6 NA
18: B 108 7 9 7 NA
Is that right? Not sure I fully followed. Those ops should be very fast, without any copies, and should scale to large data. At least, that's the intention.
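For readers on a more recent data.table, the same "aggregate first, then join" idea can probably be written without changing the key, using the on= argument (available from v1.9.6 onwards, so not in the v1.8.1 used above). This is a minimal sketch rather than the original answer's code; lastValue is an illustrative column name.

```r
library(data.table)

# Same example data as in the question (no key needed for this variant).
DT <- data.table(ID    = rep(c("A", "B"), each = 9),
                 MONTH = rep(100L:108L, times = 2),
                 F0YR  = rep(c(1L, 4L, 7L), each = 3, times = 2),
                 F1YR  = rep(c(4L, 7L, 9L), each = 3, times = 2),
                 value = c(rep(1:5, each = 3), 6, 6, 7))

# Aggregate first: one row per (ID, F0YR) holding the last value.
agg <- DT[, .(lastValue = last(value)), by = .(ID, F0YR)]

# Then join: match DT's F1YR against agg's F0YR and add the column by reference.
# The i. prefix refers to a column of agg; rows of DT without a match get NA.
DT[agg, on = .(ID, F1YR = F0YR), newcol := i.lastValue]
print(DT)
```

Because the join columns are named explicitly, neither setkey() nor setnames() is needed, and DT keeps its original row order.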