在下面的示例中,我有一个用户表和一个事务表,其中一个用户可以有0个,1个或多个事务。我在用户表上使用mult='first'
执行join + update,以尝试插入一列,该列指示每个用户的首次发生交易的日期。
library(data.table) # v1.10.4
# Download data
users <- fread("https://raw.githubusercontent.com/ben519/DataWrangling/master/Data/users.csv")
transactions <- transactions <- fread("https://raw.githubusercontent.com/ben519/DataWrangling/master/Data/transactions.csv")
# Convert date columns to Date type
users[, `:=`(Registered = as.Date(Registered), Cancelled = as.Date(Cancelled))]
transactions[, TransactionDate := as.Date(TransactionDate)]
users
UserID User Gender Registered Cancelled FirstTransactionDate
1: 1 Charles male 2012-12-21 <NA> 2012-08-26
2: 2 Pedro male 2010-08-01 2010-08-08 2013-12-23
3: 3 Caroline female 2012-10-23 2016-06-07 2016-05-08
4: 4 Brielle female 2013-07-17 <NA> <NA>
5: 5 Benjamin male 2010-11-25 <NA> <NA>
transactions
TransactionID TransactionDate UserID ProductID Quantity
1: 1 2010-08-21 7 2 1
2: 2 2011-05-26 3 4 1
3: 3 2011-06-16 3 3 1
4: 4 2012-08-26 1 2 3
5: 5 2013-06-06 2 4 1
6: 6 2013-12-23 2 5 6
7: 7 2013-12-30 3 4 1
8: 8 2014-04-24 NA 2 3
9: 9 2015-04-24 7 4 3
10: 10 2016-05-08 3 4 4
##### For each user, insert the TransactionDate of the first matching row
users[transactions, FirstTransactionDate := i.TransactionDate, on="UserID", mult="first"]
# Unexpected result
users[UserID == 2]
UserID User Gender Registered Cancelled FirstTransactionDate
1: 2 Pedro male 2010-08-01 2010-08-08 2013-12-23 # <- shouldn't this be 2013-06-06?
当事务表中较早的事务绑定到该用户时,为什么为用户2设置FirstTransactionDate
2013-12-23
?这是一个错误吗? 最佳答案
仔细阅读data.table
的mult
文档,它说:
当i是列表(或data.frame或data.table)且x中有多行时
匹配到i中的行,并返回多个控件:“全部”
(默认),“第一个”或“最后一个”。
因此,如果X(users
)中有多行与i(transactions
)匹配,则mult
将返回X中的第一行。但是,在您的情况下,X中没有多行与i匹配,而是i中有多个与X匹配的行。
正如@Arun所建议的那样,最好的选择是更改您的周围,以使mult = "first"
相关:
users[, FirstTransactionDate := transactions[users, TransactionDate, on="UserID", mult = "first"]]
users
# UserID User Gender Registered Cancelled FirstTransactionDate
#1: 1 Charles male 2012-12-21 <NA> 2012-08-26
#2: 2 Pedro male 2010-08-01 2010-08-08 2013-06-06
#3: 3 Caroline female 2012-10-23 2016-06-07 2011-05-26
#4: 4 Brielle female 2013-07-17 <NA> <NA>
#5: 5 Benjamin male 2010-11-25 <NA> <NA>
另一种选择是稍微更改合并:
users[transactions[,FirstTransactionDate := min(TransactionDate), by = UserID],
FirstTransactionDate := FirstTransactionDate, on="UserID"]
我只是在
transactions
数据集中创建第一个交易日期。可以多次合并,但这应该没问题,因为UserID
的值始终相同。