在下面的示例中,我有一个用户表和一个事务表,其中一个用户可以有0个,1个或多个事务。我在用户表上使用mult='first'执行join + update,以尝试插入一列,该列指示每个用户的首次发生交易的日期。

library(data.table)  # v1.10.4

# Download data
users <- fread("https://raw.githubusercontent.com/ben519/DataWrangling/master/Data/users.csv")
transactions <- transactions <- fread("https://raw.githubusercontent.com/ben519/DataWrangling/master/Data/transactions.csv")

# Convert date columns to Date type
    users[, `:=`(Registered = as.Date(Registered), Cancelled = as.Date(Cancelled))]
    transactions[, TransactionDate := as.Date(TransactionDate)]

users
   UserID     User Gender Registered  Cancelled FirstTransactionDate
1:      1  Charles   male 2012-12-21       <NA>           2012-08-26
2:      2    Pedro   male 2010-08-01 2010-08-08           2013-12-23
3:      3 Caroline female 2012-10-23 2016-06-07           2016-05-08
4:      4  Brielle female 2013-07-17       <NA>                 <NA>
5:      5 Benjamin   male 2010-11-25       <NA>                 <NA>

transactions
    TransactionID TransactionDate UserID ProductID Quantity
 1:             1      2010-08-21      7         2        1
 2:             2      2011-05-26      3         4        1
 3:             3      2011-06-16      3         3        1
 4:             4      2012-08-26      1         2        3
 5:             5      2013-06-06      2         4        1
 6:             6      2013-12-23      2         5        6
 7:             7      2013-12-30      3         4        1
 8:             8      2014-04-24     NA         2        3
 9:             9      2015-04-24      7         4        3
10:            10      2016-05-08      3         4        4

##### For each user, insert the TransactionDate of the first matching row
users[transactions, FirstTransactionDate := i.TransactionDate, on="UserID", mult="first"]

# Unexpected result
users[UserID == 2]
   UserID  User Gender Registered  Cancelled FirstTransactionDate
1:      2 Pedro   male 2010-08-01 2010-08-08           2013-12-23  # <- shouldn't this be 2013-06-06?


当事务表中较早的事务绑定到该用户时,为什么为用户2设置FirstTransactionDate 2013-12-23?这是一个错误吗?

最佳答案

仔细阅读data.tablemult文档,它说:


当i是列表(或data.frame或data.table)且x中有多行时
匹配到i中的行,并返回多个控件:“全部”
(默认),“第一个”或“最后一个”。


因此,如果X(users)中有多行与i(transactions)匹配,则mult将返回X中的第一行。但是,在您的情况下,X中没有多行与i匹配,而是i中有多个与X匹配的行。

正如@Arun所建议的那样,最好的选择是更改您的周围,以使mult = "first"相关:

users[, FirstTransactionDate := transactions[users, TransactionDate, on="UserID", mult = "first"]]

users
#   UserID     User Gender Registered  Cancelled FirstTransactionDate
#1:      1  Charles   male 2012-12-21       <NA>           2012-08-26
#2:      2    Pedro   male 2010-08-01 2010-08-08           2013-06-06
#3:      3 Caroline female 2012-10-23 2016-06-07           2011-05-26
#4:      4  Brielle female 2013-07-17       <NA>                 <NA>
#5:      5 Benjamin   male 2010-11-25       <NA>                 <NA>


另一种选择是稍微更改合并:

users[transactions[,FirstTransactionDate := min(TransactionDate), by = UserID],
      FirstTransactionDate := FirstTransactionDate, on="UserID"]


我只是在transactions数据集中创建第一个交易日期。可以多次合并,但这应该没问题,因为UserID的值始终相同。

10-08 08:09