r - Counting products bought together

I have a DF like this:

Now I need a way to do something like this in R:

finding products that customers bought together

My output should look like this:

ProductX ProductY Times
1        2        2
2        3        1
3        4        1

A matrix output would also be nice, for example:

      [,1] [,2] [,3] [,4]
[1,]    1    2    0    0
[2,]    2    1    1    0
[3,]    0    0    1    1
[4,]    0    0    1    1

I have already tried the reshape2 package, but I don't know how to get the output into this form.

Best answer

Here is an answer using data.table and a non-equi self-join. Edit: added allow.cartesian = TRUE so that it also works on large data sets.

library(data.table)

dt[dt
    , on = .(ID = ID, Product < Product)
    , .(ProductX = x.Product, ProductY = i.Product)
    , nomatch = 0L
    , allow.cartesian = TRUE
   ][, .N, by = .(ProductX, ProductY)]

   ProductX ProductY N
1:        1        2 2
2:        2        3 1
3:        2        4 1
4:        3        4 1
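
To see why the join condition uses `<` rather than just matching on ID, here is a minimal base R sketch of the intermediate step. It rebuilds the sample data from the Data section and uses `merge()` in place of the data.table join, so it illustrates the idea rather than the exact code above; the names `all_pairs` and `unique_pairs` are made up for this example.

```r
ID <- c(1, 1, 2, 2, 2, 3, 3, 4)
Product <- c(1, 2, 2, 3, 4, 1, 2, 3)
df <- data.frame(ID, Product)

# Plain self-join on ID: every product is paired with every product
# bought under the same ID, including (a, a) and both (a, b) and (b, a).
all_pairs <- merge(df, df, by = "ID")
nrow(all_pairs)     # 18 = 2^2 + 3^2 + 2^2 + 1^2

# Keeping only Product.x < Product.y drops self-pairs and mirrored
# duplicates, so each unordered pair is left once per ID.
unique_pairs <- all_pairs[all_pairs$Product.x < all_pairs$Product.y, ]
nrow(unique_pairs)  # 5
```

Counting the rows of `unique_pairs` by (Product.x, Product.y) then gives the four pair counts shown above.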

The dplyr equivalent:

library(dplyr)

inner_join(tib, tib, by = 'ID')%>%
  filter(Product.x < Product.y)%>%
  count(Product.x, Product.y)

  Product.x Product.y     n
      <dbl>     <dbl> <int>
1         1         2     2
2         2         3     1
3         2         4     1
4         3         4     1

Here is a base R version as well:

aggregate(ID ~ Product.x + Product.y
          , data = merge(df, df, by = 'ID')
          , subset = Product.x < Product.y
          , FUN = length)

# Need to change the names from ID to n
  Product.x Product.y ID
1         1         2  2
2         2         3  1
3         2         4  1
4         3         4  1
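
The question also asks for a matrix form, which the approaches above do not produce. One possible sketch in base R is the cross-product of an ID-by-Product incidence table. This is an addition, not part of the original answer, and it assumes each (ID, Product) combination appears at most once in the data; the names `incidence` and `co_occurrence` are made up for this example.

```r
ID <- c(1, 1, 2, 2, 2, 3, 3, 4)
Product <- c(1, 2, 2, 3, 4, 1, 2, 3)

# ID-by-Product incidence table: 1 if that ID bought that product, else 0.
incidence <- table(ID, Product)

# crossprod(m) is t(m) %*% m, so entry [i, j] counts the IDs
# that bought both product i and product j.
co_occurrence <- crossprod(incidence)
co_occurrence
```

The off-diagonal entries agree with the pair counts above (for example, products 1 and 2 appear together twice), while the diagonal holds the number of IDs that bought each product.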

Performance:

Unit: milliseconds
             expr     min       lq      mean  median       uq      max neval
           dt_way  3.9149  4.29330  4.593209  4.6597  4.80210   6.2326   100
 dplyr_inner_join  1.8218  1.91510  2.058864  2.0572  2.16205   3.0157   100
      dplyr_tidyr 13.8107 14.15735 16.020262 14.3571 14.78975 127.9654   100
         base_agg  2.3393  2.51215  2.586652  2.5804  2.63865   3.4415   100

n_IDs <- 1E3
n_Sims <- 1E5
ID_big <- sample(1:n_IDs, n_Sims, replace = TRUE)
Product_big <- sample(1:n_Sims, n_Sims, replace = TRUE)

Unit: seconds
             expr      min       lq      mean    median        uq       max neval
           dt_way 1.633111 1.904460  1.998192  1.986452  2.110937  2.308671    10
 dplyr_inner_join 5.606322 6.361026  6.574015  6.606423  6.839273  7.198770    10
      dplyr_tidyr 8.385418 9.350730 10.127512 10.372830 10.675809 11.462403    10
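
The benchmark harness itself is not shown. As a rough sketch of how one of the approaches could be timed on simulated data, the following uses only base R and `system.time()`. The sizes are deliberately smaller than the ones above so the `merge()` stays cheap, and the seed, `n_rows`, `df_big` and `count_pairs_base` are hypothetical names introduced for this example, not the setup behind the timings reported here.

```r
set.seed(1)  # hypothetical seed, only so the sketch is repeatable

# Assumed sizes, much smaller than the 1E3 IDs / 1E5 rows used above.
n_IDs <- 100
n_rows <- 2000
df_big <- data.frame(
  ID = sample(1:n_IDs, n_rows, replace = TRUE),
  Product = sample(1:n_rows, n_rows, replace = TRUE)
)

# Hypothetical helper wrapping the base R approach, with the count
# column renamed from ID to n.
count_pairs_base <- function(d) {
  pairs <- merge(d, d, by = "ID")
  pairs <- pairs[pairs$Product.x < pairs$Product.y, ]
  out <- aggregate(ID ~ Product.x + Product.y, data = pairs, FUN = length)
  names(out)[names(out) == "ID"] <- "n"
  out
}

timing <- system.time(res <- count_pairs_base(df_big))
timing["elapsed"]
head(res)
```

A single `system.time()` call is noisy, so repeated runs would be needed before comparing approaches.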

Data:

ID <- c(1,1,2,2,2,3,3,4)
Product <- c(1,2,2,3,4,1,2,3)

dt <- data.table(ID, Product)
tib <- tibble(ID, Product)
df <- data.frame(ID, Product)