I have a data frame like this:
ID Product
1 1
1 2
2 2
2 3
2 4
3 1
3 2
4 3
Now I need a way to do the following in R:
finding products that customers bought together
My output should look like this:
ProductX ProductY Times
1 2 2
2 3 1
3 4 1
A matrix output would also be nice, for example:
[,1] [,2] [,3] [,4]
[1,] 1 2 0 0
[2,] 2 1 1 0
[3,] 0 0 1 1
[4,] 0 0 1 1
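I have already tried the reshape2 package, but I don't know how to get output in this form.
Best answer
Here is an answer using data.table and a non-equi self-join. Edit: added allow.cartesian = TRUE
so that it also works on large data sets.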
library(data.table)
dt[dt
, on = .(ID = ID, Product < Product)
, .(ProductX = x.Product, ProductY = i.Product)
, nomatch = 0L
, allow.cartesian = TRUE
][, .N, by = .(ProductX, ProductY)]
ProductX ProductY N
1: 1 2 2
2: 2 3 1
3: 2 4 1
4: 3 4 1
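To see what the non-equi self-join is counting, here is a minimal base R sketch that builds the same pairs explicitly: for each ID it lists every pair of distinct products with the smaller product first, then counts how often each pair occurs. The names `pairs`, `pairs_df` and `counts` are only illustrative, and the sketch assumes each customer lists a given product at most once.

```r
ID <- c(1, 1, 2, 2, 2, 3, 3, 4)
Product <- c(1, 2, 2, 3, 4, 1, 2, 3)

# For each customer, enumerate all product pairs (smaller product first).
# Customers with fewer than two products contribute no pairs.
pairs <- do.call(rbind, lapply(split(Product, ID), function(p) {
  p <- sort(unique(p))
  if (length(p) < 2) return(NULL)
  t(combn(p, 2))
}))

pairs_df <- data.frame(ProductX = pairs[, 1], ProductY = pairs[, 2], n = 1)

# Sum the indicator column per pair to get the co-purchase count
counts <- aggregate(n ~ ProductX + ProductY, data = pairs_df, FUN = sum)
print(counts)
```

This is only meant to illustrate the logic; the join-based versions avoid the per-ID loop and should scale better.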
The dplyr equivalent:
library(dplyr)
inner_join(tib, tib, by = 'ID') %>%
  filter(Product.x < Product.y) %>%
  count(Product.x, Product.y)
Product.x Product.y n
<dbl> <dbl> <int>
1 1 2 2
2 2 3 1
3 2 4 1
4 3 4 1
Here is a base R version as well:
aggregate(ID ~ Product.x + Product.y
, data = merge(df, df, by = 'ID')
, subset = Product.x < Product.y
, FUN = length)
# Note: the count column comes out named ID; rename it to n if needed
Product.x Product.y ID
1 1 2 2
2 2 3 1
3 2 4 1
4 3 4 1
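For the matrix form mentioned in the question, one possible sketch (not part of the original answer) is to cross-tabulate ID against Product and take the cross-product of that table. It assumes each ID/Product combination appears at most once; the names `incidence` and `co_occurrence` are illustrative. Note that the diagonal then holds the number of customers who bought each product, which differs from the example matrix in the question.

```r
ID <- c(1, 1, 2, 2, 2, 3, 3, 4)
Product <- c(1, 2, 2, 3, 4, 1, 2, 3)
df <- data.frame(ID, Product)

# Rows are customers, columns are products; a cell counts how often
# that customer appears with that product (0 or 1 here)
incidence <- table(df$ID, df$Product)

# t(incidence) %*% incidence: cell [i, j] counts customers who bought both i and j
co_occurrence <- crossprod(incidence)
print(co_occurrence)
```

The off-diagonal cells match the pair counts above, mirrored across the diagonal.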
Performance:
Unit: milliseconds
expr min lq mean median uq max neval
dt_way 3.9149 4.29330 4.593209 4.6597 4.80210 6.2326 100
dplyr_inner_join 1.8218 1.91510 2.058864 2.0572 2.16205 3.0157 100
dplyr_tidyr 13.8107 14.15735 16.020262 14.3571 14.78975 127.9654 100
base_agg 2.3393 2.51215 2.586652 2.5804 2.63865 3.4415 100
n_IDs <- 1E3
n_Sims <- 1E5
ID_big <- sample(1:n_IDs, n_Sims, replace = TRUE)
Product_big <- sample(1:n_Sims, n_Sims, replace = TRUE)
Unit: seconds
expr min lq mean median uq max neval
dt_way 1.633111 1.904460 1.998192 1.986452 2.110937 2.308671 10
dplyr_inner_join 5.606322 6.361026 6.574015 6.606423 6.839273 7.198770 10
dplyr_tidyr 8.385418 9.350730 10.127512 10.372830 10.675809 11.462403 10
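The timing tables above look like output from the microbenchmark package, although the benchmarking call itself is not shown. A rough sketch of how such a comparison could be set up on the small sample data follows, assuming microbenchmark, data.table and dplyr are installed. The `dplyr_tidyr` variant is left out because its code does not appear in the answer, and recent dplyr versions may warn about the many-to-many join.

```r
library(data.table)
library(dplyr)
library(microbenchmark)

ID <- c(1, 1, 2, 2, 2, 3, 3, 4)
Product <- c(1, 2, 2, 3, 4, 1, 2, 3)
dt <- data.table(ID, Product)
tib <- tibble(ID, Product)
df <- data.frame(ID, Product)

# Each named argument is one expression to time; times = 100 matches neval above
bench <- microbenchmark(
  dt_way = dt[dt, on = .(ID = ID, Product < Product),
              .(ProductX = x.Product, ProductY = i.Product),
              nomatch = 0L, allow.cartesian = TRUE
              ][, .N, by = .(ProductX, ProductY)],
  dplyr_inner_join = inner_join(tib, tib, by = "ID") %>%
    filter(Product.x < Product.y) %>%
    count(Product.x, Product.y),
  base_agg = aggregate(ID ~ Product.x + Product.y,
                       data = merge(df, df, by = "ID"),
                       subset = Product.x < Product.y,
                       FUN = length),
  times = 100
)
print(bench)
```

Timings will vary by machine and package version, so the numbers are not expected to reproduce the tables exactly.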
Data:
ID <- c(1,1,2,2,2,3,3,4)
Product <- c(1,2,2,3,4,1,2,3)
dt <- data.table(ID, Product)
tib <- tibble(ID, Product)
df <- data.frame(ID, Product)