Problem description
I am trying to tidy the following dataset (in link) in R and then run association rules on it, using the code below.
install.packages("dplyr")
library(dplyr)
df <- read.csv("Groceries (2).csv", header = F, stringsAsFactors = F, na.strings=c(""," ","NA"))
install.packages("stringr")
library(stringr)
temp1<- (str_extract(df$V1, "[a-z]+"))
temp2<- (str_extract(df$V1, "[^a-z]+"))
df<- cbind(temp1,df)
df[2] <- NULL
df[35] <- NULL
View(df)
summary(df)
str(df)
trans <- as(df,"transactions")
I get the following warning when I run the trans <- as(df,"transactions") line above:
Warning message:
Column(s) 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34 not logical or factor. Applying default discretization (see '? discretizeDF').
summary(trans)
When I run the above code, I get the following:
transactions as itemMatrix in sparse format with
1499 rows (elements/itemsets/transactions) and
1268 columns (items) and a density of 0.01529042
most frequent items:
 V5= vegetables   V6= vegetables  temp1=vegetables   V2= vegetables   V9= vegetables          (Other)
            140              113               109              108              103            28490
The results show the vegetables values as separate items, one per column (V5= vegetables, V6= vegetables, and so on), instead of a single combined vegetables item, which obviously inflates the number of columns. I am not sure why this is happening.
fit<-apriori(trans,parameter=list(support=0.006,confidence=0.25,minlen=2))
fit<-sort(fit,by="support")
inspect(head(fit))
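Recommended answer
To coerce cleanly to the transactions class, the data frame needs to be made up of factor (or logical) columns. You have a data frame of character columns, hence the warning message. In addition, when a data frame is coerced, each column is treated as a separate variable, so the same product appearing in different columns becomes different items (V5= vegetables, V6= vegetables, and so on). The data requires some further cleaning before it will coerce properly.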
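To illustrate why the same product turns into several items, here is a minimal sketch on a made-up two-column data frame (the name toy and its values are hypothetical, not the asker's data). It assumes the arules package is installed.

```r
library(arules)

# Toy data (hypothetical): the same two products appear in different columns.
toy <- data.frame(
  V1 = c("milk", "bread"),
  V2 = c("bread", "milk"),
  stringsAsFactors = TRUE
)

# Each column is treated as its own variable, so items are named column=value.
toy_trans <- as(toy, "transactions")
toy_labels <- itemLabels(toy_trans)
print(toy_labels)
```

Two products yield four items here because every label carries its column name, which is the same effect that produces 1268 columns in the question.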
I'm not very familiar with the arules package, but I believe the read.transactions function may be more useful, as it can discard duplicate items within a transaction (its rm.duplicates argument). I found it easiest to build a binary matrix with a for loop, but I am sure there is a neater solution.
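As a rough illustration of the read.transactions route, the sketch below reads a made-up basket file (the file contents and the IDs t1 and t2 are hypothetical). It assumes one transaction per line with the ID in its own first field; the asker's file appears to fuse the date and the first item in V1, so it would need preprocessing before this applies.

```r
library(arules)

# Hypothetical basket file: one transaction per line, ID in the first field.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "t1,pork,juice,pork",
  "t2,shampoo,juice"
), tmp)

toy_trans <- read.transactions(
  tmp,
  format = "basket",
  sep = ",",
  cols = 1,              # first field holds the transaction ID
  rm.duplicates = TRUE   # drop the repeated "pork" in t1
)

inspect(toy_trans)
```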
Continuing directly from your code:
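items <- as.character(unique(unlist(df)))          # all unique values in the data frame
items <- items[which(str_detect(items, "[a-z]"))]  # keep item names; drops numbers and NA
# binary incidence matrix: one row per transaction, one column per item
trans <- matrix(0, nrow = nrow(df), ncol = length(items))
for (i in seq_len(nrow(df))) {
  trans[i, which(items %in% t(df[i, ]))] <- 1      # flag the items present in row i
}
colnames(trans) <- items
rownames(trans) <- temp2                           # non-letter prefix of V1, used as transaction ID
trans <- as(trans, "transactions")
summary(trans)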
which gives:
transactions as itemMatrix in sparse format with
1637 rows (elements/itemsets/transactions) and
38 columns (items) and a density of 0.3359965
most frequent items:
vegetables    poultry    waffles  ice cream lunch meat    (Other)
      1058        582        562        556        555      17588
element (itemset/transaction) length distribution:
sizes
  0   1   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26
102  36   8  57  51  51  71  69  63  80  79  58  84  91  72 105  97  87 114  91  82  46  30   7   4   2
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00    8.00   14.00   12.77   18.00   26.00
includes extended item information - examples:
labels
1 pork
2 shampoo
3 juice
includes extended transaction information - examples:
transactionID
1 1/1/2000
2 1/1/2000
3 2/1/2000
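One possible neater alternative to the for loop is to build a list with one character vector per row and coerce that list to transactions. This is only a sketch on a made-up data frame (toy and its values are hypothetical); on the real data, non-item values such as numbers would still need filtering out first, for example with the str_detect step above.

```r
library(arules)

# Toy stand-in (hypothetical) for the cleaned data frame: one row per basket,
# NA padding, and leading spaces like the " vegetables" values in the question.
toy <- data.frame(
  temp1 = c("pork", "juice"),
  V2 = c(" juice", " pork"),
  V3 = c(" pork", NA),
  stringsAsFactors = FALSE
)

# One character vector per row: trim whitespace, drop NA, drop repeats.
baskets <- lapply(seq_len(nrow(toy)), function(i) {
  row_items <- trimws(unlist(toy[i, ], use.names = FALSE))
  unique(row_items[!is.na(row_items)])
})

# A list of item vectors coerces directly, with one item per distinct name.
toy_trans <- as(baskets, "transactions")
summary(toy_trans)
```

Because items are identified by name rather than by column, "pork" in any column counts as the same item.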