This post covers how to convert a transaction-format dataset in R into basket format for sequence mining.
Problem description
Original table
CELL NUMBER    ACTIVITY    TIME
001            call a      12.23
002            call b      01.00
002            call d      01.09
001            call b      12.25
003            call a      12.23
002            call a      02.07
003            call b      12.25
Required: mine the most frequently occurring sequence of ACTIVITY from a dataset of size 400,000.
For the sample above, the output should be:
[call a-12.23,call b-12.25] frequency 2
[call b-01.00,call d-01.09,call a-02.07] frequency 1
I'm aware that this can be achieved using arulesSequences. What transformations do I need to carry out on the dataset, and how, in order to use the arulesSequences package?
Current db format: transactions with 3 columns, like the sample above.
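For reference, the basket format that arulesSequences reads has one line per event: a sequence ID, an event ID, the number of items, then the items. The following is only a rough sketch of that conversion on the sample data, under some assumptions not stated in the question: the cell number is used as the sequence ID, the event ID is the rank of each call by time within its cell, times are zero-padded and fall within one day, spaces in activity names are replaced by underscores, and `support = 0.5` is an arbitrary illustrative threshold. The mining step only runs if the package is installed.

```r
# Sample data from the question, read as character so "001" and "01.00"
# keep their leading zeros.
df <- read.table(header = TRUE, sep = "|", colClasses = "character",
                 text = "CELL NUMBER|ACTIVITY|TIME
001|call a|12.23
002|call b|01.00
002|call d|01.09
001|call b|12.25
003|call a|12.23
002|call a|02.07
003|call b|12.25")

# Order events by cell, then by time (assumption: times within a single day).
df <- df[order(df$CELL.NUMBER, as.numeric(df$TIME)), ]

# One basket line per event: sequenceID eventID SIZE item
baskets <- data.frame(
  sequenceID = as.integer(df$CELL.NUMBER),
  eventID    = ave(seq_len(nrow(df)), df$CELL.NUMBER, FUN = seq_along),
  SIZE       = 1L,
  item       = gsub(" ", "_", df$ACTIVITY),  # avoid spaces inside item labels
  stringsAsFactors = FALSE
)

# Write the baskets to a temporary file (hypothetical location).
basket_file <- tempfile(fileext = ".txt")
write.table(baskets, basket_file, sep = " ", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Mine frequent sequences only if arulesSequences is available.
if (requireNamespace("arulesSequences", quietly = TRUE)) {
  library(arulesSequences)
  trans <- read_baskets(basket_file,
                        info = c("sequenceID", "eventID", "SIZE"))
  # support = 0.5 is an illustrative threshold, not a recommendation
  freq_seqs <- cspade(trans, parameter = list(support = 0.5))
  print(as(freq_seqs, "data.frame"))
}
```

With 400,000 rows the same steps should apply, though the support threshold would need tuning to the data.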
Recommended answer
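# Rebuild the sample table; the "CELL NUMBER" header becomes CELL.NUMBER.
# TIME is parsed as numeric, so leading zeros (01.00) are dropped in the output.
df <- read.table(header = TRUE, sep = "|", text = "CELL NUMBER|ACTIVITY|TIME
001|call a|12.23
002|call b|01.00
002|call d|01.09
001|call b|12.25
003|call a|12.23
002|call a|02.07
003|call b|12.25")

library(plyr)                  # for the count() function
freqs <- count(df[, -1])       # [, -1] excludes the CELL NUMBER column from the grouping
freqs[order(-freqs$freq), ]    # most frequent (ACTIVITY, TIME) pairs first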
  ACTIVITY  TIME freq
2   call a 12.23    2
4   call b 12.25    2
1   call a  2.07    1
3   call b  1.00    1
5   call d  1.09    1
Edit: updated like this:
unique(ddply(freqs, .(-freq), summarise,
             calls = paste0("[", paste0(paste0(ACTIVITY, "-", TIME), collapse = ","),
                            "]", "frequency", freq)))
# -freq calls
#1 -2 [call a-12.23,call b-12.25]frequency2
#3 -1 [call a-2.07,call b-1,call d-1.09]frequency1
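The code above groups (ACTIVITY, TIME) pairs by how often they occur, not by cell. As an alternative sketch in base R, assuming a "sequence" means all of one cell's activities ordered by time, the per-cell sequences can be built and counted directly. Reading the columns as character (via `colClasses`) is a choice made here so that times such as 01.00 keep their leading zeros; the variable names are illustrative.

```r
# Sample data from the question, kept as character to preserve leading zeros.
df <- read.table(header = TRUE, sep = "|", colClasses = "character",
                 text = "CELL NUMBER|ACTIVITY|TIME
001|call a|12.23
002|call b|01.00
002|call d|01.09
001|call b|12.25
003|call a|12.23
002|call a|02.07
003|call b|12.25")

# Order events by cell, then by time (assumption: times within a single day).
df <- df[order(df$CELL.NUMBER, as.numeric(df$TIME)), ]

# Label each event as "activity-time", then collapse each cell into one string.
step <- paste0(df$ACTIVITY, "-", df$TIME)
cell_seq <- tapply(step, df$CELL.NUMBER,
                   function(s) paste0("[", paste(s, collapse = ","), "]"))

# Count identical sequences across cells, most frequent first.
seq_freq <- sort(table(as.vector(cell_seq)), decreasing = TRUE)
result <- data.frame(sequence  = names(seq_freq),
                     frequency = as.integer(seq_freq),
                     stringsAsFactors = FALSE)
print(result)
```

On the sample this gives `[call a-12.23,call b-12.25]` with frequency 2 and `[call b-01.00,call d-01.09,call a-02.07]` with frequency 1. Dropping the time from `step` would count sequences by activity order alone.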