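I have been working with some data based on a column that contains free text. I want to identify a specific set of strings in this text, create a column that holds one match per row, and duplicate the row whenever a field contains more than one match. This is what I have so far (apologies to anyone not feeling festive):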

#Example dataframe
library(stringr)
dats<-data.frame(ID=1:5,text=c("rudolph","rudolph the","rudolph the red","rudolph the red nosed","rudolph the red nosed reindeer"))
dats

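#Regular expression
patt<-c("rudolph","the","red","nosed","reindeer")
reg.patt<-paste(patt,collapse="|")
# simplify=TRUE already returns a character matrix (one row per input, padded with ""), so no unlist() is needed
dats$matched<-str_extract_all(dats$text,reg.patt,simplify=TRUE)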

#Re-shape data
library(reshape2)  # provides melt()
# spell out dats$matched rather than relying on partial matching of dats$match
dats2<-data.frame("ID"=dats$ID, "text"=dats$text,"match1"=dats$matched[,1],"match2"=dats$matched[,2],"match3"=dats$matched[,3],"match4"=dats$matched[,4],"match5"=dats$matched[,5])
dats3<-melt(dats2,id.vars=c("ID","text"))
dats3<-dats3[dats3$value!="",]  # drop the empty padding cells
dats3$variable<-NULL
dats3<-dats3[order(dats3$ID,decreasing=FALSE),]
dats3

This definitely works, but I'm sure there is a more efficient way to do it. Does anyone have suggestions?

Merry Christmas!

Best answer

Try cSplit from the splitstackshape package:

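library(splitstackshape)
# collapse each row's matches into one comma-separated string (sapply returns a plain character vector)
dats$value <- sapply(str_extract_all(dats$text, reg.patt), toString)
# split on the default "," separator and stack the pieces into one row per match
cSplit(dats, 'value', direction="long")
#     ID                           text    value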
#  1:  1                        rudolph  rudolph
#  2:  2                    rudolph the  rudolph
#  3:  2                    rudolph the      the
#  4:  3                rudolph the red  rudolph
#  5:  3                rudolph the red      the
#  6:  3                rudolph the red      red
#  7:  4          rudolph the red nosed  rudolph
#  8:  4          rudolph the red nosed      the
#  9:  4          rudolph the red nosed      red
# 10:  4          rudolph the red nosed    nosed
# 11:  5 rudolph the red nosed reindeer  rudolph
# 12:  5 rudolph the red nosed reindeer      the
# 13:  5 rudolph the red nosed reindeer      red
# 14:  5 rudolph the red nosed reindeer    nosed
# 15:  5 rudolph the red nosed reindeer reindeer
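
For comparison, here is a minimal base R sketch of the same reshaping that needs no extra packages. It is not part of the answer above, just one possible alternative: it rebuilds the example data from the question, and the names hits, n and long are only illustrative. The idea is to pull out the matches per row, repeat each row once per match, and attach the matches as a new column.

```r
# Same example data and pattern as in the question
dats <- data.frame(
  ID = 1:5,
  text = c("rudolph", "rudolph the", "rudolph the red",
           "rudolph the red nosed", "rudolph the red nosed reindeer"),
  stringsAsFactors = FALSE
)
patt <- c("rudolph", "the", "red", "nosed", "reindeer")
reg.patt <- paste(patt, collapse = "|")

# One character vector of matches per row of dats
hits <- regmatches(dats$text, gregexpr(reg.patt, dats$text))

# Number of matches per row; rows with no match get 0 and drop out below
n <- lengths(hits)

# Repeat each row once per match, then attach the matches as a column
long <- dats[rep(seq_len(nrow(dats)), n), ]
long$value <- unlist(hits)
rownames(long) <- NULL
long
```

Note that, as with the pattern in the question, the alternation matches substrings, so "the" would also be found inside a word such as "other". If that matters for your data, wrapping each term in word boundaries (for example "\\bthe\\b") is one way to tighten it.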

This question and answer, on efficiently duplicating rows for each regular-expression match in R, comes from Stack Overflow: https://stackoverflow.com/questions/34399154/
