I'm looking for a simple way to vectorize a for loop in R.
I have the following data frame containing sentences, plus two dictionaries of positive and negative words:

# Create data.frame with sentences
sent <- data.frame(words = c("just right size and i love this notebook", "benefits great laptop",
                         "wouldnt bad notebook", "very good quality", "orgtop",
                         "great improvement for that bad product but overall is not good", "notebook is not good but i love batterytop"), user = c(1,2,3,4,5,6,7),
               stringsAsFactors=F)

# Create pos/negWords
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
          "extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great",
          "wouldnt bad")
negWords <- c("hate","bad","not good","horrible")


Now I replicate the original data frame to simulate a big data set:

# Replicate original data.frame - big data simulation (700,000 rows of sentences)
df.expanded <- as.data.frame(replicate(100000, sent$words))
library(zoo)  # for coredata()
sent <- coredata(sent)[rep(seq(nrow(sent)), 100000), ]
rownames(sent) <- NULL


For the next step, I order the dictionary words (positive words = 1, negative words = -1) by length in descending order, so that longer phrases are matched before their single-word components.

# Ordering words in pos/negWords
wordsDF <- data.frame(words = posWords, value = 1,stringsAsFactors=F)
wordsDF <- rbind(wordsDF,data.frame(words = negWords, value = -1))
wordsDF$lengths <- nchar(wordsDF$words)  # nchar() is already vectorized
wordsDF <- wordsDF[order(-wordsDF[,3]),]
rownames(wordsDF) <- NULL
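To see why the longest-first ordering matters, here is a minimal sketch using "very good" vs. "good" from posWords, together with the removal step from the scoring function below:

```r
s <- "very good quality"
# Longest-first: "very good" is scored and removed, so "good" cannot
# match again and double-count the same phrase.
grepl("\\bvery good\\b", s)                    # TRUE
s2 <- gsub("\\s*\\bvery good\\b\\s*", " ", s)  # -> " quality"
grepl("\\bgood\\b", s2)                        # FALSE
```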


Then I define the following function with a for loop:

# Sentiment score function
library(qdapRegex)  # for rm_white()

scoreSentence2 <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    matchWords <- paste("\\<", wordsDF[x,1], '\\>', sep="")  # match exact words
    count <- length(grep(matchWords, sentence))              # count them
    if(count){
      score <- score + (count * wordsDF[x,2])  # score = count * sentiment value
      sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*'), ' ', sentence)  # remove the matched word
      sentence <- rm_white(sentence)           # collapse leftover whitespace
    }
  }
  score
}
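As an aside on the two regex styles mixed above: \< and \> are word-boundary anchors in R's default (TRE) regex engine, while \b is the equivalent used in the gsub() pattern; both prevent "top" from matching inside "orgtop". A quick check:

```r
# Both anchor styles match whole words only under R's default engine
grepl("\\<top\\>", c("top", "orgtop"))  # TRUE FALSE
grepl("\\btop\\b", c("top", "orgtop"))  # TRUE FALSE
```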


I call this function on the sentences in the data frame:

# Apply scoreSentence function to sentences
SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
# Time consumption for 700,000 sentences in the sent data.frame:
# user       system    elapsed
# 1054.19    0.09      1056.17
# Add sentiment score to the original sent data.frame
sent <- cbind(sent, SentimentScore2)


The desired output is:

Words                                             user      SentimentScore2
just right size and i love this notebook          1         2
benefits great laptop                             2         1
wouldnt bad notebook                              3         1
very good quality                                 4         1
orgtop                                            5         0
  .
  .
  .


and so on...

Please, can anyone help me reduce the computation time of my original approach? Given my beginner-level programming skills in R, I'm stuck :-)
Any help or advice will be greatly appreciated. Thank you very much in advance.

Best answer

In the spirit of "teach a man to fish rather than give him a fish", I will walk you through the following steps:


1. Make a copy of your code: you are going to break it!
2. Find the bottlenecks:

1a: Scale the problem down:

nRep <- 100
df.expanded <- as.data.frame(replicate(nRep, sent$words))
library(zoo)
sent <- coredata(sent)[rep(seq(nrow(sent)), nRep), ]


1b: Keep a reference solution: you are going to change your code, and few activities are as amazingly effective at introducing bugs as optimizing code!

sentRef <- sent


Then add the same line, commented out, at the end of your code to remember where your reference came from. To make it even easier to check that you haven't broken your code, you can test it automatically at the end:

library("testthat")
expect_equal(sent,sentRef)


1c: Wrap the profiler around the code you want to inspect:

Rprof()
SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
Rprof(NULL)


1d: View the result with base R:

summaryRprof()
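As a sketch of the full round trip (the file name profile.out is an arbitrary choice; Rprof() defaults to Rprof.out):

```r
Rprof("profile.out")                 # start the sampling profiler
SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
Rprof(NULL)                          # stop profiling
summaryRprof("profile.out")$by.self  # self time per function, largest first
```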


There are better tools; check out the packages profileR or lineprof.

lineprof is my tool of choice, and a real added value here: it narrows the problem down to these two lines:

matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words
count <- length(grep(matchWords,sentence)) # count them

3. Fix it:

3.1 Fortunately the main problem is easy: you don't need the first line inside the function, just move it outside beforehand. By the way, the same applies to your paste0(). Your code becomes:

matchWords <- paste("\\<", wordsDF[,1], '\\>', sep="")  # match exact words
matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*')

# Sentiment score function
scoreSentence2 <- function(sentence){
    score <- 0
    for(x in 1:nrow(wordsDF)){
        count <- length(grep(matchWords[x], sentence))  # count them
        if(count){
            score <- score + (count * wordsDF[x,2])     # score = count * sentiment value
            sentence <- gsub(matchedWords[x], ' ', sentence)  # remove the matched word
            # sentence <- rm_white(sentence)  # qdapRegex; not needed for scoring
        }
    }
    score
}


This changes the execution time for 1,000 replications
from 5.64 s to 2.32 s. Not a bad investment!

3.2 The next bottleneck is the `count <-` line, but I think shadow's answer already has that one right :-) Combining the two, we get:

matchWords <- paste("\\<", wordsDF[,1], '\\>', sep="")  # match exact words
matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*')

# Sentiment score function
scoreSentence2 <- function(sentence){
    score <- 0
    for(x in 1:nrow(wordsDF)){
        count <- grepl(matchWords[x], sentence)  # logical hit coerces to 0/1
        score <- score + (count * wordsDF[x,2])  # score = hit * sentiment value
        sentence <- gsub(matchedWords[x], ' ', sentence)  # remove the matched word
        # sentence <- rm_white(sentence)  # qdapRegex; not needed for scoring
    }
    score
}



That brings us down to 0.18 s, a 31-fold speedup...
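Going one step further than the code above: the lapply() over sentences can be dropped entirely by letting grepl()/gsub() operate on the whole character vector at once, looping only over the (roughly two dozen) dictionary entries. A sketch under that idea; scoreAll is a hypothetical name, not from the original code:

```r
scoreAll <- function(sentences, wordsDF) {
  patterns <- paste0("\\b", wordsDF$words, "\\b")
  scores <- numeric(length(sentences))
  for (x in seq_len(nrow(wordsDF))) {
    hits <- grepl(patterns[x], sentences)      # logical vector over ALL sentences
    scores <- scores + hits * wordsDF$value[x]
    # remove matched phrases only in the sentences where they occurred,
    # so shorter dictionary entries cannot re-match them
    sentences[hits] <- gsub(patterns[x], " ", sentences[hits])
  }
  scores
}

# sent$SentimentScore2 <- scoreAll(sent$words, wordsDF)
```

This keeps the length-descending order of wordsDF doing the same de-duplication work as before, but each grepl()/gsub() call now handles all 700,000 sentences in one pass.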

08-24 15:40