I used quanteda::textmodel_NB to build a model that classifies text into one of two categories. I fit the model on a training data set from last summer.

Now I am trying to use it this summer to classify the new text we receive here at work. When I try, I get the following error:

Error in predict.textmodel_NB_fitted(model, test_dfm) :
feature set in newdata different from that in training set

The code that generates the error is at lines 157 to 165 of the function's source.

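I assume this happens because the words in the training data set do not exactly match the words in the test data set. But why is this an error at all? I would expect the model to handle data sets with different features, since that must come up often in real applications.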

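So my first question is:

1. Is this error a property of the naive Bayes algorithm, or was it a choice made by the function's author?

Which leads to my second question:

2. How can I fix this?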

To help with the second question, here is reproducible code (the last line generates the error above):
library(quanteda)
library(magrittr)
library(data.table)

train_text <- c("Can random effects apply only to categorical variables?",
                "ANOVA expectation identity",
                "Statistical test for significance in ranking positions",
                "Is Fisher Sharp Null Hypothesis testable?",
                "List major reasons for different results from survival analysis among different studies",
                "How do the tenses and aspects in English correspond temporally to one another?",
                "Is there a correct gender-neutral singular pronoun (“his” vs. “her” vs. “their”)?",
                "Are collective nouns always plural, or are certain ones singular?",
                "What’s the rule for using “who” and “whom” correctly?",
                "When is a gerund supposed to be preceded by a possessive adjective/determiner?")

train_class <- factor(c(rep(0,5), rep(1,5)))

train_dfm <- train_text %>%
  dfm(tolower=TRUE, stem=TRUE, remove=stopwords("english"))

model <- textmodel_NB(train_dfm, train_class)

test_text <- c("Weighted Linear Regression with Proportional Standard Deviations in R",
               "What do significance tests for adjusted means tell us?",
               "How should I punctuate around quotes?",
               "Should I put a comma before the last item in a list?")

test_dfm <- test_text %>%
  dfm(tolower=TRUE, stem=TRUE, remove=stopwords("english"))

predict(model, test_dfm)

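The only thing I could think of was to make the features identical by hand (I assumed this would fill in 0 for features missing from an object), but that produces a new error. The code for the example above is: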
model_features <- model$data$x@Dimnames$features # gets the features of the training data

test_features <- test_dfm@Dimnames$features # gets the features of the test data

all_features <- c(model_features, test_features) %>% # combining the two sets of features...
  subset(!duplicated(.)) # ...and getting rid of duplicate features

model$data$x@Dimnames$features <- test_dfm@Dimnames$features <- all_features # replacing features of model and test_dfm with all_features

predict(model, dfm) # new error?

However, this code generates a new error:
Error in if (ncol(object$PcGw) != ncol(newdata)) stop("feature set in newdata different from that in training set") :
  argument is of length zero
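
This second message seems to come from `if` receiving a zero-length condition rather than from the feature check itself. The call above passes `dfm` (the function) instead of `test_dfm`, and `ncol()` of a function is `NULL`. A minimal base R sketch of that mechanism follows; `newdata` and `n_train_features` are hypothetical stand-ins, not quanteda objects.

```r
# Stand-in for accidentally passing a function (such as dfm) as newdata
newdata <- function(x) x

# Hypothetical stand-in for ncol(object$PcGw)
n_train_features <- 44

ncol(newdata)                      # NULL: a function has no dimensions
n_train_features != ncol(newdata)  # logical(0): comparing with NULL gives an empty result

# if() cannot evaluate an empty condition, so it fails before stop() is reached
msg <- tryCatch(
  {
    if (n_train_features != ncol(newdata)) stop("feature set in newdata different")
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

Even with `test_dfm` passed instead, renaming the dimnames would not add columns, so the column counts of the model and the test data would still differ.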

How can I apply this naive Bayes model to a new data set with different features?

Best answer

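Fortunately, there is an easy way to do this: use dfm_select() on the test data to give it the same features (in the same order) as the training set. It is as simple as this: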

test_dfm <- dfm_select(test_dfm, train_dfm)
predict(model, test_dfm)
## Predicted textmodel of type: Naive Bayes
##
##             lp(0)       lp(1)     Pr(0)  Pr(1) Predicted
## text1  -0.6931472  -0.6931472    0.5000 0.5000         0
## text2 -11.8698712 -13.1879095    0.7889 0.2111         0
## text3  -4.1484118  -3.6635616    0.3811 0.6189         1
## text4  -8.0091415  -8.4257356    0.6027 0.3973         0
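
To illustrate why the feature sets have to line up, and what aligning them amounts to, here is a minimal base R sketch. It does not reproduce quanteda's internals: `align_features` is a hypothetical helper, the counts are made up, and each toy training row stands in for one class. The idea is that naive Bayes scores a document by multiplying its feature counts with a matrix of per-class log word probabilities, so the columns of the new data must match the training columns one for one.

```r
# Toy training counts: 2 documents x 3 features (illustrative numbers only)
train <- matrix(c(2, 0, 1,
                  0, 3, 1),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("d1", "d2"), c("test", "noun", "rule")))

# Toy new data with a different feature set:
# "comma" was never seen in training, "noun" and "rule" are missing
new <- matrix(c(1, 2,
                0, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("n1", "n2"), c("test", "comma")))

# Hypothetical helper: keep the training features in training order,
# fill features absent from the new data with 0, drop unseen features
align_features <- function(x, features) {
  out <- matrix(0, nrow = nrow(x), ncol = length(features),
                dimnames = list(rownames(x), features))
  shared <- intersect(features, colnames(x))
  out[, shared] <- x[, shared, drop = FALSE]
  out
}

new_aligned <- align_features(new, colnames(train))

# Per-class log word probabilities with add-one smoothing
# (each training row is treated as its own class in this toy setup)
log_pwc <- log((train + 1) / rowSums(train + 1))

# Document scores: counts %*% t(log P(w | c)); this needs matching columns
scores <- new_aligned %*% t(log_pwc)
print(new_aligned)
print(scores)
```

In this sketch, document `n2` shares no features with the training set, so its aligned row is all zeros and both class scores are 0. Only the prior would then separate the classes, which looks consistent with text1 in the output above, where both classes get a probability of 0.5.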

This Q&A, "r - Quanteda package, naive Bayes: how can I predict on test data with different features?", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/44136757/
