python - Naive Bayes classifier for extractive summarization

I am trying to train a naive Bayes classifier, but I'm having trouble with the data. I plan to use it for extractive text summarization.

Example_Input: It was a sunny day. The weather was nice and the birds were singing.
Example_Output: The weather was nice and the birds were singing.

I have a dataset I plan to use, and every document in it has at least one sentence marked as its summary.

I decided to use sklearn, but I don't know how to represent the data I have, namely X and y.

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X, y)

The best I could come up with is this:

X = [
        'It was a sunny day. The weather was nice and the birds were singing.',
        'I like trains. Hi, again.'
    ]

y = [
        [0,1],
        [1,0]
    ]

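Here the target values are 1 (sentence included in the summary) and 0 (not included). Unfortunately, this raises a bad input shape exception, because y is expected to be a 1-d array. I can't think of a way to represent it, so please help.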

By the way, I don't use the string values in X directly; I convert them to vectors with sklearn's CountVectorizer and TfidfTransformer.

Best answer

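From what you describe, you are classifying sentences. That means each sentence has to be a separate sample so that its class can be predicted on its own.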

For example, instead of using:

X = [
        'It was a sunny day. The weather was nice and the birds were singing.',
        'I like trains. Hi, again.'
    ]

split it into one sentence per entry:

X = [
        'It was a sunny day.',
        'The weather was nice and the birds were singing.',
        'I like trains.',
        'Hi, again.'
    ]

You can do this with NLTK's sentence tokenizer.
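As a rough sketch of that splitting step: the helper name `split_sentences` is made up for illustration, and the regex fallback is an assumption added so the snippet still runs when NLTK or its Punkt tokenizer data is not installed. It is much cruder than NLTK (it will mis-split abbreviations such as "Dr. Smith").

```python
import re


def split_sentences(text):
    """Split a document into sentences (hypothetical helper, not from the question)."""
    try:
        # NLTK's sentence tokenizer; needs the Punkt data (nltk.download('punkt')).
        from nltk.tokenize import sent_tokenize
        return sent_tokenize(text)
    except (ImportError, LookupError):
        # Fallback assumption: split on whitespace that follows ., ! or ?
        return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


docs = [
    'It was a sunny day. The weather was nice and the birds were singing.',
    'I like trains. Hi, again.',
]

# One list of sentences per document.
docs_sentences = [split_sentences(doc) for doc in docs]
```

Each entry of `docs_sentences` then holds the sentences of one document, ready to be flattened into X.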

Now, for the labels, use two classes. Say 1 means yes (the sentence belongs in the summary) and 0 means no.

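# One label per sentence, in the same order as X.
# y must be a flat 1-d sequence, not a list of one-element lists.
y = [
        0,
        1,
        1,
        0
    ]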
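If the labels start out grouped per document, as in the question, one way to get one label per sentence is to flatten the sentences and the labels in parallel. This is only a sketch; `docs_sentences` and `doc_labels` are illustrative names, not something from the question.

```python
from itertools import chain

# Sentences and labels grouped per document (illustrative data from the question).
docs_sentences = [
    ['It was a sunny day.', 'The weather was nice and the birds were singing.'],
    ['I like trains.', 'Hi, again.'],
]
doc_labels = [
    [0, 1],
    [1, 0],
]

# Sanity check: every sentence must have exactly one label.
for sentences, labels in zip(docs_sentences, doc_labels):
    if len(sentences) != len(labels):
        raise ValueError('sentence/label count mismatch in a document')

# Flatten both in the same order so X[i] is labeled by y[i].
X = list(chain.from_iterable(docs_sentences))
y = list(chain.from_iterable(doc_labels))
```

Flattening both lists with the same iteration order keeps each sentence aligned with its label, and y ends up 1-d as sklearn expects.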

Now use this data to fit the classifier and predict however you need!
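A minimal end-to-end sketch, assuming the CountVectorizer and TfidfTransformer setup mentioned in the question. Wrapping the steps in `make_pipeline` is a convenience choice here, not necessarily how the asker wired them up, and `new_sentences` is made-up input. With only four training sentences the predictions are not meaningful; the point is the shape of X and y.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# One sentence per sample, one label per sentence (1 = in summary, 0 = not).
X = [
    'It was a sunny day.',
    'The weather was nice and the birds were singing.',
    'I like trains.',
    'Hi, again.',
]
y = [0, 1, 1, 0]

# Raw text -> token counts -> tf-idf weights -> naive Bayes.
clf = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
clf.fit(X, y)

# Classify the sentences of a new document (hypothetical input).
new_sentences = ['The birds were singing all day.', 'Hi there.']
predicted = clf.predict(new_sentences)

# The extractive summary is the set of sentences predicted as class 1.
summary = [s for s, label in zip(new_sentences, predicted) if label == 1]
```

For a real dataset, the sentences of every document would be split, labeled, and concatenated into X and y in the same way before fitting.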

Hope this helps!

Source: a similar question on Stack Overflow about python - Naive Bayes classifier for extractive summarization: https://stackoverflow.com/questions/43216743/