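I am new to Stanford NLP and can't find any good, complete documentation or tutorial. My task is sentiment analysis. I have a very large dataset of product reviews, which I have already split into positive and negative based on the star ratings given by users. Now I need to find the most frequently occurring positive and negative adjectives to use as features for my algorithm. I understand how to do tokenization, lemmatization, and POS tagging from here. I have files like this.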

The review is:

Don't waste your money. This is a short DVD and the host is boring and offers information that is common sense to any idiot. Pass on this and buy something else. Very generic


And the output is:

Sentence #1 (6 tokens):
Don't waste your money.
[Text=Do CharacterOffsetBegin=0 CharacterOffsetEnd=2 PartOfSpeech=VBP Lemma=do]
[Text=n't CharacterOffsetBegin=2 CharacterOffsetEnd=5 PartOfSpeech=RB Lemma=not]
[Text=waste CharacterOffsetBegin=6 CharacterOffsetEnd=11 PartOfSpeech=VB Lemma=waste]
[Text=your CharacterOffsetBegin=12 CharacterOffsetEnd=16 PartOfSpeech=PRP$ Lemma=you]
[Text=money CharacterOffsetBegin=17 CharacterOffsetEnd=22 PartOfSpeech=NN Lemma=money]
[Text=. CharacterOffsetBegin=22 CharacterOffsetEnd=23 PartOfSpeech=. Lemma=.]
Sentence #2 (21 tokens):
This is a short DVD and the host is boring and offers information that is common sense to any idiot.
[Text=This CharacterOffsetBegin=24 CharacterOffsetEnd=28 PartOfSpeech=DT Lemma=this]
[Text=is CharacterOffsetBegin=29 CharacterOffsetEnd=31 PartOfSpeech=VBZ Lemma=be]
[Text=a CharacterOffsetBegin=32 CharacterOffsetEnd=33 PartOfSpeech=DT Lemma=a]
[Text=short CharacterOffsetBegin=34 CharacterOffsetEnd=39 PartOfSpeech=JJ Lemma=short]
[Text=DVD CharacterOffsetBegin=40 CharacterOffsetEnd=43 PartOfSpeech=NN Lemma=dvd]
[Text=and CharacterOffsetBegin=44 CharacterOffsetEnd=47 PartOfSpeech=CC Lemma=and]
[Text=the CharacterOffsetBegin=48 CharacterOffsetEnd=51 PartOfSpeech=DT Lemma=the]
[Text=host CharacterOffsetBegin=52 CharacterOffsetEnd=56 PartOfSpeech=NN Lemma=host]
[Text=is CharacterOffsetBegin=57 CharacterOffsetEnd=59 PartOfSpeech=VBZ Lemma=be]
[Text=boring CharacterOffsetBegin=60 CharacterOffsetEnd=66 PartOfSpeech=JJ Lemma=boring]
[Text=and CharacterOffsetBegin=67 CharacterOffsetEnd=70 PartOfSpeech=CC Lemma=and]
[Text=offers CharacterOffsetBegin=71 CharacterOffsetEnd=77 PartOfSpeech=VBZ Lemma=offer]
[Text=information CharacterOffsetBegin=78 CharacterOffsetEnd=89 PartOfSpeech=NN Lemma=information]
[Text=that CharacterOffsetBegin=90 CharacterOffsetEnd=94 PartOfSpeech=WDT Lemma=that]
[Text=is CharacterOffsetBegin=95 CharacterOffsetEnd=97 PartOfSpeech=VBZ Lemma=be]
[Text=common CharacterOffsetBegin=98 CharacterOffsetEnd=104 PartOfSpeech=JJ Lemma=common]
[Text=sense CharacterOffsetBegin=105 CharacterOffsetEnd=110 PartOfSpeech=NN Lemma=sense]
[Text=to CharacterOffsetBegin=111 CharacterOffsetEnd=113 PartOfSpeech=TO Lemma=to]
[Text=any CharacterOffsetBegin=114 CharacterOffsetEnd=117 PartOfSpeech=DT Lemma=any]
[Text=idiot CharacterOffsetBegin=118 CharacterOffsetEnd=123 PartOfSpeech=NN Lemma=idiot]
[Text=. CharacterOffsetBegin=123 CharacterOffsetEnd=124 PartOfSpeech=. Lemma=.]
Sentence #3 (8 tokens):
Pass on this and buy something else.
[Text=Pass CharacterOffsetBegin=125 CharacterOffsetEnd=129 PartOfSpeech=VB Lemma=pass]
[Text=on CharacterOffsetBegin=130 CharacterOffsetEnd=132 PartOfSpeech=IN Lemma=on]
[Text=this CharacterOffsetBegin=133 CharacterOffsetEnd=137 PartOfSpeech=DT Lemma=this]
[Text=and CharacterOffsetBegin=138 CharacterOffsetEnd=141 PartOfSpeech=CC Lemma=and]
[Text=buy CharacterOffsetBegin=142 CharacterOffsetEnd=145 PartOfSpeech=VB Lemma=buy]
[Text=something CharacterOffsetBegin=146 CharacterOffsetEnd=155 PartOfSpeech=NN Lemma=something]
[Text=else CharacterOffsetBegin=156 CharacterOffsetEnd=160 PartOfSpeech=RB Lemma=else]
[Text=. CharacterOffsetBegin=160 CharacterOffsetEnd=161 PartOfSpeech=. Lemma=.]
Sentence #4 (2 tokens):
Very generic
[Text=Very CharacterOffsetBegin=162 CharacterOffsetEnd=166 PartOfSpeech=RB Lemma=very]
[Text=generic CharacterOffsetBegin=167 CharacterOffsetEnd=174 PartOfSpeech=JJ Lemma=generic]


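I have processed 10,000 positive and 10,000 negative files this way. How can I now easily find the most frequently occurring positive and negative features (adjectives)? Do I need to read all of the processed output files and build a frequency count of the adjectives myself, or does Stanford NLP offer a simpler way?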

Best answer

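Here is an example that processes an annotated review and stores its adjectives in a Counter.

In the example, the movie review "The movie was great!  It was a great film." has a "positive" sentiment.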

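I suggest changing the code to load each of your files, build an Annotation from the file's text, and record that file's sentiment.

You can then go through every file and build up, for each adjective, a counter of positive occurrences and a counter of negative occurrences.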
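The loading step suggested above could be sketched roughly as follows. This is only an illustration using the Java standard library: the class `LabeledReviewLoader`, the nested `LabeledReview` type, the `*.txt` file pattern, and the idea of keeping positive and negative reviews in two separate directories are all assumptions, not something taken from the question or from CoreNLP.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: pairs each review's raw text with a sentiment label.
final class LabeledReviewLoader {

    /** Raw review text plus the sentiment label of the directory it came from. */
    static final class LabeledReview {
        final String text;
        final String sentiment;

        LabeledReview(String text, String sentiment) {
            this.text = text;
            this.sentiment = sentiment;
        }
    }

    /** Reads every *.txt file in dir and labels it with the given sentiment. */
    static List<LabeledReview> loadDirectory(Path dir, String sentiment) throws IOException {
        List<LabeledReview> reviews = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path file : files) {
                String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                reviews.add(new LabeledReview(text, sentiment));
            }
        }
        return reviews;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: LabeledReviewLoader <positiveDir> <negativeDir>");
            return;
        }
        // Assumed layout: one directory of positive reviews, one of negative reviews.
        List<LabeledReview> all = new ArrayList<>();
        all.addAll(loadDirectory(Paths.get(args[0]), "positive"));
        all.addAll(loadDirectory(Paths.get(args[1]), "negative"));
        System.out.println("loaded " + all.size() + " reviews");
    }
}
```

Each `LabeledReview.text` can then be wrapped in `new Annotation(text)` and its `sentiment` used in place of the hard-coded `"positive"` string, so that one pipeline instance is created once and reused for all files.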

In the example code, the final Counter holds a count of 2 for the adjective "great".

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.stats.Counter;
import edu.stanford.nlp.stats.ClassicCounter;
import edu.stanford.nlp.util.CoreMap;

import java.util.Properties;

public class AdjectiveSentimentExample {

    public static void main(String[] args) throws Exception {

        Counter<String> adjectivePositiveCounts = new ClassicCounter<String>();
        Counter<String> adjectiveNegativeCounts = new ClassicCounter<String>();

        Annotation review = new Annotation("The movie was great!  It was a great film.");
        String sentiment = "positive";

        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        pipeline.annotate(review);
        // Walk every token of every sentence and tally the adjectives.
        for (CoreMap sentence : review.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel cl : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // Penn Treebank adjective tags are JJ, JJR (comparative) and JJS (superlative).
                if (cl.get(CoreAnnotations.PartOfSpeechAnnotation.class).startsWith("JJ")) {
                    if (sentiment.equals("positive")) {
                        adjectivePositiveCounts.incrementCount(cl.word());
                    } else if (sentiment.equals("negative")) {
                        adjectiveNegativeCounts.incrementCount(cl.word());
                    }
                }
            }
        }

        System.out.println("---");
        System.out.println("positive adjective counts");
        System.out.println(adjectivePositiveCounts);
    }
}
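
If the 20,000 files have already been processed and you would rather not annotate them again, another option is to count adjectives directly from the text output shown in the question. The following is a minimal sketch of that alternative, not part of CoreNLP: the class `ProcessedOutputAdjectives` and its methods are hypothetical, it assumes every token line has the `[Text=... PartOfSpeech=... Lemma=...]` shape shown above, and it counts the `Lemma` field rather than the surface word.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that reads the pipeline's plain-text output, not the raw reviews.
final class ProcessedOutputAdjectives {

    // Captures the POS tag and the lemma from one token line of the output.
    private static final Pattern TAG_AND_LEMMA =
            Pattern.compile("PartOfSpeech=(\\S+) Lemma=(\\S+)\\]");

    /** Adds the lemma of every adjective token (tags JJ, JJR, JJS) to counts. */
    static void countAdjectives(List<String> lines, Map<String, Integer> counts) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.startsWith("[Text=")) {
                continue; // sentence headers and raw sentence text are skipped
            }
            Matcher m = TAG_AND_LEMMA.matcher(trimmed);
            if (m.find() && m.group(1).startsWith("JJ")) {
                counts.merge(m.group(2), 1, Integer::sum);
            }
        }
    }

    /** Returns up to n entries, highest count first, ties in alphabetical order. */
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        // Three token lines copied from the processed output in the question.
        List<String> lines = new ArrayList<>();
        lines.add("[Text=short CharacterOffsetBegin=34 CharacterOffsetEnd=39 PartOfSpeech=JJ Lemma=short]");
        lines.add("[Text=DVD CharacterOffsetBegin=40 CharacterOffsetEnd=43 PartOfSpeech=NN Lemma=dvd]");
        lines.add("[Text=boring CharacterOffsetBegin=60 CharacterOffsetEnd=66 PartOfSpeech=JJ Lemma=boring]");

        Map<String, Integer> negativeCounts = new HashMap<>();
        countAdjectives(lines, negativeCounts);
        for (Map.Entry<String, Integer> e : topN(negativeCounts, 10)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In practice you would keep one map for the positive files and one for the negative files, pass the lines of each output file (for example from `java.nio.file.Files.readAllLines`) to `countAdjectives`, and call `topN` on each map at the end.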

09-09 21:13