Problem description
I want to use lemmatization on a text file:
surprise heard thump opened door small seedy man clasping package wrapped.
upgrading system found review spring 2008 issue moody audio backed.
omg left gotta wrap review order asap . understand hand delivered dali lama
speak hands wear earplugs lives . listen maintain link long .
cables cables finally able hear gem long rumored music .
...
and the expected result is:
surprise heard thump open door small seed man clasp package wrap.
upgrade system found review spring 2008 issue mood audio back.
omg left gotta wrap review order asap . understand hand deliver dali lama
speak hand wear earplug live . listen maintain link long .
cable cable final able hear gem long rumor music .
...
Can anybody help me? Does anyone know the simplest way to do lemmatization, implemented in Scala and Spark?
Recommended answer
There is a function from the book Advanced Analytics with Spark, in the chapter about lemmatization:
import java.util.Properties

import edu.stanford.nlp.pipeline._
import edu.stanford.nlp.ling.CoreAnnotations._
import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer

val plainText = sc.parallelize(List("Sentence to be processed."))
val stopWords = Set("stopWord")

// Run the CoreNLP pipeline (tokenize, sentence split, POS tag, lemmatize) on one
// piece of text and return the lower-cased lemmas, dropping stop words and tokens
// shorter than three characters.
def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}
val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
lemmatized.foreach(println)
Now just use this for every line in the mapper:
val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
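To lemmatize the text file from the question, the same function can be applied to an RDD read with sc.textFile, joining the lemmas back into one line per input line. A minimal sketch, assuming hypothetical paths input.txt and output:

val lines = sc.textFile("input.txt")  // one RDD element per line of the file
val lemmatizedLines = lines.map(line => plainTextToLemmas(line, stopWords).mkString(" "))
lemmatizedLines.saveAsTextFile("output")  // write the lemmatized lines back out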
EDIT:
I added the following line to the code:
import scala.collection.JavaConversions._
This is needed because otherwise the sentences are a Java List, not a Scala one. It should now compile without problems.
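On newer Scala versions, where JavaConversions is deprecated, the same conversion can be written explicitly with JavaConverters. A sketch of just the loop, everything else unchanged (this is an alternative, not part of the original answer):

import scala.collection.JavaConverters._  // explicit conversions instead of JavaConversions

val sentences = doc.get(classOf[SentencesAnnotation]).asScala
for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation]).asScala) {
  val lemma = token.get(classOf[LemmaAnnotation])
  if (lemma.length > 2 && !stopWords.contains(lemma)) {
    lemmas += lemma.toLowerCase
  }
}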
I used Scala 2.10.4 and the following stanford.nlp dependencies:
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.2</version>
</dependency>
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.2</version>
  <classifier>models</classifier>
</dependency>
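If you build with sbt rather than Maven, the equivalent dependencies would look roughly like this (a sketch with the same version numbers):

libraryDependencies ++= Seq(
  "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2" classifier "models"
)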
You can also look at the stanford.nlp page, which has a lot of examples (in Java): http://nlp.stanford.edu/software/corenlp.shtml
EDIT:
mapPartitions version:
Although I don't know whether it will speed up the job significantly.
// Same lemmatizer, but the CoreNLP pipeline is passed in so it can be reused
// across many lines instead of being rebuilt for each one.
def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP): Seq[String] = {
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

// Build the (expensive) pipeline once per partition instead of once per line.
val lemmatized = plainText.mapPartitions(p => {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  p.map(q => plainTextToLemmas(q, stopWords, pipeline))
})
lemmatized.foreach(println)
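The gain from mapPartitions is that the StanfordCoreNLP pipeline, which loads the POS tagger and lemmatizer models, is constructed once per partition rather than once per line. If the stop-word set is large, it can also be broadcast so it is shipped to each executor only once; a sketch (not from the original answer):

val bcStopWords = sc.broadcast(stopWords)
val lemmatized = plainText.mapPartitions(p => {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  // Read the broadcast value on the executor instead of capturing the set in the closure.
  p.map(q => plainTextToLemmas(q, bcStopWords.value, pipeline))
})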