Problem description
I want to use lemmatization on a text file:
surprise heard thump opened door small seedy man clasping package wrapped.
upgrading system found review spring 2008 issue moody audio backed.
omg left gotta wrap review order asap . understand hand delivered dali lama
speak hands wear earplugs lives . listen maintain link long .
cables cables finally able hear gem long rumored music .
...
and the expected result is:
surprise heard thump open door small seed man clasp package wrap.
upgrade system found review spring 2008 issue mood audio back.
omg left gotta wrap review order asap . understand hand deliver dali lama
speak hand wear earplug live . listen maintain link long .
cable cable final able hear gem long rumor music .
...
Can anybody help me? Does anyone know the simplest way to do lemmatization, implemented in Scala and Spark?
Recommended answer
There is a function from the book Advanced Analytics with Spark, in the chapter about lemmatization:
import java.util.Properties

import edu.stanford.nlp.pipeline._
import edu.stanford.nlp.ling.CoreAnnotations._
import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer

val plainText = sc.parallelize(List("Sentence to be processed."))
val stopWords = Set("stopWord")

// Run the CoreNLP pipeline (tokenize, sentence split, POS tag, lemmatize) on one
// piece of text and return the lower-cased lemmas, dropping stop words and tokens
// shorter than three characters.
def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}
val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
lemmatized.foreach(println)
Now just use this for every line in the mapper:
val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
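To lemmatize the text file from the question, the same function can be applied to an RDD read with sc.textFile, joining the lemmas back into one line per input line. A minimal sketch, assuming hypothetical paths input.txt and output:

val lines = sc.textFile("input.txt")  // one RDD element per line of the file
val lemmatizedLines = lines.map(line => plainTextToLemmas(line, stopWords).mkString(" "))
lemmatizedLines.saveAsTextFile("output")  // write the lemmatized lines back out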
EDIT:
I added the following line to the code:
import scala.collection.JavaConversions._
This is needed because otherwise the sentences are a Java List, not a Scala one. It should now compile without problems.
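On newer Scala versions, where JavaConversions is deprecated, the same conversion can be written explicitly with JavaConverters. A sketch of just the loop, everything else unchanged (this is an alternative, not part of the original answer):

import scala.collection.JavaConverters._  // explicit conversions instead of JavaConversions

val sentences = doc.get(classOf[SentencesAnnotation]).asScala
for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation]).asScala) {
  val lemma = token.get(classOf[LemmaAnnotation])
  if (lemma.length > 2 && !stopWords.contains(lemma)) {
    lemmas += lemma.toLowerCase
  }
}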
I used Scala 2.10.4 and the following stanford.nlp dependencies:
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.2</version>
</dependency>
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.2</version>
  <classifier>models</classifier>
</dependency>
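If you build with sbt rather than Maven, the equivalent dependencies would look roughly like this (a sketch with the same version numbers):

libraryDependencies ++= Seq(
  "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2" classifier "models"
)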
You can also look at the stanford.nlp page, which has a lot of examples (in Java): http://nlp.stanford.edu/software/corenlp.shtml
EDIT:
mapPartitions version:
Although I don't know whether it will speed up the job significantly.
// Same lemmatizer, but the CoreNLP pipeline is passed in so it can be reused
// across many lines instead of being rebuilt for each one.
def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP): Seq[String] = {
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

// Build the (expensive) pipeline once per partition instead of once per line.
val lemmatized = plainText.mapPartitions(p => {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  p.map(q => plainTextToLemmas(q, stopWords, pipeline))
})
lemmatized.foreach(println)
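The gain from mapPartitions is that the StanfordCoreNLP pipeline, which loads the POS tagger and lemmatizer models, is constructed once per partition rather than once per line. If the stop-word set is large, it can also be broadcast so it is shipped to each executor only once; a sketch (not from the original answer):

val bcStopWords = sc.broadcast(stopWords)
val lemmatized = plainText.mapPartitions(p => {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  // Read the broadcast value on the executor instead of capturing the set in the closure.
  p.map(q => plainTextToLemmas(q, bcStopWords.value, pipeline))
})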