Problem Description
I'm not a [computational] linguist, so please excuse my naivety on this topic.
According to Wikipedia, lemmatisation is defined as:
Lemmatisation (or lemmatization) in linguistics is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.
Now my question is, is the lemmatised version of any member of the set {am, is, are} supposed to be "be"? If not, why not?
Second question: How do I get that in R or Python? I've tried methods like this link, but none of them gives "be" for "are". I guess, at least for the purpose of classifying text documents, it makes sense for this to be true.
I also couldn't get that from any of the demos given here.
What am I doing/assuming wrong?
Recommended Answer
So here is a way to do it in R, using the Northwestern University lemmatizer, MorphAdorner.
lemmatize <- function(wordlist) {
    require(httr)
    require(XML)
    get.lemma <- function(word, url) {
        # Query the MorphAdorner web service for a single word
        response <- GET(url, query = list(spelling = word, standardize = "",
                                          wordClass = "", wordClass2 = "",
                                          corpusConfig = "ncf",  # Nineteenth Century Fiction
                                          media = "xml"))
        content <- content(response, type = "text")
        xml <- xmlInternalTreeParse(content)
        # Extract the text of the first <lemma> element in the XML response
        return(xmlValue(xml["//lemma"][[1]]))
    }
    url <- "http://devadorner.northwestern.edu/maserver/lemmatizer"
    return(sapply(wordlist, get.lemma, url = url))
}
words <- c("is", "am", "was", "are")
lemmatize(words)
#   is   am  was  are
# "be" "be" "be" "be"
As I suspect you are aware, correct lemmatization requires knowledge of the word class (part of speech) and the contextually correct spelling, and it also depends on which corpus is being used.