我想对一些意大利语文本进行词法化处理,以便对单词进行频率计数,并对这种词法化后的内容的输出进行进一步的调查。
我更喜欢使用词组比词干,因为我可以从句子的上下文中提取词的含义(例如,区分动词和名词)并获取语言中存在的词,而不是通常没有这些词的词根一个意义。
我发现了这个名为pattern
(pip2 install pattern
)的库,该库应该对nltk
进行补充,以对意大利语言进行词义化,但是我不确定以下方法是否正确,因为每个单词都是由自身词义化的,而不是在a上下文中句子。
可能我应该给pattern
负责标记一个句子的标记(因此也要用有关动词/名词/形容词的元数据对每个单词进行注释),然后检索经过修饰的单词,但我无法做到这一点,我甚至不确定目前可能吗?
另外:在意大利语中,有些文章带有撇号,因此“l'appartamento”(英语为“the flat”)实际上是2个单词:“lo”和“appartamento”。现在,我无法找到一种通过nltk
和pattern
组合拆分这两个单词的方法,因此我无法以正确的方式计算单词的频率。
import nltk
import string
import pattern
# dictionary of Italian stop-words
it_stop_words = nltk.corpus.stopwords.words('italian')
# Snowball stemmer with rules for the Italian language
ita_stemmer = nltk.stem.snowball.ItalianStemmer()
# the following function is just to get the lemma
# out of the original input word (but right now
# it may be loosing the context about the sentence
# from where the word is coming from i.e.
# the same word could either be a noun/verb/adjective
# according to the context)
def lemmatize_word(input_word):
in_word = input_word#.decode('utf-8')
# print('Something: {}'.format(in_word))
word_it = pattern.it.parse(
in_word,
tokenize=False,
tag=False,
chunk=False,
lemmata=True
)
# print("Input: {} Output: {}".format(in_word, word_it))
the_lemmatized_word = word_it.split()[0][0][4]
# print("Returning: {}".format(the_lemmatized_word))
return the_lemmatized_word
it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."
# 1st tokenize the sentence(s)
word_tokenized_list = nltk.tokenize.word_tokenize(it_string)
print("1) NLTK tokenizer, num words: {} for list: {}".format(len(word_tokenized_list), word_tokenized_list))
# 2nd remove punctuation and everything lower case
word_tokenized_no_punct = [string.lower(x) for x in word_tokenized_list if x not in string.punctuation]
print("2) Clean punctuation, num words: {} for list: {}".format(len(word_tokenized_no_punct), word_tokenized_no_punct))
# 3rd remove stop words (for the Italian language)
word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]
print("3) Clean stop-words, num words: {} for list: {}".format(len(word_tokenized_no_punct_no_sw), word_tokenized_no_punct_no_sw))
# 4.1 lemmatize the words
word_tokenize_list_no_punct_lc_no_stowords_lemmatized = [lemmatize_word(x) for x in word_tokenized_no_punct_no_sw]
print("4.1) lemmatizer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_lemmatized), word_tokenize_list_no_punct_lc_no_stowords_lemmatized))
# 4.2 snowball stemmer for Italian
word_tokenize_list_no_punct_lc_no_stowords_stem = [ita_stemmer.stem(i) for i in word_tokenized_no_punct_no_sw]
print("4.2) stemmer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_stem), word_tokenize_list_no_punct_lc_no_stowords_stem))
# difference between stemmer and lemmatizer
print(
"For original word(s) '{}' and '{}' the stemmer: '{}' '{}' (count 1 each), the lemmatizer: '{}' '{}' (count 2)"
.format(
word_tokenized_no_punct_no_sw[1],
word_tokenized_no_punct_no_sw[6],
word_tokenize_list_no_punct_lc_no_stowords_stem[1],
word_tokenize_list_no_punct_lc_no_stowords_stem[6],
word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1],
word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1]
)
)
给出以下输出:
1) NLTK tokenizer, num words: 20 for list: ['Ieri', 'sono', 'andato', 'in', 'due', 'supermercati', '.', 'Oggi', 'volevo', 'andare', "all'ippodromo", '.', 'Stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure', '.']
2) Clean punctuation, num words: 17 for list: ['ieri', 'sono', 'andato', 'in', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure']
3) Clean stop-words, num words: 12 for list: ['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'pizza', 'verdure']
4.1) lemmatizer, num words: 12 for list: [u'ieri', u'andarsene', u'due', u'supermercato', u'oggi', u'volere', u'andare', u"all'ippodromo", u'stasera', u'mangiare', u'pizza', u'verdura']
4.2) stemmer, num words: 12 for list: [u'ier', u'andat', u'due', u'supermerc', u'oggi', u'vol', u'andar', u"all'ippodrom", u'staser', u'mang', u'pizz', u'verdur']
For original word(s) 'andato' and 'andare' the stemmer: 'andat' 'andar' (count 1 each), the lemmatizer: 'andarsene' 'andarsene' (count 2)
pattern
的分词器有效地使某些句子脱句? (假设引理被识别为名词/动词/形容词等)pattern
的python替代品,可用于nltk
的意大利词条化? 最佳答案
我会尽力回答您的问题,因为我对意大利语并不了解!
1)据我所知,删除撇号的主要责任是 token 生成器,因此nltk
意大利语 token 生成器似乎已失败。
3)一个简单的方法就是调用replace
方法(尽管对于更复杂的模式,您可能必须使用re
包),例如:
word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw]
word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x]
它产生:
['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', 'all', 'ippodromo', 'stasera', 'mangio', 'pizza', 'verdure']
2)模式的一种替代方法是
treetagger
,因为它不是所有方法中最简单的安装(您需要python package和tool itself,但是在这部分之后,它可以在Windows和Linux上运行)。一个简单的例子,上面的例子:
import treetaggerwrapper
from pprint import pprint
it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."
tagger = treetaggerwrapper.TreeTagger(TAGLANG="it")
tags = tagger.tag_text(it_string)
pprint(treetaggerwrapper.make_tags(tags))
pprint
产生:[Tag(word=u'Ieri', pos=u'ADV', lemma=u'ieri'),
Tag(word=u'sono', pos=u'VER:pres', lemma=u'essere'),
Tag(word=u'andato', pos=u'VER:pper', lemma=u'andare'),
Tag(word=u'in', pos=u'PRE', lemma=u'in'),
Tag(word=u'due', pos=u'ADJ', lemma=u'due'),
Tag(word=u'supermercati', pos=u'NOM', lemma=u'supermercato'),
Tag(word=u'.', pos=u'SENT', lemma=u'.'),
Tag(word=u'Oggi', pos=u'ADV', lemma=u'oggi'),
Tag(word=u'volevo', pos=u'VER:impf', lemma=u'volere'),
Tag(word=u'andare', pos=u'VER:infi', lemma=u'andare'),
Tag(word=u"all'", pos=u'PRE:det', lemma=u'al'),
Tag(word=u'ippodromo', pos=u'NOM', lemma=u'ippodromo'),
Tag(word=u'.', pos=u'SENT', lemma=u'.'),
Tag(word=u'Stasera', pos=u'ADV', lemma=u'stasera'),
Tag(word=u'mangio', pos=u'VER:pres', lemma=u'mangiare'),
Tag(word=u'la', pos=u'DET:def', lemma=u'il'),
Tag(word=u'pizza', pos=u'NOM', lemma=u'pizza'),
Tag(word=u'con', pos=u'PRE', lemma=u'con'),
Tag(word=u'le', pos=u'DET:def', lemma=u'il'),
Tag(word=u'verdure', pos=u'NOM', lemma=u'verdura'),
Tag(word=u'.', pos=u'SENT', lemma=u'.')]
在定形化之前,它还可以很好地将
all'ippodromo
标记为al
和ippodromo
(希望是正确的)。现在,我们只需要应用停用词和标点符号即可,这样就可以了。python的The doc for installing the TreeTaggerWrapper库
关于python-2.7 - 使意大利语句子合法化以进行频率计数,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/45403390/