I have used NLTK to tokenize some Arabic text. However, I ended up with some results like:



(u'an arabic character/word', '``') or (u'an arabic character/word', ':')


However, the `` and : tags are not explained in the documentation, so I would like to find out what they mean.

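import nltk
from nltk.tokenize.punkt import PunktWordTokenizer  # available in older NLTK releases

z = "أنا تسلق شجرة"
tkn = PunktWordTokenizer()   # instantiate the tokenizer
sent = tkn.tokenize(z)       # list of word tokens
tokens = nltk.pos_tag(sent)  # default (English) POS tagger

print(tokens)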


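The default NLTK POS tagger is trained on English text and is meant for English text processing, see http://www.nltk.org/_modules/nltk/tag.html. It assigns Penn Treebank tags, so `` and : are Treebank punctuation tags that the English model has wrongly assigned to Arabic words. From the module source: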

from nltk.data import load

# Standard treebank POS tagger
_POS_TAGGER = 'taggers/maxent_treebank_pos_tagger/english.pickle'
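
To make the two puzzling tags concrete, here is a minimal sketch of a lookup table for the punctuation tags in the Penn Treebank tagset. The table and the `describe_tag` helper are hand-written for illustration and are not part of NLTK; the descriptions are short summaries, not official wording.

```python
# Hand-written summary of Penn Treebank punctuation tags.
# This is an illustrative table, not an NLTK API.
TREEBANK_PUNCT_TAGS = {
    "``": "opening quotation mark",
    "''": "closing quotation mark",
    ":": "mid-sentence punctuation such as a colon, semicolon, or ellipsis",
    ",": "comma",
    ".": "sentence-final punctuation",
}


def describe_tag(tag):
    """Return a short description for a Treebank punctuation tag."""
    return TREEBANK_PUNCT_TAGS.get(tag, "not a punctuation tag in this table")


if __name__ == "__main__":
    for tag in ("``", ":", "NN"):
        print(repr(tag), "->", describe_tag(tag))
```

Seeing an Arabic word tagged as an opening quotation mark is a good sign that the tagger does not know the language at all.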

This works for me to get the Stanford tools working in Python on Ubuntu 14.04.1:

$ cd ~
$ wget http://nlp.stanford.edu/software/stanford-postagger-full-2015-01-29.zip
$ unzip stanford-postagger-full-2015-01-29.zip
$ wget http://nlp.stanford.edu/software/stanford-segmenter-2015-01-29.zip
$ unzip stanford-segmenter-2015-01-29.zip
$ python


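from nltk.tag.stanford import POSTagger  # renamed StanfordPOSTagger in later NLTK releases

# Paths to the unzipped Stanford POS tagger model and jar
path_to_model = '/home/alvas/stanford-postagger-full-2015-01-30/models/arabic.tagger'
path_to_jar = '/home/alvas/stanford-postagger-full-2015-01-30/stanford-postagger-3.5.1.jar'

artagger = POSTagger(path_to_model, path_to_jar, encoding='utf8')
artagger._SEPARATOR = '/'  # separator between word and tag in the tagger's output
# tag() expects a list of tokens; a raw string is iterated character by character
tagged_sent = artagger.tag(u"أنا تسلق شجرة")
print(tagged_sent)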


$ python3 test.py
[('أ', 'NN'), ('ن', 'NN'), ('ا', 'NN'), ('ت', 'NN'), ('س', 'RP'), ('ل', 'IN'), ('ق', 'NN'), ('ش', 'NN'), ('ج', 'NN'), ('ر', 'NN'), ('ة', 'PRP')]
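
The output above is tagged one character at a time, which is most likely because the tagger was handed a raw string where it expects a list of tokens. The sketch below only shows the difference between iterating over the string and splitting it first; whitespace splitting is an assumption made for illustration and is a naive stand-in for a real Arabic tokenizer or segmenter.

```python
# -*- coding: utf-8 -*-
sentence = u"أنا تسلق شجرة"

# Iterating over a string yields single characters (spaces dropped here).
chars = [c for c in sentence if not c.isspace()]

# Splitting on whitespace yields word-level tokens instead.
tokens = sentence.split()

print(len(chars), "characters vs", len(tokens), "tokens")
print(tokens)
```

Passing the token list, as in artagger.tag(tokens), should then give one tag per word rather than one per letter.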


If you have Java problems when using the Stanford POS tagger, see the DELPH-IN wiki: http://moin.delph-in.net/ZhongPreprocessing
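
Before digging into Java configuration, it can help to rule out the simplest causes. This is a rough sketch that assumes Python 3.3+ (for shutil.which); `preflight` is a hypothetical helper, not part of NLTK, and it only checks that a java executable is on PATH and that the model and jar files exist at the paths used above.

```python
import os
import shutil


def preflight(model_path, jar_path):
    """Return a list of human-readable problems; an empty list means
    nothing obvious is wrong (it does not prove the tagger will run)."""
    problems = []
    # The Stanford tagger is a Java program, so a java executable is needed.
    if shutil.which("java") is None:
        problems.append("no 'java' executable found on PATH")
    # Both files must exist where the tagger wrapper is told to look.
    for label, path in (("model", model_path), ("jar", jar_path)):
        if not os.path.isfile(path):
            problems.append("%s file not found: %s" % (label, path))
    return problems


if __name__ == "__main__":
    for problem in preflight(
        "/home/alvas/stanford-postagger-full-2015-01-30/models/arabic.tagger",
        "/home/alvas/stanford-postagger-full-2015-01-30/stanford-postagger-3.5.1.jar",
    ):
        print(problem)
```

A missing file is worth fixing first, since the directory name in the paths has to match the name of the folder that was actually unzipped.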

