I'm using Python with nltk and TextBlob for some text analysis. It's interesting that you can pass a POS to WordNet to make your search for synonyms more specific, but unfortunately the tagging in both nltk and TextBlob isn't "compatible" with the kind of input that WordNet expects for its synsets lookup.

Example

wordnet.synsets() requires that the POS you give it is one of n, v, a, r, like so:

    wn.synsets("dog", pos="n")

But a standard POS tagging from the Penn Treebank tagset (upenn_treebank) looks like JJ, VBD, VBZ, etc. So I'm looking for a good way to convert between the two.

Does anyone know of a good way to make this conversion happen, besides brute force?
Solution

If TextBlob is using the Penn Treebank (PTB) tagset, then just use the first character of the POS tag to map to the WN POS tag.

The WN POS tagset includes 'a' = adjective, 's' = satellite adjective, 'r' = adverb, 'n' = noun and 'v' = verb. Note that PTB adjective tags start with J (JJ, JJR, JJS), so they need an explicit mapping to 'a', while adverb tags (RB, RBR, RBS) lowercase to 'r' directly. As written, the loop in this session only looks up nouns and verbs.

Try:

>>> from nltk import word_tokenize, pos_tag
>>> from nltk.corpus import wordnet as wn
>>> text = 'this is a pos tagset in some foo bar paradigm'
>>> pos_tag(word_tokenize(text))
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('pos', 'NN'), ('tagset', 'NN'), ('in', 'IN'), ('some', 'DT'), ('foo', 'NN'), ('bar', 'NN'), ('paradigm', 'NN')]
>>> for tok, pos in pos_tag(word_tokenize(text)):
...     pos = pos[0].lower()
...     if pos in ['a', 'n', 'v']:
...         wn.synsets(tok, pos)
...
[Synset('be.v.01'), Synset('be.v.02'), Synset('be.v.03'), Synset('exist.v.01'), Synset('be.v.05'), Synset('equal.v.01'), Synset('constitute.v.01'), Synset('be.v.08'), Synset('embody.v.02'), Synset('be.v.10'), Synset('be.v.11'), Synset('be.v.12'), Synset('cost.v.01')]
[Synset('polonium.n.01'), Synset('petty_officer.n.01'), Synset('po.n.03'), Synset('united_states_post_office.n.01')]
[]
[]
[Synset('barroom.n.01'), Synset('bar.n.02'), Synset('bar.n.03'), Synset('measure.n.07'), Synset('bar.n.05'), Synset('prevention.n.01'), Synset('bar.n.07'), Synset('bar.n.08'), Synset('legal_profession.n.01'), Synset('stripe.n.05'), Synset('cake.n.01'), Synset('browning_automatic_rifle.n.01'), Synset('bar.n.13'), Synset('bar.n.14'), Synset('bar.n.15')]
[Synset('paradigm.n.01'), Synset('prototype.n.01'), Synset('substitution_class.n.01'), Synset('paradigm.n.04')]
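If adjectives and adverbs matter too, the first-character idea can be made explicit with a small lookup table. The following is only a sketch: `ptb_to_wordnet` and `PTB_PREFIX_TO_WN` are hypothetical names, not part of nltk or TextBlob, and it assumes the tagger emits Penn Treebank tags. It matches on the first two characters rather than one, so that RP (particle) is not mistaken for an adverb. The letters it returns are the same values nltk exposes as `wn.ADJ`, `wn.NOUN`, `wn.VERB` and `wn.ADV`.

```python
# Hypothetical helper (not an nltk or TextBlob API): map a Penn Treebank tag
# to the single-letter POS that WordNet lookups accept.
PTB_PREFIX_TO_WN = {
    "JJ": "a",  # JJ, JJR, JJS            -> adjective
    "NN": "n",  # NN, NNS, NNP, NNPS      -> noun
    "VB": "v",  # VB, VBD, VBG, VBN, ...  -> verb
    "RB": "r",  # RB, RBR, RBS            -> adverb
}


def ptb_to_wordnet(tag):
    """Return 'a', 'n', 'v' or 'r' for a PTB tag, or None if none applies."""
    if not tag:
        return None
    # Compare on the two-letter prefix so that e.g. RP (particle) stays unmapped.
    return PTB_PREFIX_TO_WN.get(tag[:2].upper())


if __name__ == "__main__":
    # Tagged tokens copied from the session above, plus two made-up pairs
    # to exercise the adjective and adverb branches.
    tagged = [("is", "VBZ"), ("pos", "NN"), ("in", "IN"),
              ("quick", "JJ"), ("quickly", "RB")]
    for tok, tag in tagged:
        print(tok, tag, ptb_to_wordnet(tag))
```

Tokens for which the helper returns None (determiners, prepositions and so on) can simply be skipped before calling `wn.synsets(tok, pos=...)`. Passing `pos=None` instead would search every part of speech.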