Suppose I have this text:
'he is hdajs asdas da he is not asd as da s i am a da daas you am a'
I created all the bigrams from this text:
>>> bigrams_
[('he', 'is'), ('is', 'hdajs'), ('hdajs', 'asdas'), ('asdas', 'da'), ('da', 'he'), ('he', 'is'), ('is', 'not'), ('not', 'asd'), ('asd', 'as'), ('as', 'da'), ('da', 's'), ('s', 'i'), ('i', 'am'), ('am', 'a'), ('a', 'da'), ('da', 'daas'), ('daas', 'you'), ('you', 'am'), ('am', 'a')]
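The question does not show how bigrams_ was built. One possible way, assuming simple whitespace tokenization (the names text and words are illustrative, not from the original post), is to zip the word list with itself shifted by one:

```python
# Assumption: the text is tokenized by splitting on whitespace.
text = 'he is hdajs asdas da he is not asd as da s i am a da daas you am a'
words = text.split()

# Pair each word with its successor to form the bigrams.
bigrams_ = list(zip(words, words[1:]))
```

nltk.bigrams would produce the same pairs if NLTK is already in use.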
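Now I want to build a new list of tuples in which the first element of each tuple is a running count showing how many times that bigram has been seen up to that point in the text, and the second element is the bigram from the original list. For example, by the time we reach the last element of the list above,
('am', 'a')
has been seen 2 times, so in the new list it corresponds to the tuple (2, ('am', 'a')).
What is a concise, Pythonic way to do this?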
Best answer
You can use a defaultdict whose default value is an itertools.count object, and call next on the counter for each key as you encounter it, for example:
from collections import defaultdict
from itertools import count
dd = defaultdict(lambda: count(1))
bigrams = [('he', 'is'), ('is', 'hdajs'), ('hdajs', 'asdas'), ('asdas', 'da'), ('da', 'he'), ('he', 'is'), ('is', 'not'), ('not', 'asd'), ('asd', 'as'), ('as', 'da'), ('da', 's'), ('s', 'i'), ('i', 'am'), ('am', 'a'), ('a', 'da'), ('da', 'daas'), ('daas', 'you'), ('you', 'am'), ('am', 'a')]
with_count = [(next(dd[bigram]), bigram) for bigram in bigrams]
This gives you:
[(1, ('he', 'is')),
(1, ('is', 'hdajs')),
(1, ('hdajs', 'asdas')),
(1, ('asdas', 'da')),
(1, ('da', 'he')),
(2, ('he', 'is')),
(1, ('is', 'not')),
(1, ('not', 'asd')),
(1, ('asd', 'as')),
(1, ('as', 'da')),
(1, ('da', 's')),
(1, ('s', 'i')),
(1, ('i', 'am')),
(1, ('am', 'a')),
(1, ('a', 'da')),
(1, ('da', 'daas')),
(1, ('daas', 'you')),
(1, ('you', 'am')),
(2, ('am', 'a'))]
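If the defaultdict of count objects feels opaque, the same running tally can be kept in a plain dict. This is only an alternative sketch, not part of the original answer; the helper name running_counts and the shortened sample list are hypothetical:

```python
def running_counts(items):
    """Pair each item with the number of times it has been seen so far."""
    seen = {}      # item -> occurrences seen up to the current position
    result = []
    for item in items:
        seen[item] = seen.get(item, 0) + 1
        result.append((seen[item], item))
    return result


# Shortened, made-up sample; any list of hashable items works.
sample = [('he', 'is'), ('is', 'not'), ('he', 'is'), ('am', 'a'), ('am', 'a')]
print(running_counts(sample))
# [(1, ('he', 'is')), (1, ('is', 'not')), (2, ('he', 'is')), (1, ('am', 'a')), (2, ('am', 'a'))]
```

The explicit loop trades the one-line comprehension for readability, and the seen dict ends up holding the final total for each bigram.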