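I'm trying to create an algorithm that iterates over a list of strings, joins strings together when a certain condition is met, and then skips ahead by the number of strings it joined, so that parts of the same joined string are not counted twice.
I understand that i = i + x or i += x inside a for loop won't change how far each iteration advances, so I'm looking for an alternative way to skip a variable number of iterations.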
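To make the behaviour described above concrete, here is a minimal sketch (the list name `seen` is only illustrative). Reassigning the loop variable inside a `for` loop over `range()` has no lasting effect, because the next value from the range overwrites it at the start of each iteration:

```python
# Reassigning i inside the body does not skip anything:
# range() supplies the next value at the top of every iteration.
seen = []
for i in range(5):
    seen.append(i)
    i += 2  # overwritten as soon as the next iteration starts

print(seen)  # [0, 1, 2, 3, 4]
```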
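Background: I'm trying to create a named entity recognition algorithm for news articles. I tokenize the text ('Prime Minister Jacinda Ardern is from New Zealand')
into ('Prime','Minister','Jacinda','Ardern','is'...)
and run the NLTK POS tagger over it to get: ... (('Jacinda','NNP'),('Ardern','NNP'),('is','VBZ')...
I then combine words when the following word is also 'NNP' (a proper noun).
The goal is to have 'Prime Minister Jacinda Ardern' counted as 1 string instead of 4, and then to skip the loop ahead by that many words, so that the next strings are not 'Minister Jacinda Ardern' and 'Jacinda Ardern'.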
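Context:
'text' is a list of (word, tag) pairs created by tokenizing and then POS-tagging my article, in the format: [...('She', 'PRP'), ('said', 'VBD'), ('the', 'DT'), ('roughly', 'RB'), ('25-minute', 'JJ'), ('meeting', 'NN')...]
'NNP' = a proper noun, i.e. the name of a place, person, organisation, etc.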
for (i) in range(len(text)):
    print(i)
    #initialising wordcounter as a variable
    wordcounter = 0
    # if text[i] is a Proper Noun, make namedEnt = the word.
    # then increase wordcounter by 1
    if text[i][1] == 'NNP':
        namedEnt = text[i][0]
        wordcounter +=1
        # while the next word in text is also a Proper Noun,
        # increase wordcounter by 1. Initialise J as = 1
        while text[i + wordcounter][1] == 'NNP':
            wordcounter +=1
            j = 1
            # While J is less than wordcounter, join text[i+j] to
            # namedEnt. Increase J by 1. When that is no longer
            # the case append namedEnt to a namedEntity list
            while j < wordcounter:
                namedEnt = ' '.join([namedEnt,text[i+j][0]])
                j += 1
        InitialNamedEntity.append(namedEnt)
    i += wordcounter
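If I print(i) at the start of the loop, it goes up by 1 every time. When I print a Counter of the NamedEntity list made up of the namedEnts, I get the following results:
(...'New Zealand': 7, 'Zealand': 7, 'United': 4, 'Prime Minister Minister Jacinda Minister Jacinda Ardern': 3...)
So I'm not only getting double-ups like 'New Zealand' and 'Zealand', but also wacky results like 'Prime Minister Minister Jacinda Minister Jacinda Ardern'.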
The result I'd like is
('New Zealand':7, 'United States':4,'Prime Minister Jacinda Ardern':3)
Any help would be much appreciated. Cheers
Best answer
If you need to adjust how i is incremented, don't use a for loop, because it always sets i to the next value in the range. Use a while loop:
i = 0
while i < len(text):
    ...
    i += wordcounter
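As a fuller illustration, here is one way the question's loop might look once it is rewritten around a while loop. This is a sketch under assumptions, not the asker's final code: the function name `merge_proper_nouns` and the hand-written `sample` list are hypothetical, and the inner loops are also simplified (the run of 'NNP' tokens is located first and joined once), which goes beyond what the answer itself prescribes.

```python
from collections import Counter

def merge_proper_nouns(tagged):
    """Join each run of consecutive 'NNP' tokens into one string."""
    entities = []
    i = 0
    while i < len(tagged):
        if tagged[i][1] != 'NNP':
            i += 1  # not a proper noun: move on by one token
            continue
        # Find the end of the run. The bounds check avoids an IndexError
        # when the text ends with a proper noun.
        j = i
        while j < len(tagged) and tagged[j][1] == 'NNP':
            j += 1
        entities.append(' '.join(word for word, _ in tagged[i:j]))
        i = j  # skip past every token that was just joined
    return entities

# Hand-written sample tags (an assumption, not real NLTK output).
sample = [('Prime', 'NNP'), ('Minister', 'NNP'), ('Jacinda', 'NNP'),
          ('Ardern', 'NNP'), ('is', 'VBZ'), ('from', 'IN'),
          ('New', 'NNP'), ('Zealand', 'NNP')]

entities = merge_proper_nouns(sample)
print(entities)  # ['Prime Minister Jacinda Ardern', 'New Zealand']
print(Counter(entities))
```

Because i jumps straight to the end of each run, partial strings such as 'Minister Jacinda Ardern' or 'Zealand' are never produced. A token that is 'NNP' on its own is still appended as a single-word entity.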
Source: "python - Adjusting iteration amount in a Python loop", a question on Stack Overflow: https://stackoverflow.com/questions/58478220/