python - Counting phrase frequency in an HTML file

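I'm still getting used to Python and recently ran into a problem with my code: I can't get it to count how many times a phrase occurs in an HTML file. I recently got some help building code that counts phrase frequencies in a text file, but I'd like to know whether there is a way to do this directly from an HTML file (without having to copy and paste the text out first). Any suggestions would be much appreciated. The code I have been using is below: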

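#!/usr/bin/env python3
import collections
import re
from itertools import tee

# Generator that yields every word in the file, lowercased.
def findWords(filepath):
  with open(filepath) as infile:
    for line in infile:
      # Raw string so that \w is passed to the regex engine unchanged.
      words = re.findall(r'\w+', line.lower())
      yield from words

phcnt = collections.Counter()
phrases = {'central bank', 'high inflation'}

# Two copies of the word stream; advancing the second by one word
# lets zip() pair each word with the word that follows it.
fw1, fw2 = tee(findWords('02.2003.BenBernanke.txt'))
next(fw2, None)  # the default avoids StopIteration on an empty file
for w1, w2 in zip(fw1, fw2):
  phrase = ' '.join([w1, w2])
  if phrase in phrases:
    phcnt[phrase] += 1

print(phcnt)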

Best answer

You can use str.count(some_phrase), which returns the number of non-overlapping occurrences of the phrase in a string:

In [19]: txt = 'Text mining, also referred to as text data mining, Text mining,\
         also referred to as text data mining,'
In [20]: txt.lower().count('data mining')
Out[20]: 2
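
To apply the same idea to an HTML file directly, one option is to strip the markup first and then count. The following is only a minimal sketch using the standard library's html.parser. The names TextExtractor, html_to_text and count_phrases, as well as the sample markup, are made up for illustration and are not part of the original code. It also swaps str.count for a regex with word boundaries, so that 'data mining' does not match inside 'metadata mining'.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes with spaces, then collapse runs of whitespace so a
    # phrase broken across lines or tags can still match.
    return " ".join(" ".join(parser.chunks).split())


def count_phrases(text, phrases):
    """Count case-insensitive, whole-word occurrences of each phrase."""
    text = text.lower()
    return {
        p: len(re.findall(r"\b" + re.escape(p.lower()) + r"\b", text))
        for p in phrases
    }


# Made-up sample markup, standing in for the contents of a real file.
sample = (
    "<html><head><style>p {color: red}</style>"
    "<script>var s = 'central bank';</script></head>"
    "<body><p>The Central Bank fears high\ninflation.</p>"
    "<p>A <b>central</b> bank acts.</p></body></html>"
)
counts = count_phrases(html_to_text(sample), {"central bank", "high inflation"})
print(counts)
```

For a file on disk, read it into a string first (for example with open(path, encoding='utf-8') and .read(), assuming the file is UTF-8) and pass that string to html_to_text. One caveat of joining text nodes with spaces: a word split by inline markup in the middle (such as cen<b>tral</b>) would come out as two words.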