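In the NLTK introduction book, they show how to build a concordance to get the context around a given word. But I want something more sophisticated. Can I get the text around some pattern? Something like this: text.concordances(", [A-Za-z]+ , ")
~ all words surrounded by spaces and commas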
Best answer
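In short, NLTK in its current state cannot build a concordance from a regular expression. The difficulty with NLTK's ConcordanceIndex class (or its subclasses), which is what you are using, is that the class takes a list of tokens as its argument (and is built around those tokens) rather than a full text string.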
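To make the limitation concrete, here is a minimal sketch of why a token-keyed index cannot answer a pattern that spans several tokens. The `offsets` dictionary is a hypothetical, simplified stand-in for such an index, not NLTK's actual implementation, and the token list is invented for illustration.

```python
import re
from collections import defaultdict

# Hypothetical, simplified stand-in for a token-keyed index (not NLTK's code).
tokens = ["Emma", "Woodhouse", ",", "handsome", ",", "clever", ",", "and", "rich"]

offsets = defaultdict(list)
for position, token in enumerate(tokens):
    offsets[token].append(position)

pattern = re.compile(r", [A-Za-z]+ ,")

# Looking the phrase up as a key finds nothing: every key is a single token.
key_hits = offsets.get(", handsome ,", [])

# No individual token can satisfy a pattern that spans three tokens.
token_hits = [t for t in tokens if pattern.fullmatch(t)]

# The same pattern does match once the text is a single string again.
joined = " ".join(tokens)
string_hits = [m.group(0) for m in pattern.finditer(joined)]

print(key_hits, token_hits, string_hits)
```

Only the string-level search finds anything, which is why working on the raw text is the simpler route here.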
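My suggestion is to create your own class that takes a string as its argument rather than tokens. Here is a class loosely based on NLTK's ConcordanceIndex class that may serve as a starting point: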
import re


class RegExConcordanceIndex(object):
    "Class to mimic nltk's ConcordanceIndex.print_concordance."

    def __init__(self, text):
        self._text = text

    def print_concordance(self, regex, width=80, lines=25, demarcation=''):
        """
        Prints n <= @lines contexts for @regex with a context <= @width.
        Make @lines 0 to display all matches.
        Designate @demarcation to enclose matches in demarcating characters.
        """
        concordance = []
        for match in re.finditer(regex, self._text, flags=re.M):
            start, end = match.start(), match.end()
            match_width = end - start
            remaining = (width - match_width) // 2
            # Left context: up to `remaining` characters before the match,
            # cut short at the last newline so it stays on the match's line.
            context_start = self._text[max(start - remaining, 0):start]
            context_start = context_start.split('\n')[-1]
            # Right context, cut short at the first newline.
            context_end = self._text[end:end + remaining].split('\n')[0]
            concordance.append(context_start + demarcation
                               + self._text[start:end]
                               + demarcation + context_end)
            if lines and len(concordance) >= lines:
                break
        # re.finditer returns an iterator, which is always truthy, so test
        # the collected results rather than the iterator itself.
        if concordance:
            print("Displaying %s matches:" % len(concordance))
            print('\n'.join(concordance))
        else:
            print("No matches")
You can now test the class like this:
>>> from nltk.corpus import gutenberg
>>> emma = gutenberg.raw(fileids='austen-emma.txt')
>>> comma_separated = RegExConcordanceIndex(emma)
>>> comma_separated.print_concordance(r"(?<=, )[A-Za-z]+(?=,)", demarcation='**') # matches are enclosed in double asterisks
Displaying 25 matches:
Emma Woodhouse, **handsome**, clever, and rich, with a comfortab
Emma Woodhouse, handsome, **clever**, and rich, with a comfortable home
The real evils, **indeed**, of Emma's situation were the power
o her many enjoyments. The danger, **however**, was at present
well-informed, **useful**, gentle, knowing all the ways of the
well-informed, useful, **gentle**, knowing all the ways of the family,
a good-humoured, **pleasant**, excellent man, that he thoroughly
"No, **papa**, nobody thought of your walking. We
"I believe it is very true, my dear, **indeed**," said Mr. Woodhouse,
should not like her so well as we do, **sir**,
e none for myself, papa; but I must, **indeed**,
met with him in Broadway Lane, **when**, because it began to drizzle,
like Mr. Elton, **papa**,--I must look about for a wife for hi
"With a great deal of pleasure, **sir**, at any time," said Mr. Knightley,
better thing. Invite him to dinner, **Emma**, and help him to the best
y. He had received a good education, **but**,
Miss Churchill, **however**, being of age, and with the full co
From the expense of the child, **however**, he was soon relieved.
It was most unlikely, **therefore**, that he should ever want his
strong enough to affect one so dear, **and**, as he believed,
It was, **indeed**, a highly prized letter. Mrs. Westo
and he had, **therefore**, earnestly tried to dissuade them
Fortunately for him, **Highbury**, including Randalls in the same par
handsome, **rich**, nor married. Miss Bates stood in th
a real, **honest**, old-fashioned Boarding-school, wher
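The pattern in the session above wraps the word in a lookbehind and a lookahead instead of matching the commas directly. A minimal sketch of the difference, using an invented sample string (the variable names are illustrative only): a pattern that consumes the commas includes them in the match and also uses up the comma that the next word would need in front of it, so adjacent words are skipped.

```python
import re

sample = "Emma Woodhouse, handsome, clever, and rich"

# Consuming pattern: the delimiters are part of each match, and the comma
# after "handsome" is used up, so "clever" has no ", " left in front of it.
consuming = [m.group(0) for m in re.finditer(r", [A-Za-z]+,", sample)]

# Lookaround pattern: the delimiters are checked but not consumed, so only
# the word itself is matched and neighbouring words can both match.
lookaround = [m.group(0) for m in re.finditer(r"(?<=, )[A-Za-z]+(?=,)", sample)]

print(consuming)
print(lookaround)
```

Here the consuming pattern yields only `', handsome,'`, while the lookaround pattern yields both `'handsome'` and `'clever'`, which matches the first two lines of the concordance output.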
Regarding "python - Can you do regex with concordance?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23555995/