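I need to cut a Unicode string that is actually an article (it consists of sentences). In Python, I want to cut this article string after the Xth sentence.

A good indicator that a sentence has ended is that it ends with a full stop (".") and the next word starts with a capital letter. For example: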

myarticle = "Hi, this is my first sentence. And this is my second. Yet this is my third."

How can I do this?

Thanks

Best answer

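Consider downloading the Natural Language Toolkit (NLTK). Its sentence tokenizer will not break a sentence at abbreviations such as "U.S.A.", and it will not fail to split sentences that end in "?!".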

>>> import nltk
>>> # sent_tokenize needs NLTK's Punkt tokenizer data, installed once via nltk.download()
>>> paragraph = u"Hi, this is my first sentence. And this is my second. Yet this is my third."
>>> sentences = nltk.sent_tokenize(paragraph)
>>> sentences
[u"Hi, this is my first sentence.", u"And this is my second.", u"Yet this is my third."]

Your code also becomes more readable. To access the second sentence, use the indexing notation you are used to:
>>> sentences[1]
u"And this is my second."
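
To actually cut the article after sentence X, the list of sentences can be sliced and joined back together. The following is a minimal sketch, not a definitive implementation. It assumes that joining sentences with a single space is acceptable. `naive_split` and `cut_after_sentences` are hypothetical helper names, not part of NLTK. The default splitter applies the heuristic from the question (a period, then whitespace, then a capital letter) so the example runs without NLTK; passing `nltk.sent_tokenize` as `tokenize` should work the same way, since it also takes a string and returns a list of sentences.

```python
import re

# Assumption: a sentence boundary is whitespace that follows "." and
# precedes an uppercase ASCII letter (the heuristic from the question).
_BOUNDARY = re.compile(r"(?<=\.)\s+(?=[A-Z])")


def naive_split(text):
    """Split text into sentences using the period-plus-capital heuristic."""
    return _BOUNDARY.split(text.strip())


def cut_after_sentences(text, n, tokenize=naive_split):
    """Return the first n sentences of text, joined by single spaces."""
    if n <= 0:
        return ""
    return " ".join(tokenize(text)[:n])


myarticle = "Hi, this is my first sentence. And this is my second. Yet this is my third."
print(cut_after_sentences(myarticle, 2))
# Hi, this is my first sentence. And this is my second.
```

The regular-expression splitter will wrongly break after abbreviations such as "Mr. Smith", which is the reason the answer recommends NLTK for real articles.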

For more on cutting a string after the Xth sentence in Python, see the similar question on Stack Overflow: https://stackoverflow.com/questions/3412316/
