python - Finding similar text in a string in Python

I have a txt file that contains the following text:

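  Contents

  Preface 1

  Chapter 1: Tokenizing Text and WordNet Basics 7

  Tokenizing text into sentences 8

  Tokenizing sentences into words 10

  Tokenizing sentences using regular expressions 12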

Suppose my string is:

input = "Tokenzing sentence using expressions"

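I thought about using the first and last words to extract the line, but many lines share the same words.

So what is the best way to get this output?

Tokenizing sentences using regular expressions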

Best answer

If you are prepared to preprocess the chapter titles to strip page numbers and other clutter, you can do this:

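import difflib

# Chapter titles, already cleaned of page numbers.
contents = ["Tokenizing Text and WordNet Basics",
            "Tokenizing text into sentences",
            "Tokenizing sentences into words",
            "Tokenizing sentences using regular expressions"]
# Note: the name `input` shadows the built-in input() function.
input = "Tokenzing sentence using expressions"
# n=1 returns only the single best match (as a list).
print(difflib.get_close_matches(input, contents, n=1))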

This gives you the following output:

['Tokenizing sentences using regular expressions']
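
The preprocessing step mentioned above is not shown in the answer. One possible way to do it is sketched below, assuming each line of the file holds one title followed by an optional page number. The helper names `clean_title` and `find_closest_title` are hypothetical, and the regular expression is only one of several ways to strip the number.

```python
import difflib
import re


def clean_title(line):
    """Strip surrounding whitespace and a trailing page number (assumed format)."""
    return re.sub(r"\s*\d+\s*$", "", line.strip())


def find_closest_title(query, lines, cutoff=0.6):
    """Return the cleaned title most similar to query, or None if nothing is close."""
    # Clean every line and drop the ones that end up empty.
    titles = [title for title in (clean_title(line) for line in lines) if title]
    matches = difflib.get_close_matches(query, titles, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# In practice these lines would come from the txt file, e.g. open(path).readlines().
raw_lines = [
    "Contents",
    "",
    "Preface 1",
    "Chapter 1: Tokenizing Text and WordNet Basics 7",
    "Tokenizing text into sentences 8",
    "Tokenizing sentences into words 10",
    "Tokenizing sentences using regular expressions 12",
]

print(find_closest_title("Tokenzing sentence using expressions", raw_lines))
```

The `cutoff` argument is passed straight through to `difflib.get_close_matches`, whose default is 0.6. Raising it makes matching stricter. The comparison is case-sensitive, so lowercasing both the query and the titles first may help if the input casing varies.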