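I have some text scraped from a PDF. I have parsed it, and everything currently sits in a list as strings. I want to join sentences that were returned as separate strings because of page breaks in the PDF. For example: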

list = ['I am a ', 'sentence.', 'Please join me toge-', 'ther. Thanks for your help.']


I would like to have:

list = ['I am a sentence.', 'Please join me together. Thanks for your help.']


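I currently have the following code. It joins some of the sentences, but the second fragment that was joined to the first is still returned. I know this is due to the indexing, but I am not sure how to fix it.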

new = []

def cleanlist(dictlist):
    for i in range(len(dictlist)):
        if i>0:
            if dictlist[i-1][-1:] != ('.') or dictlist[i-1][-1:] != ('. '):
                new.append(dictlist[i-1]+dictlist[i])
            elif dictlist[i-1][-1:] == '-':
                new.append(dictlist[i-1]+dictlist[i])
            else:
                new.append[dict_list[i]]

Best answer

You can use a generator:

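def cleanlist(dictlist):
    """Yield complete strings, joining fragments split across page breaks."""
    current = []
    for line in dictlist:
        if line.endswith("-"):
            # Hyphenated word break: drop the hyphen and keep collecting.
            current.append(line[:-1])
        elif line.endswith(" "):
            # Trailing space: the sentence continues in the next fragment.
            current.append(line)
        else:
            # Anything else ends the current run, so emit the joined text.
            current.append(line)
            yield "".join(current)
            current = []
    if current:
        # Emit leftover fragments if the last line looked unfinished.
        yield "".join(current)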


Use it like this:

dictlist = ['I am a ', 'sentence.', 'Please join me toge-', 'ther. Thanks for your help.']
print(list(cleanlist(dictlist)))
# ['I am a sentence.', 'Please join me together. Thanks for your help.']
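As for why the code in the question misbehaves: its first condition appears to be true for every fragment, so the loop appends every adjacent pair `dictlist[i-1] + dictlist[i]`, including pairs that overlap. The slice `[-1:]` returns at most one character, so it can never equal `'. '`, and `x != a or x != b` holds for any `x` when `a` and `b` differ. The sketch below only illustrates this. The helper names `original_condition` and `intended_condition` are hypothetical, and the "intended" check is one guess at what was meant, not something stated in the question.

```python
def original_condition(prev):
    # The test from the question's first branch, copied as written.
    # prev[-1:] is at most one character, so the second comparison
    # is always True, which makes the whole "or" always True.
    return prev[-1:] != ('.') or prev[-1:] != ('. ')


def intended_condition(prev):
    # Assumed intent (a guess): the previous fragment is unfinished
    # unless it ends with a period, ignoring trailing spaces.
    return not prev.rstrip().endswith('.')


samples = ['I am a ', 'sentence.', 'Please join me toge-', 'Done. ']
print([original_condition(s) for s in samples])
# [True, True, True, True]
print([intended_condition(s) for s in samples])
# [True, False, True, False]
```

Because the first branch always fires, the `elif` and `else` branches are never reached, which is also why the typo in the last line of the question's code never raises an error.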

Regarding python - joining sentences together from a parsed PDF, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50337660/
