Problem description
I know how to remove duplicate lines and duplicate characters from text, but I'm trying to accomplish something more complicated in python3. I have text files that might or might not contain groups of lines that are duplicated within each text file. I want to write a python utility that will find these duplicate blocks of lines and remove all but the first one found.
For example, suppose file1
contains this data:
Now is the time
for all good men
to come to the aid of their party.
This is some other stuff.
And this is even different stuff.
Now is the time
for all good men
to come to the aid of their party.
Now is the time
for all good men
to come to the aid of their party.
That's all, folks.
I want the following to be the result of this transformation:
Now is the time
for all good men
to come to the aid of their party.
This is some other stuff.
And this is even different stuff.
That's all, folks.
I also want this to work when the duplicate groups of lines are found starting somewhere other than at the beginning of the file. Suppose file2
looks like this:
This is some text.
This is some other text,
as is this.
All around
the mulberry bush
the monkey chased the weasel.
Here is some more random stuff.
All around
the mulberry bush
the monkey chased the weasel.
... and this is another phrase.
All around
the mulberry bush
the monkey chased the weasel.
End
For file2
, this should be the result of the transformation:
This is some text.
This is some other text,
as is this.
All around
the mulberry bush
the monkey chased the weasel.
Here is some more random stuff.
... and this is another phrase.
End
To be clear, the potentially duplicated groups of lines are not known before running this utility. The algorithm has to identify these duplicated groups of lines on its own.
I'm sure that with enough work and enough time, I can eventually come up with the algorithm I'm looking for. But I'm hoping that someone might have already solved this problem and posted the results somewhere. I have been searching and haven't found anything, but perhaps I have overlooked something.
ADDENDUM: I need to add more clarity. The duplicated groups of lines must be maximal (the largest such groups), and each group must contain a minimum of 2 lines.
For example, suppose file3
looks like this:
line1 line1 line1
line2 line2 line2
line3 line3 line3
other stuff
line1 line1 line1
line3 line3 line3
line2 line2 line2
In this case, the desired algorithm will not remove any lines.
As another example, suppose file4
looks like this:
abc def ghi
jkl mno pqr
line1 line1 line1
line2 line2 line2
line3 line3 line3
abc def ghi
line1 line1 line1
line2 line2 line2
line3 line3 line3
line4 line4 line4
qwerty
line1 line1 line1
line2 line2 line2
line3 line3 line3
line4 line4 line4
asdfghj
line1 line1 line1
line2 line2 line2
line3 line3 line3
lkjhgfd
line2 line2 line2
line3 line3 line3
line4 line4 line4
wxyz
The result I'm looking for is:
abc def ghi
jkl mno pqr
line1 line1 line1
line2 line2 line2
line3 line3 line3
abc def ghi
line1 line1 line1
line2 line2 line2
line3 line3 line3
line4 line4 line4
qwerty
asdfghj
line1 line1 line1
line2 line2 line2
line3 line3 line3
lkjhgfd
line2 line2 line2
line3 line3 line3
line4 line4 line4
wxyz
In other words, since the 4-line group (with "line1 ... line2 ... line3 ... line4 ...") is the largest one that is duplicated, that is the only group that is removed.
If I then want the smaller duplicated groups to be removed as well, I can always repeat the process until the file is unchanged.
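The repeat-until-unchanged idea is just a fixed-point iteration. Here is a minimal sketch of that loop; the names `until_stable` and `drop_one_adjacent_dup` are illustrative only, and the toy single-pass function merely collapses adjacent duplicate lines, standing in for a real "remove the longest duplicated group" pass:

```python
def until_stable(transform, text):
    # Apply transform repeatedly until the text stops changing.
    while True:
        new_text = transform(text)
        if new_text == text:
            return text
        text = new_text

def drop_one_adjacent_dup(text):
    # Toy single-pass transform: remove the second copy of the
    # first adjacent pair of identical lines, if any.
    lines = text.split('\n')
    for i in range(len(lines) - 1):
        if lines[i] and lines[i] == lines[i + 1]:
            return '\n'.join(lines[:i + 1] + lines[i + 2:])
    return text

print(until_stable(drop_one_adjacent_dup, 'a\na\na\nb\nb\nc'))
# -> a
#    b
#    c
```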
Recommended answer
I came up with the following solution. It might still have some unaccounted-for edge cases, and it might not be the most efficient way to do this, but at least after my preliminary testing, it seems to work.
This repost already fixes some bugs in my originally submitted version.
Any suggestions for improvement are welcome.
# Remove all but the first occurrence of the longest
# duplicated group of lines from a block of text.
# In this utility, a "group" of lines is considered
# to be two or more consecutive lines.
#
# Much of this code has been shamelessly stolen from
# https://programmingpraxis.com/2010/12/14/longest-duplicated-substring/

import sys
from itertools import starmap, takewhile, tee
from operator import eq, truth

# imap and izip no longer exist in python3 itertools.
# These are simply equivalent to map and zip in python3.
try:
    # python2 ...
    from itertools import imap
except ImportError:
    # python3 ...
    imap = map

try:
    # python2 ...
    from itertools import izip
except ImportError:
    # python3 ...
    izip = zip


def remove_longest_dup_line_group(text):
    if not text:
        return ''
    # Unlike in the original code, here we're dealing
    # with groups of whole lines instead of strings
    # (groups of characters). So we split the incoming
    # data into a list of lines, and we then apply the
    # algorithm to these lines, treating a line in the
    # same way that the original algorithm treats an
    # individual character.
    lines = text.split('\n')
    ld = longest_duplicate(lines)
    if not ld:
        return text
    tokens = text.split(ld)
    if len(tokens) < 1:
        # Defensive programming: this shouldn't ever happen,
        # but just in case ...
        return text
    # Keep the first occurrence of the duplicated group
    # and drop all of the others.
    return '{}{}{}'.format(tokens[0], ld, ''.join(tokens[1:]))


def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)


def prefix(a, b):
    count = sum(takewhile(truth, imap(eq, a, b)))
    if count < 2:
        # Blocks must consist of more than one line.
        return ''
    else:
        return '{}\n'.format('\n'.join(a[:count]))


def longest_duplicate(s):
    if len(s) < 2:
        # With fewer than two lines there can be no duplicated
        # group, and max() below would fail on an empty sequence.
        return ''
    suffixes = (s[n:] for n in range(len(s)))
    return max(starmap(prefix, pairwise(sorted(suffixes))), key=len)


if __name__ == '__main__':
    text = sys.stdin.read()
    if text:
        # Use sys.stdout.write instead of print to
        # avoid adding an extra newline at the end.
        sys.stdout.write(remove_longest_dup_line_group(text))
    sys.exit(0)
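As a quick sanity check, the approach can be exercised inline on the file2 sample from the question. The following is a condensed, python3-only restatement of the same functions (the python2 shims are dropped); it is a sketch for testing, not a replacement for the script:

```python
from itertools import starmap, takewhile, tee
from operator import eq, truth

def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def prefix(a, b):
    # Length of the common prefix of two suffix lists;
    # groups shorter than 2 lines don't count.
    count = sum(takewhile(truth, map(eq, a, b)))
    return '{}\n'.format('\n'.join(a[:count])) if count >= 2 else ''

def longest_duplicate(lines):
    if len(lines) < 2:
        return ''
    suffixes = sorted(lines[n:] for n in range(len(lines)))
    return max(starmap(prefix, pairwise(suffixes)), key=len)

def remove_longest_dup_line_group(text):
    ld = longest_duplicate(text.split('\n'))
    if not ld:
        return text
    tokens = text.split(ld)
    # Keep the first occurrence, drop the rest.
    return '{}{}{}'.format(tokens[0], ld, ''.join(tokens[1:]))

file2 = (
    'This is some text.\n'
    'This is some other text,\n'
    'as is this.\n'
    'All around\n'
    'the mulberry bush\n'
    'the monkey chased the weasel.\n'
    'Here is some more random stuff.\n'
    'All around\n'
    'the mulberry bush\n'
    'the monkey chased the weasel.\n'
    '... and this is another phrase.\n'
    'All around\n'
    'the mulberry bush\n'
    'the monkey chased the weasel.\n'
    'End\n'
)

print(remove_longest_dup_line_group(file2))
```

Only the first "All around / the mulberry bush / the monkey chased the weasel." block survives, matching the expected output shown earlier.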