I want to group similar lines of text in a log file.
Sample input:

user_id:1234 image_id:1234 seq:1: failed to upload data
user_id:12 image_id:234 seq:2: failed to upload data
user_id:34 image_id:123 seq:3: failed to upload data
fail processing data for user_id:12 image_23
fail processing data for user_id:12 image_23

Expected output:
user_id:____ image_id:____ seq:_ failed to upload data -> 3
fail processing data for user_id:__ image___ -> 2

What I tried was using Python's SequenceMatcher (pseudocode):
from difflib import SequenceMatcher

similarity_threshold = 0.7  # some threshold
errors_map = {}             # known pattern -> count
sms = {}                    # one cached matcher per known pattern

for err in errors:          # errors: the raw log lines
    for pattern in errors_map.keys():
        # SequenceMatcher caches information gathered about its second sequence:
        sms.setdefault(pattern, SequenceMatcher(b=pattern, autojunk=False))
        s = sms[pattern]
        s.set_seq1(err)

        # check some threshold (quick_ratio is a cheap upper bound on ratio)
        if s.quick_ratio() <= similarity_threshold:
            continue
        matching_blocks = s.get_matching_blocks()

        # if ratio >= similarity_threshold,
        # and the first matching block is located at the beginning of both strings,
        # and the size of the first matching block is > 10,
        # construct the whole string & replace each non-matching span with '_'
        if matching_blocks[0].a == 0 and matching_blocks[0].b == 0 and matching_blocks[0].size > 10:
            mblocks = []
            prev_a = prev_l = 0
            for a, b, l in matching_blocks:
                if l > 0:
                    if prev_l > 0:
                        # pad the gap between consecutive matching blocks
                        len_non_matching = len(err[prev_a + prev_l:a])
                        mblocks.append('_' * len_non_matching)
                    mblocks.append(err[a:a + l])
                    prev_a = a
                    prev_l = l
            mblocks = ''.join(mblocks)
            # ... then update errors_map with the merged pattern and its count

The results aren't that good. I'm wondering whether there is a better approach, or a library that already does this?

Best Answer

One approach is to cluster the strings, then search within each cluster for the Longest Common Subsequence.
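For a single pair of strings, the LCS can be found with the classic quadratic dynamic program (it is the multi-string generalization that is hard, as noted below). A minimal sketch; the lcs helper is mine, not part of the answer:

def lcs(a: str, b: str) -> str:
    """Longest common subsequence of a and b via dynamic programming."""
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    # walk back through the table to recover the subsequence itself
    out, i, j = [], len(a), len(b)
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return ''.join(reversed(out))

Folding lcs over a cluster's members, e.g. lcs(lcs(line1, line2), line3), gives a usable approximation of the cluster's common template.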
In the general case you can use the Levenshtein distance for the clustering (with k-means or DBSCAN, depending on your assumptions). Computing the LCS is NP-hard if you have no other assumptions about your strings.
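As a minimal sketch of that clustering step, here is a pure-Python edit distance with a simple greedy threshold pass in place of k-means/DBSCAN; the helper names and the 0.4 cutoff are illustrative assumptions:

def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (iterative DP, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cluster(lines, max_rel_dist=0.4):
    """Greedy clustering: attach each line to the first cluster whose
    representative is within max_rel_dist (edit distance / max length)."""
    clusters = []  # list of (representative, members)
    for line in lines:
        for rep, members in clusters:
            if levenshtein(line, rep) / max(len(line), len(rep)) <= max_rel_dist:
                members.append(line)
                break
        else:
            clusters.append((line, [line]))
    return clusters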
A more approximate algorithm would only look at the tokens ("1234", ...) and use a set-based distance (e.g., the Jaccard index). Then, relaxing the LCS requirement, you can look for the most common tokens in each cluster.
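A sketch of this token-based variant on the question's sample input; the whitespace tokenizer, the 0.35 cutoff, and the helper names are all assumptions (a finer tokenizer that also splits on ':' would preserve the user_id/image_id prefixes as in the expected output):

from collections import Counter

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def cluster_by_tokens(lines, min_sim=0.35):
    """Greedy clustering on Jaccard similarity of whitespace token sets."""
    clusters = []  # list of (representative token set, members)
    for line in lines:
        toks = set(line.split())
        for rep_toks, members in clusters:
            if jaccard(toks, rep_toks) >= min_sim:
                members.append(line)
                break
        else:
            clusters.append((toks, [line]))
    return clusters

def template(members):
    """Keep the tokens common to every member of a cluster, in the order
    of the first member; mask the varying tokens with underscores."""
    counts = Counter(tok for line in members for tok in set(line.split()))
    common = {tok for tok, c in counts.items() if c == len(members)}
    return ' '.join(tok if tok in common else '_' * len(tok)
                    for tok in members[0].split())

lines = [
    "user_id:1234 image_id:1234 seq:1: failed to upload data",
    "user_id:12 image_id:234 seq:2: failed to upload data",
    "user_id:34 image_id:123 seq:3: failed to upload data",
    "fail processing data for user_id:12 image_23",
    "fail processing data for user_id:12 image_23",
]
for _, members in cluster_by_tokens(lines):
    print(template(members), '->', len(members))

On the sample input this prints the two groups with counts 3 and 2; each key:value field collapses to underscores because a whole token is either kept or masked.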

Regarding "string - Algorithm for merging similar texts by keeping the same order (e.g., dedupe log files)", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51663653/
