I want to group similar lines of text in a log file.
Example input:
user_id:1234 image_id:1234 seq:1: failed to upload data
user_id:12 image_id:234 seq:2: failed to upload data
user_id:34 image_id:123 seq:3: failed to upload data
fail processing data for user_id:12 image_23
fail processing data for user_id:12 image_23
Expected output:
user_id:____ image_id:____ seq:_ failed to upload data -> 3
fail processing data for user_id:__ image___ -> 2
What I tried is using Python's difflib.SequenceMatcher (pseudocode):
from difflib import SequenceMatcher

def group_errors(errors, similarity_threshold=0.6):
    errors_map = {}  # pattern -> occurrence count
    sms = {}         # pattern -> cached SequenceMatcher
    for err in errors:
        for pattern in list(errors_map):
            # SequenceMatcher caches information gathered about its second
            # sequence, so keep one matcher per pattern and only swap seq1.
            sms.setdefault(pattern, SequenceMatcher(b=pattern, autojunk=False))
            s = sms[pattern]
            s.set_seq1(err)
            # quick_ratio() is a cheap upper bound on ratio(); use it to
            # skip clearly dissimilar patterns early.
            if s.quick_ratio() <= similarity_threshold:
                continue
            matching_blocks = s.get_matching_blocks()
            # Merge only if the first matching block sits at the beginning
            # of both strings and is longer than 10 characters.
            if (matching_blocks[0].a == 0 and matching_blocks[0].b == 0
                    and matching_blocks[0].size > 10):
                # Reconstruct the whole string, replacing each non-matching
                # span with underscores.
                mblocks = []
                prev_a = prev_l = 0
                for a, _b, l in matching_blocks:
                    if l > 0:
                        if prev_l > 0:
                            len_non_matching = len(err[prev_a + prev_l:a])
                            mblocks.append('_' * len_non_matching)
                        mblocks.append(err[a:a + l])
                        prev_a, prev_l = a, l
                merged = ''.join(mblocks)
                # Re-key the group under the merged pattern and count it.
                count = errors_map.pop(pattern) + 1
                sms.pop(pattern, None)
                errors_map[merged] = count
                break
        else:
            # No existing pattern matched closely enough: start a new group.
            errors_map[err] = 1
    return errors_map
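For reference, a minimal driver for the snippet above, run on the sample lines; the 0.6 default threshold is just an illustrative choice, not a tuned value:

errors = [
    "user_id:1234 image_id:1234 seq:1: failed to upload data",
    "user_id:12 image_id:234 seq:2: failed to upload data",
    "user_id:34 image_id:123 seq:3: failed to upload data",
    "fail processing data for user_id:12 image_23",
    "fail processing data for user_id:12 image_23",
]
for pattern, count in group_errors(errors).items():
    print(pattern, '->', count)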
The results are not that good. I'm wondering whether there is a better approach, or a library that already does this?
Best answer
One approach is to cluster the strings and then search for the Longest Common Subsequence within each cluster.
In the general case you can cluster using Levenshtein distance (with k-means or DBSCAN, depending on your assumptions). Computing the LCS of all the strings in a cluster is NP-hard if you have no further assumptions about your strings.
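A minimal sketch of that clustering step, assuming scikit-learn is available; the eps value and the plain dynamic-programming Levenshtein implementation are illustrative choices, not part of the original answer:

import numpy as np
from sklearn.cluster import DBSCAN

def levenshtein(s, t):
    # Classic O(len(s) * len(t)) dynamic-programming edit distance.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # substitution
        prev = cur
    return prev[-1]

def cluster_lines(lines, eps=10, min_samples=1):
    # Precompute the pairwise distance matrix; DBSCAN then groups lines
    # whose edit distance to some neighbour is at most eps.
    n = len(lines)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = levenshtein(lines[i], lines[j])
    labels = DBSCAN(eps=eps, min_samples=min_samples,
                    metric="precomputed").fit(dist).labels_
    clusters = {}
    for label, line in zip(labels, lines):
        clusters.setdefault(label, []).append(line)
    return list(clusters.values())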
A rougher approximation would look only at tokens (",", "1234", ...) and use a set-based distance such as the Jaccard index. Then, relaxing the LCS requirement, you can look for the most common tokens within each cluster.
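A minimal sketch of that token-based variant, stdlib only; the 0.3 threshold, whitespace tokenization, and the equal-token-count requirement are simplifying assumptions:

from collections import Counter

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def group_by_tokens(lines, threshold=0.3):
    # Greedy single-pass clustering on the Jaccard similarity of token
    # sets; each new line is compared against the first member of each
    # existing cluster.
    clusters = []  # each cluster is a list of token lists
    for line in lines:
        toks = line.split()
        for cluster in clusters:
            if (len(toks) == len(cluster[0])
                    and jaccard(toks, cluster[0]) >= threshold):
                cluster.append(toks)
                break
        else:
            clusters.append([toks])

    results = []
    for cluster in clusters:
        # Relaxed LCS: keep a token only if every line in the cluster
        # agrees on it at that position, otherwise mask it.
        merged = []
        for column in zip(*cluster):
            token, count = Counter(column).most_common(1)[0]
            merged.append(token if count == len(cluster) else '_' * len(token))
        results.append((' '.join(merged), len(cluster)))
    return results

On the question's sample input this collapses the five lines into two patterns with counts 3 and 2, though it masks whole tokens (e.g. all of "user_id:1234") rather than only the varying digits.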
Original question: "Algorithm to merge similar text while keeping the same order (e.g. deduplicating log files)" on Stack Overflow: https://stackoverflow.com/questions/51663653/