How to get a set of unique values from a list of repeated values

Problem description

I need to parse a large log file (a flat file) that contains two columns of values (column-A, column-B).

Values in both columns repeat. For each unique value in column-A, I need to find the set of column-B values that appear with it.

Can this be done with a Unix shell command, or do I need to write a Perl or Python script? What are the ways to do it?

xxxA 2
xxxA 1
xxxB 2
XXXC 3
XXXA 3
xxxD 4

Output:

xxxA - 2,1,3
xxxB - 2
xxxC - 3
xxxD - 4

Recommended answer

I would use a Python dictionary in which the keys are the column A values and each value is a set (Python's built-in set type) holding the column B values seen for that key.

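def parse_the_file():
    # Bind the str methods to local names once, outside the loop.
    lower = str.lower
    split = str.split
    d = {}
    with open('f.txt') as f:
        # Iterate line by line instead of reading the whole file into memory.
        for line in f:
            fields = split(line)
            if len(fields) != 2:
                continue  # skip blank or malformed lines
            A, B = fields
            try:
                d[lower(A)].add(B)
            except KeyError:
                # set([B]) stores the whole string; set(B) would split it into characters.
                d[lower(A)] = set([B])

    for a in d:
        print("%s - %s" % (a, ",".join(d[a])))

if __name__ == "__main__":
    parse_the_file()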

The advantage of using a dictionary is that there is exactly one key per column A value. The advantage of using a set is that each key ends up with a collection of unique column B values, with duplicates dropped automatically.
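As a variation on the same idea, here is a minimal sketch that uses collections.defaultdict(set) from the standard library, so no try/except is needed. The function name group_unique and the list-of-lines input are assumptions made for illustration; they are not part of the original answer.

```python
from collections import defaultdict


def group_unique(lines):
    """Map each lower-cased column-A value to the set of its column-B values."""
    groups = defaultdict(set)  # a missing key starts out as an empty set
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        a, b = fields
        groups[a.lower()].add(b)
    return dict(groups)


sample = ["xxxA 2", "xxxA 1", "xxxB 2", "XXXC 3", "XXXA 3", "xxxD 4"]
result = group_unique(sample)
for key in sorted(result):
    # sorted() makes the printed order reproducible; sets themselves are unordered
    print("%s - %s" % (key, ",".join(sorted(result[key]))))
```

For a real log file, an open file object can be passed in place of the list, since the function only needs something that yields lines.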

Notes on efficiency:

Using try/except is more efficient than checking for the initial case with an if/else statement, provided most lookups hit an existing key (an exception is only costly when it is actually raised).
Looking up the str functions once and assigning them to local names outside the loop is more efficient than resolving them inside the loop on every iteration.
Depending on the proportion of new A values to repeated A values in the file, you may consider computing a = lower(A) before the try/except statement, so that lower() is not called a second time when the key is missing.
I used a function because accessing local variables is more efficient in Python than accessing global variables.
Some of these performance tips are from here.
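To make the trade-off in these notes concrete, the sketch below writes the same grouping step three ways (try/except, an explicit membership test, and dict.setdefault) and checks that they agree. Which one is fastest depends on how often a key is new, so it is worth measuring with timeit on real data rather than assuming. The function names are hypothetical.

```python
def with_try(pairs):
    d = {}
    for a, b in pairs:
        key = a.lower()  # lower-case once, before the try, so a miss does not repeat it
        try:
            d[key].add(b)
        except KeyError:
            d[key] = {b}
    return d


def with_if(pairs):
    d = {}
    for a, b in pairs:
        key = a.lower()
        if key in d:
            d[key].add(b)
        else:
            d[key] = {b}
    return d


def with_setdefault(pairs):
    d = {}
    for a, b in pairs:
        # setdefault returns the existing set, or stores and returns a new empty one
        d.setdefault(a.lower(), set()).add(b)
    return d


pairs = [("xxxA", "2"), ("xxxA", "1"), ("xxxB", "2"),
         ("XXXC", "3"), ("XXXA", "3"), ("xxxD", "4")]
assert with_try(pairs) == with_if(pairs) == with_setdefault(pairs)
```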

Testing the code above on your input example yields the following (the exact order can vary, since sets, and dicts in older Python versions, are unordered):

xxxd - 4
xxxa - 1,3,2
xxxb - 2
xxxc - 3
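Because sets are unordered, the values for xxxa above come out as 1,3,2 rather than the 2,1,3 shown in the question. If first-seen order matters, one possible sketch uses an inner dict as an ordered set. It assumes Python 3.7 or later, where plain dicts preserve insertion order, and the name group_in_order is hypothetical.

```python
def group_in_order(lines):
    """Group column-B values by lower-cased column-A value, keeping first-seen order."""
    groups = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        a, b = fields
        # The inner dict acts as an ordered set: keys are unique and keep insertion order.
        groups.setdefault(a.lower(), {})[b] = None
    return {a: list(bs) for a, bs in groups.items()}


# The last line repeats an earlier pair to show that duplicates are dropped.
sample = ["xxxA 2", "xxxA 1", "xxxB 2", "XXXC 3", "XXXA 3", "xxxD 4", "xxxA 2"]
for a, bs in group_in_order(sample).items():
    print("%s - %s" % (a, ",".join(bs)))
```

On the sample input this prints xxxa - 2,1,3 followed by xxxb - 2, xxxc - 3 and xxxd - 4, matching the order requested in the question apart from the lower-cased keys.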
