python - Searching for word patterns in a Python dataset

I hope I can explain this clearly. I am still experimenting with Python, in case the question below seems naive.
Suppose I have a dataset of the form:

a = ( ('309','308','308'), ('309','308','307'), ('308', '309','306', '304'))

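Let me call each tuple, such as ('309','308','308'), a path.
I want to find:
a. Count('309', '308', <any word>)
b. Count('309', <any word>, '308')
and all possible permutations.
I think some kind of regular expression could help with this search. Also, I have 50,000 paths.
Can anyone suggest how to do this in Python? I looked into tries, but I don't think they will help me.
Thanks,
Sagar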

Best answer

You can use collections.Counter to do this:

>>> from collections import Counter
>>> a = ( ('309','308','308'), ('309','308','307'), ('308', '309','306', '304'))
>>> Counter((x, y) for (x, y, *z) in a)
Counter({('309', '308'): 2, ('308', '309'): 1})
>>> Counter((x, z) for (x, y, z, *w) in a)
Counter({('308', '306'): 1, ('309', '308'): 1, ('309', '307'): 1})

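I am also using extended tuple unpacking here, which does not exist before Python 3.x and is only needed when the tuples are of uncertain length. In Python 2.x, you can do this instead: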

Counter((item[0], item[1]) for item in a)

I can't say how efficient this is, though I don't think it should be bad: it makes a single pass over the paths.
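If you want to check the cost for yourself, a rough timing sketch along these lines may help. The 50,000 paths here are synthetic random data (the vocabulary and the length range are assumptions, not taken from the question), so treat the timing only as a ballpark for your own machine.

```python
import random
import time
from collections import Counter

random.seed(0)  # fixed seed so repeated runs use the same synthetic data

# Hypothetical vocabulary and path lengths, chosen only for illustration.
words = [str(n) for n in range(300, 310)]
paths = [tuple(random.choice(words) for _ in range(random.randint(2, 5)))
         for _ in range(50000)]

t0 = time.perf_counter()
counts = Counter((p[0], p[1]) for p in paths)  # one pass over all paths
elapsed = time.perf_counter() - t0

print(len(paths), "paths counted in", round(elapsed, 4), "seconds")
```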
A Counter has dict-like syntax:

>>> count = Counter((x, y) for (x, y, *z) in a)
>>> count['309', '308']
2
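Two other Counter behaviors are useful here, shown in the short sketch below: looking up a key that was never counted returns 0 rather than raising KeyError, and most_common() returns the entries sorted by count. The sketch uses plain indexing so that it runs on Python 2.7 and 3.x alike.

```python
from collections import Counter

a = (('309', '308', '308'), ('309', '308', '307'), ('308', '309', '306', '304'))

# Count the first two elements of every path.
count = Counter((item[0], item[1]) for item in a)

print(count['309', '308'])    # 2
print(count['307', '307'])    # 0: a key that was never seen counts as zero
print(count.most_common(1))   # [(('309', '308'), 2)]
```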

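Edit: You mentioned that the paths can vary in length. In that case you may run into problems, because any path shorter than the required length will fail to unpack. The solution is to change the generator expression so that it skips any path that is not in the required format: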

Counter((item[0], item[1]) for item in a if len(item) >= 2)

For example:

>>> a = ( ('309',), ('309','308','308'), ('309','308','307'), ('308', '309','306', '304'))
>>> Counter((x, y) for (x, y, *z) in a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.2/collections.py", line 460, in __init__
    self.update(iterable, **kwds)
  File "/usr/lib/python3.2/collections.py", line 540, in update
    _count_elements(self, iterable)
  File "<stdin>", line 1, in <genexpr>
ValueError: need more than 1 value to unpack
>>> Counter((item[0], item[1]) for item in a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.2/collections.py", line 460, in __init__
    self.update(iterable, **kwds)
  File "/usr/lib/python3.2/collections.py", line 540, in update
    _count_elements(self, iterable)
  File "<stdin>", line 1, in <genexpr>
IndexError: tuple index out of range
>>> Counter((item[0], item[1]) for item in a if len(item) >= 2)
Counter({('309', '308'): 2, ('308', '309'): 1})

If you need variable-length counts, the simplest way is to use slicing:

start = 0
end = 2
# Keep only the paths long enough to fill the whole slice item[start:end].
Counter(item[start:end] for item in a if len(item) >= end)
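As a quick check of the slicing approach with a non-zero start, here is a small sketch that counts the second and third elements of each path. The variable name window_counts is just for illustration. Note that a path only needs to reach index end for the slice to be complete.

```python
from collections import Counter

a = (('309',), ('309', '308', '308'), ('309', '308', '307'), ('308', '309', '306', '304'))

start = 1
end = 3
# Paths with fewer than `end` elements would give a short slice, so skip them.
window_counts = Counter(item[start:end] for item in a if len(item) >= end)

print(window_counts[('308', '308')])  # 1
print(window_counts[('309', '306')])  # 1
```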

Of course, this only works for contiguous runs. If you want to pick out individual columns, you need to do a little more work:

def pick(seq, indices):
    return tuple([seq[i] for i in indices])

columns = [1, 3]
maximum = max(columns)
Counter(pick(item, columns) for item in a if len(item) > maximum)
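To answer the wildcard form of the question directly, one possible sketch is a small matcher in which a marker value stands for <any word>. The names ANY, matches and count_matches are hypothetical helpers, and the sketch assumes a pattern is anchored at the start of a path, as in the examples above.

```python
ANY = None  # hypothetical wildcard marker standing for <any word>

def matches(path, pattern):
    """Return True if path starts with pattern, with ANY matching any word."""
    if len(path) < len(pattern):
        return False  # path too short to fill every position of the pattern
    return all(p is ANY or p == word for p, word in zip(pattern, path))

def count_matches(paths, pattern):
    """Count the paths that match the pattern."""
    return sum(1 for path in paths if matches(path, pattern))

a = (('309', '308', '308'), ('309', '308', '307'), ('308', '309', '306', '304'))

print(count_matches(a, ('309', '308', ANY)))  # 2
print(count_matches(a, ('309', ANY, '308')))  # 1
```

This scans every path once per query. If you need many different patterns over the same 50,000 paths, precomputing a Counter per set of columns, as above, is likely the better trade-off.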

Regarding "python - Searching for word patterns in a Python dataset", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/10243428/