Problem description
I have my data in this format in a pandas dataframe:
Customer_ID Location_ID
Alpha A
Alpha B
Alpha C
Beta A
Beta B
Beta D
I want to study the mobility patterns of the customers. My goal is to determine the clusters of locations that are most frequented by customers. I think the following matrix can provide such information:
A B C D
A 0 2 1 1
B 2 0 1 1
C 1 1 0 0
D 1 1 0 0
How do I do so in Python?
My dataset is quite large (hundreds of thousands of customers and about a hundred locations).
Recommended answer
Here is one approach that takes the multiplicity of visits into account (e.g. if Customer X visits both LocA and LocB twice, that customer contributes 2 to the corresponding position in the final matrix).
Idea:
- For each location, count visits by customer.
- For each pair of locations, sum, over the customers who visited both, the smaller of that customer's two visit counts.
- Use unstack and clean up.
Counter plays nicely here because counters support many natural arithmetic operations, such as addition (+), element-wise minimum (&) and element-wise maximum (|).
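As a small illustration of those operations, the sketch below uses two made-up counters (the customer names and counts are hypothetical, not taken from the question). It suggests that the "sum of minimal visits" for a pair of locations can also be read off the `&` intersection.

```python
from collections import Counter

# Hypothetical visit counts per customer at two locations (illustrative data only).
loc_a = Counter({'Alpha': 2, 'Beta': 1})
loc_b = Counter({'Alpha': 3, 'Gamma': 1})

total = loc_a + loc_b    # per-customer sum of counts
shared = loc_a & loc_b   # per-customer minimum; customers missing from either side drop out
either = loc_a | loc_b   # per-customer maximum

# Overlap between the two locations, counting repeat visits.
overlap = sum(shared.values())

# Same quantity written as an explicit min-sum; a missing key counts as 0.
overlap_explicit = sum(min(loc_a[k], loc_b[k]) for k in loc_a)
```

Here only Alpha visited both locations, so the overlap is min(2, 3) = 2.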
import pandas as pd
from collections import Counter

df = pd.DataFrame({
    'Customer_ID': ['Alpha', 'Alpha', 'Alpha', 'Beta', 'Beta'],
    'Location_ID': ['A', 'B', 'C', 'A', 'B'],
})

# For each location, count visits per customer.
ctrs = {location: Counter(gp.Customer_ID) for location, gp in df.groupby('Location_ID')}
# ctrs is now:
# {'A': Counter({'Alpha': 1, 'Beta': 1}),
#  'B': Counter({'Alpha': 1, 'Beta': 1}),
#  'C': Counter({'Alpha': 1})}

ctrs = list(ctrs.items())

# For each unordered pair of distinct locations, sum the per-customer minimum visit counts.
overlaps = [(loc1, loc2, sum(min(ctr1[k], ctr2[k]) for k in ctr1))
            for i, (loc1, ctr1) in enumerate(ctrs, start=1)
            for (loc2, ctr2) in ctrs[i:] if loc1 != loc2]

# Mirror every pair so the final matrix is symmetric.
overlaps += [(l2, l1, c) for l1, l2, c in overlaps]

df2 = pd.DataFrame(overlaps, columns=['Loc1', 'Loc2', 'Count'])
df2 = df2.set_index(['Loc1', 'Loc2'])
# Pivot Loc2 into columns; the missing diagonal entries become 0.
df2 = df2.unstack().fillna(0).astype(int)
#      Count
# Loc2     A  B  C
# Loc1
# A        0  2  1
# B        2  0  1
# C        1  1  0
If you would like to disregard multiplicities, replace Counter(gp.Customer_ID) with Counter(set(gp.Customer_ID)).
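When multiplicities are disregarded, the same matrix can also be obtained with a different technique from the Counter approach above: build a 0/1 customer-by-location indicator table and multiply it by its own transpose. The sketch below is one possible way to do that, using the six rows from the question. The names `visited` and `co_visits` are illustrative, and the approach assumes a dense table of customers by locations fits in memory, which seems plausible for about a hundred locations.

```python
import numpy as np
import pandas as pd

# The six example rows from the question, including the (Beta, D) visit.
df = pd.DataFrame({
    'Customer_ID': ['Alpha', 'Alpha', 'Alpha', 'Beta', 'Beta', 'Beta'],
    'Location_ID': ['A', 'B', 'C', 'A', 'B', 'D'],
})

# Customer-by-location indicator table: 1 if the customer visited the location at all.
visited = pd.crosstab(df['Customer_ID'], df['Location_ID']).clip(upper=1)

# Entry (i, j) counts the customers who visited both location i and location j.
co_visits = visited.T.dot(visited)

# The diagonal holds the number of distinct customers per location;
# zero it out to match the matrix layout in the question.
values = co_visits.to_numpy().copy()
np.fill_diagonal(values, 0)
co_visits = pd.DataFrame(values, index=co_visits.index, columns=co_visits.columns)
```

On this data the result has rows A through D equal to [0, 2, 1, 1], [2, 0, 1, 1], [1, 1, 0, 0] and [1, 1, 0, 0], which is the matrix shown in the question. This shortcut does not reproduce the min-based counting of repeat visits; for that, the Counter version is still needed.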