Problem description
I have a dataframe like:
animal ids
cat 1,3,4
dog 1,2,4
hamster 5
dolphin 3,5
The dataframe is quite big, with over 80 thousand rows, and the ids column can easily contain thousands, even 10 thousand, comma-separated ids. The ids within a given row are unique in the comma-separated string.
I would like to construct a dataframe that holds the Jaccard index for every pair of animals, i.e. the size of the intersection of their ids divided by the size of the union of their ids.
So if we look at cat and dog, the intersection has 2 ids (1 and 4) and the union has 4 ids (1, 2, 3, 4), hence the Jaccard index is 2/4 = 0.5. It would be great to have the dataset in this format:
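The cat and dog arithmetic above can be checked with plain Python sets. This is only a small sketch: the helper name jaccard is made up for illustration, and returning 0.0 for two empty sets is an assumed convention, not something the question specifies.

```python
# Id sets taken from the example rows for cat and dog
cat = {1, 3, 4}
dog = {1, 2, 4}


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = a | b
    # Assumed convention: two empty sets get an index of 0.0
    return len(a & b) / len(union) if union else 0.0


print(sorted(cat & dog))   # [1, 4]
print(sorted(cat | dog))   # [1, 2, 3, 4]
print(jaccard(cat, dog))   # 0.5
```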
         cat   dog  hamster  dolphin
cat      1     0.5  0        0.25
dog      0.5   1    0        0
hamster  0     0    1        0.5
dolphin  0.25  0    0.5      1
That is, the animal names serve as the row index (and as the column labels), so that I can quickly look up the Jaccard index for any pair, e.g.:
cat_dog_ji = df_new['cat']['dog']
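To make the lookup concrete, here is a small sketch that builds df_new by hand from the values in the desired table above (the real frame would of course be computed, not typed in). It also shows an equivalent single-step .loc lookup; because the matrix is symmetric, the row/column order does not change the value.

```python
import pandas as pd

names = ["cat", "dog", "hamster", "dolphin"]
# Values copied by hand from the desired table, for illustration only
df_new = pd.DataFrame(
    [
        [1, 0.5, 0, 0.25],
        [0.5, 1, 0, 0],
        [0, 0, 1, 0.5],
        [0.25, 0, 0.5, 1],
    ],
    index=names,
    columns=names,
)

# Chained lookup: selects column "cat" first, then row "dog"
cat_dog_ji = df_new["cat"]["dog"]
# Single-step label lookup of the same cell (row "dog", column "cat")
same = df_new.loc["dog", "cat"]
print(cat_dog_ji, same)
```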
You can use str.get_dummies and some scipy tools here.
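import pandas as pd
from scipy.spatial import distance

# One 0/1 indicator column per distinct id
u = df["ids"].str.get_dummies(",")
# Condensed vector of pairwise Jaccard distances between rows
j = distance.pdist(u, "jaccard")
k = df["animal"].to_numpy()
# Jaccard index = 1 - Jaccard distance; label rows and columns by animal
pd.DataFrame(1 - distance.squareform(j), index=k, columns=k)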
          cat  dog  hamster  dolphin
cat      1.00  0.5      0.0     0.25
dog      0.50  1.0      0.0     0.00
hamster  0.00  0.0      1.0     0.50
dolphin  0.25  0.0      0.5     1.00
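Since the question mentions a large frame with many distinct ids, a dense indicator table may become heavy. The following is a sketch of an alternative, not the method from the answer above: it builds a sparse 0/1 indicator matrix and obtains intersection sizes from a matrix product. It assumes a default RangeIndex, no missing or empty ids strings, and unique ids within each row (as the question states). The pairwise result is still a dense n × n array, so with 80 thousand rows the output alone would have 6.4 billion cells; this only avoids the dense indicator table.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Sample data from the question
df = pd.DataFrame({
    "animal": ["cat", "dog", "hamster", "dolphin"],
    "ids": ["1,3,4", "1,2,4", "5", "3,5"],
})

# One entry per (animal, id) pair; the index repeats the source row position
ids = df["ids"].str.split(",").explode().str.strip()
row_pos = ids.index.to_numpy()          # assumes a default RangeIndex
col_pos, uniques = pd.factorize(ids)    # integer code per distinct id

# Sparse 0/1 indicator matrix: rows are animals, columns are distinct ids
m = sparse.csr_matrix(
    (np.ones(len(ids), dtype=np.int64), (row_pos, col_pos)),
    shape=(len(df), len(uniques)),
)

inter = (m @ m.T).toarray()             # pairwise intersection sizes
sizes = inter.diagonal()                # number of ids per animal
union = sizes[:, None] + sizes[None, :] - inter
sim = inter / union                     # assumes every row has at least one id

k = df["animal"].to_numpy()
df_new = pd.DataFrame(sim, index=k, columns=k)
print(df_new)
```

On the sample data this gives the same values as the table above, for example 0.5 for cat and dog.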