Problem description
I am computing the Jaccard similarity between every pair of my m training examples, each with 6 features (Age, Occupation, Gender, Product_range, Product_cat and Product), to form an (m*m) similarity matrix.
The resulting matrix is not what I expect. I have identified the source of the problem but do not have an optimized solution for it.
A sample of the dataset is below:
ID    AGE    Occupation  Gender  Product_range  Product_cat  Product
1100  25-34  IT          M       50-60          Gaming       XPS 6610
1101  35-44  Research    M       60-70          Business     Latitude lat6
1102  35-44  Research    M       60-70          Performance  Inspiron 5810
1103  25-34  Lawyer      F       50-60          Business     Latitude lat5
1104  45-54  Business    F       40-50          Performance  Inspiron 5410
The matrix I get is:
[image: similarity matrix]
Problem Statement:
The value in the red box is the similarity of rows (1104) and (1101) of the sample dataset. The two rows have nothing in common when compared column by column, yet the value is 0.16 because the term "Business" appears in the "Occupation" column of row (1104) and in the "Product_cat" column of row (1101), so the intersection of the two rows has size 1.
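A minimal reproduction of that false match, as a sketch: it assumes each row is stored as a plain set of its six feature values (ID left out) and that tot_len is 6, the number of features. Neither detail is stated explicitly in the question.

```python
# Rows 1101 and 1104 from the sample, stored as plain sets of values.
# Assumption: the ID is not part of the set and tot_len is the feature count.
row_1101 = {"35-44", "Research", "M", "60-70", "Business", "Latitude lat6"}
row_1104 = {"45-54", "Business", "F", "40-50", "Performance", "Inspiron 5410"}
tot_len = 6

# A set has no notion of columns, so "Business" (Product_cat in 1101)
# matches "Business" (Occupation in 1104).
common = row_1101 & row_1104
score = len(common) / tot_len

print(common, score)
```

The single shared element gives 1/6, which is the 0.16 seen in the matrix.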
My code just takes the intersection of the two rows without looking at the columns. How do I change my code to handle this case while keeping the performance equally good?
My code:
import itertools

half_matrix = []
for row1, row2 in itertools.combinations(data_set, r=2):
    # Each row is a set of values, so the column a value came from is lost here.
    intersection = row1.intersection(row2)
    half_matrix.append(float(len(intersection)) / tot_len)
Recommended answer
The simplest way out of this is to add a column-specific prefix to every entry, so that equal values from different columns no longer collide. Example of a parsed row:
row = ["ID:1100", "AGE:25-34", "Occupation:IT", "Gender:M", "Product_range:50-60", "Product_cat:Gaming", "Product:XPS 6610"]
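As a rough sketch of how the prefixing could be wired into the question's loop: the names columns, raw_rows and prefix_row are hypothetical, the ID is left out because unique IDs can never match, and the score keeps the question's intersection / tot_len formula rather than the intersection-over-union form of Jaccard.

```python
import itertools

# Hypothetical column names and sample rows taken from the question's table.
columns = ["AGE", "Occupation", "Gender", "Product_range", "Product_cat", "Product"]
raw_rows = [
    ["25-34", "IT", "M", "50-60", "Gaming", "XPS 6610"],              # 1100
    ["35-44", "Research", "M", "60-70", "Business", "Latitude lat6"],  # 1101
    ["35-44", "Research", "M", "60-70", "Performance", "Inspiron 5810"],  # 1102
    ["25-34", "Lawyer", "F", "50-60", "Business", "Latitude lat5"],    # 1103
    ["45-54", "Business", "F", "40-50", "Performance", "Inspiron 5410"],  # 1104
]

def prefix_row(values):
    """Tag each value with its column name so matches are column-specific."""
    return frozenset(f"{col}:{val}" for col, val in zip(columns, values))

# Prefixing happens once per row, so the pairwise loop is unchanged.
data_set = [prefix_row(values) for values in raw_rows]
tot_len = len(columns)

half_matrix = [
    len(row1 & row2) / tot_len
    for row1, row2 in itertools.combinations(data_set, r=2)
]
print(half_matrix)
```

With the prefixes in place, "Occupation:Business" and "Product_cat:Business" are different set elements, so the pair (1101, 1104) scores 0. The cost of the pairwise loop stays the same because only the one-time row parsing changes.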
There are many other ways around this, including splitting each row into a set of k-mers and applying the Jaccard-based MinHash algorithm to compare these sets, but there is no need for that in your case.