Problem description
I have two lists of Stack Overflow questions, group A and group B. Both have two columns, Id and Tag, e.g.:
| Id | Tag |
|----|---------------------------------------------|
| 2  | c#,winforms,type-conversion,decimal,opacity |
For each question in group A, I need to find in group B all matching questions that share at least one tag with the question in group A, regardless of tag position. For example, these questions should all be matches:
| Id | Tag |
|----|--------------------------|
| 3  | c# |
| 4  | winforms,type-conversion |
| 5  | winforms,c# |
My first thought was to convert the Tag variable into a set and merge with Pandas, since sets ignore position. However, it seems that Pandas doesn't allow a set to be the key variable. So I am now using a for loop to search over group B, but it is extremely slow since I have 13 million observations in group B.
My questions are: 1. Is there any other way in Python to merge on a column of sets, and can it also tell the number of overlapping tags? 2. How can I improve the efficiency of the for-loop search?
Recommended answer
This can be done with df.merge and df.groupby.
This is the setup I'm working with:
df1 = pd.DataFrame({ 'Id' : [2], 'Tag' : [['c#', 'winforms', 'type-conversion', 'decimal', 'opacity']]})
Id Tag
0 2 [c#, winforms, type-conversion, decimal, opacity]
df2 = pd.DataFrame({ 'Id' : [3, 4, 5], 'Tag' : [['c#'], ['winforms', 'type-conversion'], ['winforms', 'c#']]})
Id Tag
0 3 [c#]
1 4 [winforms, type-conversion]
2 5 [winforms, c#]
Let's flatten out the Tag column in both data frames. This helps:
In [2330]: import numpy as np

In [2331]: from itertools import chain

In [2332]: def flatten(df):
      ...:     return pd.DataFrame({"Id": np.repeat(df.Id.values, df.Tag.str.len()),
      ...:                          "Tag": list(chain.from_iterable(df.Tag))})
      ...:
In [2333]: df1 = flatten(df1)
In [2334]: df2 = flatten(df2)
In [2335]: df1.head()
Out[2335]:
Id Tag
0 2 c#
1 2 winforms
2 2 type-conversion
3 2 decimal
4 2 opacity
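As an aside, newer pandas versions (0.25+) ship DataFrame.explode, which performs the same flattening without a helper function. A minimal sketch of the equivalent call:

```python
import pandas as pd

df1 = pd.DataFrame({'Id': [2],
                    'Tag': [['c#', 'winforms', 'type-conversion', 'decimal', 'opacity']]})

# explode turns each element of the list column into its own row,
# repeating the Id -- equivalent to the flatten() helper above
flat = df1.explode('Tag').reset_index(drop=True)
print(flat)
```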
df2 is flattened in the same way.
Now for the magic. We'll do a merge on the Tag column, and then a groupby on the joined Ids to find the count of overlapping tags.
In [2337]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y']).count().reset_index()
Out[2337]:
Id_x Id_y Tag
0 2 3 1
1 2 4 2
2 2 5 2
The output shows each pair of Ids along with the number of overlapping tags. Pairs with no overlap are filtered out by the merge, which performs an inner join.
df.count counts the overlapping tags, and df.reset_index just prettifies the output, since groupby assigns the grouped columns as the index, so we reset it.
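If the trailing reset_index feels noisy, groupby also accepts as_index=False to keep the grouping columns as ordinary columns. A small sketch on toy, already-flattened frames (one tag per row):

```python
import pandas as pd

# toy, already-flattened frames (one tag per row)
df1 = pd.DataFrame({'Id': [2, 2], 'Tag': ['c#', 'winforms']})
df2 = pd.DataFrame({'Id': [3, 5], 'Tag': ['c#', 'c#']})

# as_index=False keeps Id_x/Id_y as regular columns,
# so no trailing reset_index() is needed
out = df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y'], as_index=False).count()
print(out)
```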
To see the matching tags, modify the above slightly:
In [2359]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y'])['Tag'].apply(list).reset_index()
Out[2359]:
Id_x Id_y Tag
0 2 3 [c#]
1 2 4 [winforms, type-conversion]
2 2 5 [c#, winforms]
To filter out 1-overlaps, chain a df.query call onto the first expression:
In [2367]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y']).count().reset_index().query('Tag > 1')
Out[2367]:
Id_x Id_y Tag
1 2 4 2
2 2 5 2
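Putting the pieces together, the whole pipeline (flatten, merge, groupby) can be sketched end to end. This sketch assumes pandas 0.25+ and uses DataFrame.explode in place of the flatten helper, collecting both the overlap count and the shared tags in one pass:

```python
import pandas as pd

df1 = pd.DataFrame({'Id': [2],
                    'Tag': [['c#', 'winforms', 'type-conversion', 'decimal', 'opacity']]})
df2 = pd.DataFrame({'Id': [3, 4, 5],
                    'Tag': [['c#'], ['winforms', 'type-conversion'], ['winforms', 'c#']]})

pairs = (df1.explode('Tag')
            .merge(df2.explode('Tag'), on='Tag')   # inner join drops non-overlapping pairs
            .groupby(['Id_x', 'Id_y'])['Tag']
            .agg(['count', list])                  # overlap count + the shared tags
            .reset_index())

# keep only pairs sharing more than one tag
strong = pairs[pairs['count'] > 1]
print(pairs)
print(strong)
```

On 13 million rows, this stays entirely in vectorized pandas operations, avoiding the per-row Python loop from the question.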