- How can I perform a (INNER | (LEFT | RIGHT | FULL) OUTER) JOIN with pandas?
- How do I add NaNs for missing rows after a merge?
- How do I get rid of NaNs after merging?
- Can I merge on the index?
- How do I merge multiple DataFrames?
- Cross join with pandas
- merge? join? concat? update? Who? What? Why?!
... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.
This Q&A is meant to be the next installment in a series of helpful user guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).
Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.
Table of Contents
Merging basics - basic types of joins (read this first)
This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.
In particular, here's what this post will go through:
- The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
  - merging with different column names
  - merging with multiple columns
  - avoiding duplicate merge key column in output
What this post (and other posts by me on this thread) will not go through:
- Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
- Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!
Enough talk - just show me how to use merge!
Setup & Basics
import numpy as np
import pandas as pd

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
left
key value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right
key value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
For the sake of simplicity, the key column has the same name (for now).
An INNER JOIN keeps only the keys that appear in both frames. To perform an INNER JOIN, call merge on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.
left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
This returns only rows from left and right which share a common key (in this example, "B" and "D").
A LEFT OUTER JOIN, or LEFT JOIN, keeps every key from left. This can be performed by specifying how='left'.
left.merge(right, on='key', how='left')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.
And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN, which keeps every key from right, specify how='right':
left.merge(right, on='key', how='right')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
2 E NaN 0.950088
3 F NaN -0.151357
Here, keys from right are used, and missing data from left is replaced by NaN.
Finally, for the FULL OUTER JOIN, which keeps the keys from both frames, specify how='outer'.
left.merge(right, on='key', how='outer')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357
This uses the keys from both frames, and NaNs are inserted for missing rows in both.
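The documentation summarizes these various merges nicely: how='inner' uses the intersection of keys from both frames, how='left' and how='right' use the keys from the left or right frame only, and how='outer' uses the union of keys from both frames.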
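As a quick check of the four join types above, here is a minimal sketch that lists which keys survive each value of how. It assumes small hand-written frames with the same keys as the setup (the numeric values are made up for illustration, not the random ones shown earlier).

```python
import pandas as pd

# Hand-written stand-ins for the frames above; only the keys matter here.
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1.0, 2.0, 3.0, 4.0]})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5.0, 6.0, 7.0, 8.0]})

# Run the same merge once per join type and record the resulting keys.
keys_by_how = {
    how: left.merge(right, on='key', how=how)['key'].tolist()
    for how in ['inner', 'left', 'right', 'outer']
}

for how, keys in keys_by_how.items():
    print(f"{how:>5}: {keys}")
```

The inner join should report only B and D, the left and right joins the keys of their respective frames, and the outer join all six keys.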
Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs
If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs, you can do them in two steps.
For a LEFT-Excluding JOIN, which keeps only the rows of left whose keys do not appear in right, start by performing a LEFT OUTER JOIN and then filter down to the rows coming from left only (excluding everything else!):
(left.merge(right, on='key', how='left', indicator=True)
     .query('_merge == "left_only"')
     .drop(columns='_merge'))  # positional axis (.drop('_merge', 1)) fails on pandas 2.x
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
Where,
left.merge(right, on='key', how='left', indicator=True)
key value_x value_y _merge
0 A 1.764052 NaN left_only
1 B 0.400157 1.867558 both
2 C 0.978738 NaN left_only
3 D 2.240893 -0.977278 both
And similarly, for a RIGHT-Excluding JOIN,
(left.merge(right, on='key', how='right', indicator=True)
     .query('_merge == "right_only"')
     .drop(columns='_merge'))
key value_x value_y
2 E NaN 0.950088
3 F NaN -0.151357
Lastly, if you need a merge that retains only keys from the left or the right, but not both (in other words, an ANTI-JOIN), you can do this in similar fashion:
(left.merge(right, on='key', how='outer', indicator=True)
     .query('_merge != "both"')
     .drop(columns='_merge'))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
4 E NaN 0.950088
5 F NaN -0.151357
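The three excluding joins above follow the same pattern, so they can be wrapped in one function. The sketch below is only an illustration: excluding_join and its side argument are hypothetical names, not part of pandas, and the frames use made-up values with the same keys as the setup.

```python
import pandas as pd

def excluding_join(left, right, on, side):
    """Hypothetical helper: side is 'left_only', 'right_only', or 'both_excluded'."""
    # Pick the outer-style join that keeps all the candidate rows for this side.
    how = {'left_only': 'left', 'right_only': 'right', 'both_excluded': 'outer'}[side]
    merged = left.merge(right, on=on, how=how, indicator=True)
    if side == 'both_excluded':
        mask = merged['_merge'] != 'both'   # ANTI-JOIN: drop keys found in both
    else:
        mask = merged['_merge'] == side     # keep rows found on one side only
    return merged[mask].drop(columns='_merge')

left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1.0, 2.0, 3.0, 4.0]})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5.0, 6.0, 7.0, 8.0]})

left_excl = excluding_join(left, right, 'key', 'left_only')
right_excl = excluding_join(left, right, 'key', 'right_only')
anti = excluding_join(left, right, 'key', 'both_excluded')
print(left_excl, right_excl, anti, sep='\n\n')
```

The indicator column is dropped by name with drop(columns=...), which avoids the positional axis argument that newer pandas versions reject.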
Different names for key columns
If the key columns are named differently (for example, left has keyLeft, and right has keyRight instead of key), then you will have to specify left_on and right_on as arguments instead of on:
left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)
left2
keyLeft value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right2
keyRight value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
keyLeft value_x keyRight value_y
0 B 0.400157 B 1.867558
1 D 2.240893 D -0.977278
Avoiding duplicate key column in output
When merging on keyLeft from left and keyRight from right, if you only want either keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.
left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')
value_x keyRight value_y
0 0.400157 B 1.867558
1 2.240893 D -0.977278
Contrast this with the output of the command just before (that is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')), and you'll notice keyLeft is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
Merging only a single column from one of the DataFrames
For example, consider
right3 = right.assign(newcol=np.arange(len(right)))
right3
key value newcol
0 B 1.867558 0
1 D -0.977278 1
2 E 0.950088 2
3 F -0.151357 3
If you are required to merge only "newcol" (without any of the other columns), you can usually just subset columns before merging:
left.merge(right3[['key', 'newcol']], on='key')
key value newcol
0 B 0.400157 0
1 D 2.240893 1
If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:
# left['newcol'] = left['key'].map(right3.set_index('key')['newcol'])
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
As mentioned, this is similar to, but faster than
left.merge(right3[['key', 'newcol']], on='key', how='left')
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
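The equivalence between the map approach and the left merge can be checked directly. This is a minimal sketch with made-up values; note that it assumes the keys in right3 are unique, since the lookup Series built with set_index needs a unique index for map to work.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1.0, 2.0, 3.0, 4.0]})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5.0, 6.0, 7.0, 8.0]})
right3 = right.assign(newcol=np.arange(len(right)))

# Build a key -> newcol lookup (assumes right3['key'] has no duplicates).
lookup = right3.set_index('key')['newcol']

via_map = left.assign(newcol=left['key'].map(lookup))
via_merge = left.merge(right3[['key', 'newcol']], on='key', how='left')

# Compare the new column from both approaches (NaNs in the same places count as equal).
same = via_map['newcol'].equals(via_merge['newcol'])
print(via_map)
print("map and left merge agree:", same)
```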
Merging on multiple columns
To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).
left.merge(right, on=['key1', 'key2'] ...)
Or, in the event the names are different,
left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
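Since the frames in this post have only one key column, here is a small runnable sketch of a two-column merge. The frames orders and prices and their column names are hypothetical, made up purely for illustration.

```python
import pandas as pd

# Hypothetical frames that share two key columns, key1 and key2.
orders = pd.DataFrame({
    'key1': ['A', 'A', 'B', 'B'],
    'key2': [1, 2, 1, 2],
    'left_val': [10, 20, 30, 40],
})
prices = pd.DataFrame({
    'key1': ['A', 'B', 'B', 'C'],
    'key2': [1, 1, 2, 1],
    'right_val': [100, 200, 300, 400],
})

# A row matches only when BOTH key1 and key2 agree.
merged = orders.merge(prices, on=['key1', 'key2'])
print(merged)
```

Only the (A, 1), (B, 1), and (B, 2) pairs exist in both frames, so the default inner join returns three rows.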
Other useful merge* operations and functions
- Merging a DataFrame with Series on index: See this answer.
- Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.
- pd.merge_ordered is a useful function for ordered JOINs.
- pd.merge_asof (read: merge_asOf) is useful for approximate joins.
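As a taste of what an approximate join looks like, here is a minimal pd.merge_asof sketch. The trades and quotes frames are hypothetical, and the example relies on the default behaviour of matching each left row to the most recent right row at or before it (both frames must be sorted on the key).

```python
import pandas as pd

# Hypothetical data: both frames are sorted by 'time', as merge_asof requires.
trades = pd.DataFrame({'time': [1, 5, 10], 'qty': [100, 200, 300]})
quotes = pd.DataFrame({'time': [2, 3, 7], 'bid': [100.0, 101.0, 102.0]})

# For each trade, pick the latest quote whose time is <= the trade's time.
result = pd.merge_asof(trades, quotes, on='time')
print(result)
```

The first trade happens before any quote exists, so its bid is NaN; the other two pick up the quotes at times 3 and 7.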
This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specifications.