Problem description
I have a dataframe df in pandas that was built using pandas.read_table from a CSV file. The dataframe has several columns and is indexed by one of them (which is unique: each row has a distinct value in the column used for indexing).
How can I select rows of my dataframe based on a "complex" filter applied to multiple columns? I can easily select the slice of the dataframe where column colA is greater than 10, for example:
df_greater_than10 = df[df["colA"] > 10]
But what if I wanted a filter like: select the slice of df where any of the columns is greater than 10?
Or where the value for colA is greater than 10 but the value for colB is less than 5?
How are these implemented in pandas? Thanks.
Recommended answer
I encourage you to pose these questions on the mailing list, but in any case, it's still very much a low-level affair working with the underlying NumPy arrays. For example, to select rows where the value in any column exceeds, say, 1.5 in this example:
In [11]: df
Out[11]:
A B C D
2000-01-03 -0.59885 -0.18141 -0.68828 -0.77572
2000-01-04 0.83935 0.15993 0.95911 -1.12959
2000-01-05 2.80215 -0.10858 -1.62114 -0.20170
2000-01-06 0.71670 -0.26707 1.36029 1.74254
2000-01-07 -0.45749 0.22750 0.46291 -0.58431
2000-01-10 -0.78702 0.44006 -0.36881 -0.13884
2000-01-11 0.79577 -0.09198 0.14119 0.02668
2000-01-12 -0.32297 0.62332 1.93595 0.78024
2000-01-13 1.74683 -1.57738 -0.02134 0.11596
2000-01-14 -0.55613 0.92145 -0.22832 1.56631
2000-01-17 -0.55233 -0.28859 -1.18190 -0.80723
2000-01-18 0.73274 0.24387 0.88146 -0.94490
2000-01-19 0.56644 -0.49321 1.17584 -0.17585
2000-01-20 1.56441 0.62331 -0.26904 0.11952
2000-01-21 0.61834 0.17463 -1.62439 0.99103
2000-01-24 0.86378 -0.68111 -0.15788 -0.16670
2000-01-25 -1.12230 -0.16128 1.20401 1.08945
2000-01-26 -0.63115 0.76077 -0.92795 -2.17118
2000-01-27 1.37620 -1.10618 -0.37411 0.73780
2000-01-28 -1.40276 1.98372 1.47096 -1.38043
2000-01-31 0.54769 0.44100 -0.52775 0.84497
2000-02-01 0.12443 0.32880 -0.71361 1.31778
2000-02-02 -0.28986 -0.63931 0.88333 -2.58943
2000-02-03 0.54408 1.17928 -0.26795 -0.51681
2000-02-04 -0.07068 -1.29168 -0.59877 -1.45639
2000-02-07 -0.65483 -0.29584 -0.02722 0.31270
2000-02-08 -0.18529 -0.18701 -0.59132 -1.15239
2000-02-09 -2.28496 0.36352 1.11596 0.02293
2000-02-10 0.51054 0.97249 1.74501 0.20525
2000-02-11 0.10100 0.27722 0.65843 1.73591
In [12]: df[(df.values > 1.5).any(axis=1)]
Out[12]:
A B C D
2000-01-05 2.8021 -0.1086 -1.62114 -0.2017
2000-01-06 0.7167 -0.2671 1.36029 1.7425
2000-01-12 -0.3230 0.6233 1.93595 0.7802
2000-01-13 1.7468 -1.5774 -0.02134 0.1160
2000-01-14 -0.5561 0.9215 -0.22832 1.5663
2000-01-20 1.5644 0.6233 -0.26904 0.1195
2000-01-28 -1.4028 1.9837 1.47096 -1.3804
2000-02-10 0.5105 0.9725 1.74501 0.2052
2000-02-11 0.1010 0.2772 0.65843 1.7359
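The same "any column exceeds a threshold" filter can also be sketched without dropping to the raw array, since comparing a DataFrame to a scalar yields a boolean DataFrame that can be reduced across columns. The toy frame below is hypothetical data made up for illustration, not the frame from the session above.

```python
import pandas as pd

# Hypothetical toy data (assumption): three rows, three numeric columns.
df = pd.DataFrame(
    {"A": [0.2, 2.8, -0.5], "B": [0.1, -0.3, 0.4], "C": [1.9, 0.0, 1.2]},
    index=pd.to_datetime(["2000-01-03", "2000-01-04", "2000-01-05"]),
)

# Element-wise comparison gives a boolean DataFrame;
# any(axis=1) is True for a row if at least one column exceeds 1.5.
mask = (df > 1.5).any(axis=1)
any_over = df[mask]

# Equivalent mask built on the underlying NumPy array.
mask_np = (df.to_numpy() > 1.5).any(axis=1)

print(any_over)
```

Both masks select the same rows here; the pandas form keeps the index attached to the mask, which can help when the mask is reused elsewhere.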
Multiple conditions have to be combined using & or | (and parentheses!):
In [13]: df[(df['A'] > 1) | (df['B'] < -1)]
Out[13]:
A B C D
2000-01-05 2.80215 -0.1086 -1.62114 -0.2017
2000-01-13 1.74683 -1.5774 -0.02134 0.1160
2000-01-20 1.56441 0.6233 -0.26904 0.1195
2000-01-27 1.37620 -1.1062 -0.37411 0.7378
2000-02-04 -0.07068 -1.2917 -0.59877 -1.4564
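For the case asked about in the question (colA greater than 10 but colB less than 5), the same pattern applies with & in place of |. The following is a minimal sketch on hypothetical data; the column names come from the question, while the values and index labels are made up for illustration.

```python
import pandas as pd

# Hypothetical data (assumption); only the column names come from the question.
df = pd.DataFrame(
    {"colA": [12, 3, 15, 11, 5], "colB": [4, 1, 9, 2, 8]},
    index=["w", "x", "y", "z", "v"],
)

# Both conditions must hold: element-wise AND.
# The parentheses are required because & binds more tightly than > and <.
both = df[(df["colA"] > 10) & (df["colB"] < 5)]

# At least one condition holds: element-wise OR.
either = df[(df["colA"] > 10) | (df["colB"] < 5)]

print(both)
print(either)
```

Python's `and` / `or` keywords would not work here, since they try to evaluate a whole Series as a single truth value and raise an error.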
I'd be very interested to have some kind of query API to make these kinds of things easier.
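Later pandas releases do include a `DataFrame.query` method that accepts a filter as a string expression. A minimal sketch, assuming a pandas version that provides it and using hypothetical data with the column names from the question:

```python
import pandas as pd

# Hypothetical data (assumption); only the column names come from the question.
df = pd.DataFrame({"colA": [12, 3, 15], "colB": [4, 1, 9]})

# Column names are referenced directly inside the expression string.
result = df.query("colA > 10 and colB < 5")

print(result)
```

This is equivalent to the boolean-mask form `df[(df["colA"] > 10) & (df["colB"] < 5)]`; which one reads better is largely a matter of taste.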