Given the following DataFrame with multi-level columns:

import numpy as np
import pandas as pd

arrays = [['foo', 'foo', 'bar', 'bar'],
          ['A', 'B', 'C', 'D']]
tuples = list(zip(*arrays))
columnValues = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame(np.random.rand(6, 4), columns=columnValues)
df['txt'] = 'aaa'
print(df)


Output:

        foo                 bar            txt
          A         B         C         D
0  0.080029  0.710943  0.157265  0.774827  aaa
1  0.276949  0.923369  0.550799  0.758707  aaa
2  0.416714  0.440659  0.835736  0.130818  aaa
3  0.935763  0.908967  0.502363  0.677957  aaa
4  0.191245  0.291017  0.014355  0.762976  aaa
5  0.365464  0.286350  0.450263  0.509556  aaa


Question: for a large DataFrame, how can I efficiently change the values in the sub-columns of foo to 100 wherever the value is < 0.5?



The following works:

In [41]: df.foo < 0.5
Out[41]:
       A      B
0   True  False
1   True  False
2   True   True
3  False  False
4   True   True
5   True   True

In [42]: df.foo[df.foo < 0.5]
Out[42]:
          A         B
0  0.080029       NaN
1  0.276949       NaN
2  0.416714  0.440659
3       NaN       NaN
4  0.191245  0.291017
5  0.365464  0.286350


However, if I try to change the values, pandas emits a SettingWithCopyWarning:

In [45]: df.foo[df.foo < 0.5] = 100
C:\Users\USER\AppData\Local\Programs\Python35\Scripts\ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead


If I try using .loc:

In [46]: df.foo.loc[df.foo < 0.5] = 100
...
ValueError: cannot copy sequence with size 2 to array axis with dimension 6


I get the same error with df.foo.loc[df.foo < 0.5, 'foo'] = 100.

If I try:

df.loc[df.foo < 0.5, 'foo']


I get:

KeyError: 'None of [       A      B\n0   True  False\n1   True  False\n2   True   True\n3  False  False\n4   True   True\n5   True   True] are in the [index]'
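
The KeyError appears to come from the shape of the mask: the row indexer of .loc expects a one-dimensional boolean mask (one flag per row), while df.foo < 0.5 is a two-dimensional DataFrame. A minimal sketch, assuming a small illustrative frame built from three rows of the output above (the 'txt' column is left out for brevity): reducing the mask with .any(axis=1) gives a valid row indexer, but it selects whole rows, so it cannot express a cell-by-cell replacement.

```python
import pandas as pd

# Illustrative frame (assumption): three rows copied from the question's output.
columns = pd.MultiIndex.from_tuples([("foo", "A"), ("foo", "B"), ("bar", "C")])
df = pd.DataFrame(
    [[0.080029, 0.710943, 0.157265],
     [0.416714, 0.440659, 0.835736],
     [0.935763, 0.908967, 0.502363]],
    columns=columns,
)

cell_mask = df["foo"] < 0.5        # 2-D: one flag per cell under foo
row_mask = cell_mask.any(axis=1)   # 1-D: True if any foo cell in the row is < 0.5

# A boolean Series is a valid row indexer; a boolean DataFrame is not.
selected = df.loc[row_mask, "foo"]
print(selected)
```

For a per-cell replacement, a method that accepts a 2-D mask (such as where or mask) is the more natural fit.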




Solutions: timing comparison on a DataFrame with 10M rows:

In [19]: %timeit df.foo.applymap(lambda x: x if x >= 0.5 else 100)
1 loop, best of 3: 29.4 s per loop

In [20]: %timeit df.foo[df.foo >= 0.5].fillna(100)
1 loop, best of 3: 1.55 s per loop


John Galt:

In [21]: %timeit df.foo.where(df.foo < 0.5, 100)
1 loop, best of 3: 1.12 s per loop


B.M.:

In [5]: %timeit u=df['foo'].values;u[u<.5]=100
1 loop, best of 3: 628 ms per loop
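
The NumPy timing above assigns into df['foo'].values, which changes df only if that array happens to be a view on the frame's data; pandas does not guarantee this, and with Copy-on-Write enabled it should not be relied on. A hedged sketch of the same idea that writes the result back explicitly, assuming a small illustrative frame rather than the 10M-row benchmark:

```python
import numpy as np
import pandas as pd

# Illustrative frame (assumption): three rows copied from the question's output.
columns = pd.MultiIndex.from_tuples([("foo", "A"), ("foo", "B"), ("bar", "C")])
df = pd.DataFrame(
    [[0.080029, 0.710943, 0.157265],
     [0.416714, 0.440659, 0.835736],
     [0.935763, 0.908967, 0.502363]],
    columns=columns,
)

foo = df["foo"]
u = foo.to_numpy()

# Build a new array instead of mutating u in place:
# 100 where the value is below 0.5, the original value otherwise.
replaced = np.where(u < 0.5, 100.0, u)

# Write back through the parent frame, keeping foo's index and sub-column labels.
df["foo"] = pd.DataFrame(replaced, index=foo.index, columns=foo.columns)
print(df)
```

This trades a little of the in-place speed for a result that does not depend on whether .values returned a view or a copy.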

Best answer

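Here is one way, using where: df['foo'] = df['foo'].where(df['foo'] < 0.5, 100). Note that where keeps the values for which the condition is True and replaces the rest, so as written this replaces the values that are >= 0.5 (as the output shows). To replace the values < 0.5, as the question asks, invert the condition to df['foo'] >= 0.5.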

In [96]: df
Out[96]:
        foo                 bar            txt
          A         B         C         D
0  0.255309  0.237892  0.491065  0.930555  aaa
1  0.859998  0.008269  0.376213  0.984806  aaa
2  0.479928  0.761266  0.993970  0.266486  aaa
3  0.078284  0.009748  0.461687  0.653085  aaa
4  0.923293  0.642398  0.629140  0.561777  aaa
5  0.936824  0.526626  0.413250  0.732074  aaa

In [97]: df['foo'] = df['foo'].where(df['foo'] < 0.5, 100)

In [98]: df
Out[98]:
          foo                   bar            txt
            A           B         C         D
0    0.255309    0.237892  0.491065  0.930555  aaa
1  100.000000    0.008269  0.376213  0.984806  aaa
2    0.479928  100.000000  0.993970  0.266486  aaa
3    0.078284    0.009748  0.461687  0.653085  aaa
4  100.000000  100.000000  0.629140  0.561777  aaa
5  100.000000  100.000000  0.413250  0.732074  aaa
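
A sketch of the variant that replaces the values below 0.5 instead, using DataFrame.mask, which is the complement of where (it replaces where the condition is True). The small frame and its values are illustrative assumptions, copied from three rows of the question's output:

```python
import pandas as pd

# Illustrative frame (assumption): three rows copied from the question's output.
columns = pd.MultiIndex.from_tuples([("foo", "A"), ("foo", "B"), ("bar", "C")])
df = pd.DataFrame(
    [[0.080029, 0.710943, 0.157265],
     [0.416714, 0.440659, 0.835736],
     [0.935763, 0.908967, 0.502363]],
    columns=columns,
)

foo = df["foo"]
below = foo < 0.5

via_mask = foo.mask(below, 100)     # replace where the condition is True
via_where = foo.where(~below, 100)  # keep where the inverted condition is True
same = via_mask.equals(via_where)   # both spellings give the same frame here

# Assign once through the parent frame to avoid chained indexing.
df["foo"] = via_mask
print(df)
```

Assigning to df['foo'] in a single step sidesteps the SettingWithCopyWarning seen in the question, because no intermediate object is being modified.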

Regarding "python - Conditionally change values in a Pandas DF with multi-level columns", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36700207/
