What is going on when I modify a pandas DataFrame in the following way?

Question

I'm trying to understand this behavior: why it happens and, if it was intentional, what the motivation was for doing it this way.

So I create a DataFrame:

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.random((4, 2)))


          0         1
0  0.548814  0.715189
1  0.602763  0.544883
2  0.423655  0.645894
3  0.437587  0.891773

I can reference a column like this:

df.columns = ['a','b']
df.a
0    0.548814
1    0.602763
2    0.423655
3    0.437587
Name: a, dtype: float64

I can even make what I think is a new column:

df.third = pd.DataFrame(np.random.random((4,1)))

But df is still:

df
          a         b
0  0.548814  0.715189
1  0.602763  0.544883
2  0.423655  0.645894
3  0.437587  0.891773

However, df.third also exists (although I can't see it in the variable viewer in Spyder):

df.third
          0
0  0.118274
1  0.639921
2  0.143353
3  0.944669

If I wanted to add a third column, I'd have to do the following:

df['third'] = pd.DataFrame(np.random.random((4,1)))

          a         b     third
0  0.548814  0.715189  0.568045
1  0.602763  0.544883  0.925597
2  0.423655  0.645894  0.071036
3  0.437587  0.891773  0.087129

So, my question is: what's going on when I do df.third versus df['third']?

Recommended answer

Because df.third = ... added third as an attribute of the DataFrame object rather than as a column. You should stop accessing columns as attributes and always use df['third'] to avoid ambiguous behaviour.
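To make the ambiguity concrete, here is a minimal sketch (not part of the original answer) in which the same name exists both as an attribute and as a column. The small integer frame is an illustrative assumption, and the warning filter is there only because recent pandas versions may warn when a list-like value is assigned to a new attribute name.

```python
import warnings

import pandas as pd

# Illustrative data (assumption): a small frame with two columns.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})

with warnings.catch_warnings():
    # Assumption: newer pandas versions may emit a UserWarning on the
    # attribute assignment below; it is silenced to keep the output clean.
    warnings.simplefilter("ignore")
    df.third = [0, 0, 0, 0]  # plain Python attribute, not a column

columns_after_attribute = list(df.columns)

df["third"] = [10, 20, 30, 40]  # item assignment creates a real column
columns_after_item = list(df.columns)

# The instance attribute is found first, so df.third does NOT return the column.
attribute_value = df.third
column_value = df["third"].tolist()

print(columns_after_attribute)  # ['a', 'b']
print(columns_after_item)       # ['a', 'b', 'third']
print(attribute_value)          # [0, 0, 0, 0]
print(column_value)             # [10, 20, 30, 40]
```

After both assignments, df.third and df['third'] refer to different objects, which is the kind of silent mismatch that bracket access avoids.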

You should get into the habit of always accessing and assigning columns using df[col_name]. This avoids problems like:

df.mean = some_calc()

The problem here is that mean is a method of DataFrame, so you have overwritten that method with some computed value.
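A small sketch of that failure mode, assuming a toy frame and a constant standing in for the result of some_calc() (both are illustrative, not from the original answer):

```python
import pandas as pd

# Illustrative data (assumption).
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Before the assignment, mean is still the DataFrame method.
column_means = df.mean()

# Stand-in for `df.mean = some_calc()`: the value shadows the method
# on this particular instance.
df.mean = 42.0

error = None
try:
    df.mean()  # now tries to call a float
except TypeError as exc:
    error = exc

# The method itself still exists on the class and can be called explicitly.
unbound_means = pd.DataFrame.mean(df)

print(column_means["a"], column_means["b"])  # 2.0 5.0
print(type(error).__name__)                  # TypeError
print(unbound_means["a"], unbound_means["b"])  # 2.0 5.0
```

Only this one instance is affected, which makes the bug easy to miss: other DataFrames keep working while this one fails later, far from the assignment.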

The underlying issue is that attribute access was part of the design as a convenience, and the pandas for data analysis book and some early online video presentations showed it as a way of assigning to a new column. But the subtle errors can be so pervasive that it really should be banned and removed, IMO.

Seriously, I can't stress this enough: stop referring to columns as attributes. It's a serious bugbear of mine, and unfortunately I still see lots of answers posted showing this usage.
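Reading columns through attributes has a related pitfall, sketched below under the assumption of a frame whose column name happens to match a DataFrame method (the column names and values are made up for illustration):

```python
import pandas as pd

# Illustrative data (assumption): "count" is also the name of a DataFrame method.
df = pd.DataFrame({"count": [10, 20, 30], "price": [1.5, 2.5, 3.5]})

by_attribute = df.count   # the DataFrame.count method, not the column
by_item = df["count"]     # the column, as a Series

print(callable(by_attribute))  # True
print(by_item.tolist())        # [10, 20, 30]
print(df.price.tolist())       # [1.5, 2.5, 3.5]
```

Attribute access works for price but not for count, so whether df.name returns a column depends on what else is defined on the object. Bracket access behaves the same for both.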

You can see that no new column is added:

In [97]:
df.third = pd.DataFrame(np.random.random((4,1)))
df.columns

Out[97]:
Index(['a', 'b'], dtype='object')

You can see that third was added as an attribute:

In [98]:
df.__dict__

Out[98]:
{'_data': BlockManager
 Items: Index(['a', 'b'], dtype='object')
 Axis 1: Int64Index([0, 1, 2, 3], dtype='int64')
 FloatBlock: slice(0, 2, 1), 2 x 4, dtype: float64,
 '_iloc': <pandas.core.indexing._iLocIndexer at 0x7e73b00>,
 '_item_cache': {},
 'is_copy': None,
 'third':           0
 0  0.844821
 1  0.286501
 2  0.459170
 3  0.243452}

You can see the DataFrame's internal entries (_data, which holds Items and Axis 1, and so on), but you also have 'third', which is an ordinary attribute.
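The internal keys printed above (such as _data) may differ between pandas versions, so a check that does not rely on them can be more robust. A minimal sketch, assuming the same setup as in the question and silencing the warning that recent pandas versions may emit:

```python
import warnings

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.random((4, 2)), columns=["a", "b"])

with warnings.catch_warnings():
    # Assumption: newer pandas versions may warn about this assignment.
    warnings.simplefilter("ignore")
    df.third = pd.DataFrame(np.random.random((4, 1)))

is_column = "third" in df.columns            # membership in the column index
is_instance_attribute = "third" in vars(df)  # vars(df) is df.__dict__

print(is_column)              # False
print(is_instance_attribute)  # True
```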

