Problem description
I have a DataFrame df and a dict d, like so:
>>> df
   a   b
0  5  10
1  6  11
2  7  12
3  8  13
4  9  14
>>> d = {6: 22, 8: 26}
For every (key, val) in the dictionary, I'd like to find the row where column a matches the key, and override its b column with the value. For example, in this particular case, the value of b in row 1 will change to 22, and its value in row 3 will change to 26.
How should I do that?
Assuming it would be OK to propagate the new values to all rows where column a matches (in the event there were duplicates in column a), then:
for a_val, b_val in d.items():  # d.iteritems() on Python 2
    df['b'][df.a == a_val] = b_val
Or, to avoid chained assignment:
for a_val, b_val in d.items():  # d.iteritems() on Python 2
    df.loc[df.a == a_val, 'b'] = b_val
Note that to use loc you must be working with Pandas 0.11 or newer. For older versions, you may be able to use .ix to prevent the chained assignment (.ix was later deprecated and then removed in pandas 1.0).
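For a version that runs as-is on Python 3, here is a minimal sketch that rebuilds the example frame from the question and applies the loc loop. The helper names (make_example, override_with_loop, override_with_map) are made up for illustration. override_with_map is a loop-free variant based on Series.map that is not part of the original answer, so treat it as one possible alternative rather than the recommended way.

```python
import pandas as pd


def make_example():
    """Rebuild the DataFrame and dict shown in the question."""
    df = pd.DataFrame({"a": [5, 6, 7, 8, 9], "b": [10, 11, 12, 13, 14]})
    d = {6: 22, 8: 26}
    return df, d


def override_with_loop(df, d):
    """One .loc assignment per dictionary entry (hypothetical helper)."""
    out = df.copy()  # work on a copy so the caller's frame is untouched
    for a_val, b_val in d.items():
        out.loc[out["a"] == a_val, "b"] = b_val
    return out


def override_with_map(df, d):
    """Loop-free variant (an assumption, not from the answer)."""
    out = df.copy()
    mask = out["a"].isin(list(d))  # rows whose 'a' value is a dict key
    out.loc[mask, "b"] = out.loc[mask, "a"].map(d)
    return out


df, d = make_example()
looped = override_with_loop(df, d)
mapped = override_with_map(df, d)
print(looped)
```

Both helpers update every row whose a value appears in d, so duplicates in column a are all overwritten, which matches the assumption stated above.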
@Jeff pointed to this link which discusses a phenomenon that I had already mentioned in this comment. Note that this is not an issue of correctness, since reversing the order of access has a predictable effect. You can see this easily, e.g. below:
In [102]: id(df[df.a==5]['b'])
Out[102]: 113795992
In [103]: id(df['b'][df.a==5])
Out[103]: 113725760
If you get the column first and then assign based on indexes into that column, the changes affect that column. And since the column is part of the DataFrame, the changes affect the DataFrame. If you index a set of rows first, you are no longer talking about the same DataFrame, so getting the column from the filtered object won't give you a view of the original column.
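The rows-first case can be checked directly. The sketch below is an added illustration, not part of the original session. It uses numpy.shares_memory to confirm that the filtered object's b data is separate from df's. It then writes to the filtered object, with warnings silenced because pandas may warn about setting values on a copy, and shows that df is unchanged.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [5, 6, 7, 8, 9], "b": [10, 11, 12, 13, 14]})

# Rows first: boolean-mask indexing builds a new DataFrame.
rows_first = df[df.a == 6]

# The filtered column does not share a buffer with the original column.
shares_buffer = np.shares_memory(rows_first["b"].to_numpy(), df["b"].to_numpy())

# Writing to the filtered object changes only that object.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas may warn about setting on a copy
    rows_first.loc[:, "b"] = 22

print(shares_buffer)
print(df["b"].tolist())
print(rows_first["b"].tolist())
```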
@Jeff suggests that this makes it "incorrect", whereas my view is that this is the obvious and expected behavior. In the special case where you have a mixed data type column and some type promotion/demotion would prevent Pandas from writing a value into the column, you might have a correctness issue with this. But given that loc is not available until Pandas 0.11, I think it's still fair to point out how to do it with chained assignment, rather than pretending that loc is the only thing that could possibly ever be the correct choice.
If anyone can provide more definitive reasons to think it is "incorrect" (as opposed to just not preferring this stylistically), please contribute and I will try to make a more thorough write-up about the various pitfalls.