Problem description
From a pandas newbie: I have data that looks essentially like this:
import pandas as pd
data1=pd.DataFrame({'Dir':['E','E','W','W','E','W','W','E'], 'Bool':['Y','N','Y','N','Y','N','Y','N'], 'Data':[4,5,6,7,8,9,10,11]}, index=pd.DatetimeIndex(['12/30/2000','12/30/2000','12/30/2000','1/2/2001','1/3/2001','1/3/2001','12/30/2000','12/30/2000']))
data1
Out[1]:
           Bool  Data Dir
2000-12-30    Y     4   E
2000-12-30    N     5   E
2000-12-30    Y     6   W
2001-01-02    N     7   W
2001-01-03    Y     8   E
2001-01-03    N     9   W
2000-12-30    Y    10   W
2000-12-30    N    11   E
And I want to group it by multiple levels, then do a cumsum():
E.g., something like running_sum = data1.groupby(['Bool','Dir']).cumsum()  <- (doesn't work)
with output that would look something like:
Bool Dir Date        running_sum
N    E   2000-12-30           16
     W   2001-01-02            7
         2001-01-03           16
Y    E   2000-12-30            4
         2001-01-03           12
     W   2000-12-30           16
My "like" code is clearly not even close. I have made a number of attempts and learned many new things about how not to do this.
Thanks for any help you can give.
Try this:
data2 = data1.reset_index()
data3 = data2.set_index(["Bool", "Dir", "index"]) # index is the new column created by reset_index
running_sum = data3.groupby(level=[0,1,2]).sum().groupby(level=[0,1]).cumsum()
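For anyone who wants to run the snippet above end to end, here is a self-contained sketch of the same steps. The ISO date strings and the intermediate name per_date are my own choices for illustration and are not part of the original answer. It also assumes a pandas version where reset_index() names the new column "index" when the original index has no name.

```python
import pandas as pd

# Same data as in the question; dates written in ISO form (an assumption
# made here only to keep parsing unambiguous).
data1 = pd.DataFrame(
    {
        "Dir": ["E", "E", "W", "W", "E", "W", "W", "E"],
        "Bool": ["Y", "N", "Y", "N", "Y", "N", "Y", "N"],
        "Data": [4, 5, 6, 7, 8, 9, 10, 11],
    },
    index=pd.DatetimeIndex(
        ["2000-12-30", "2000-12-30", "2000-12-30", "2001-01-02",
         "2001-01-03", "2001-01-03", "2000-12-30", "2000-12-30"]
    ),
)

# Move the dates into a column, then build a three-level index.
data2 = data1.reset_index()
data3 = data2.set_index(["Bool", "Dir", "index"])

# Step 1: collapse duplicate (Bool, Dir, date) rows into one total each.
per_date = data3.groupby(level=[0, 1, 2]).sum()

# Step 2: running total over dates within each (Bool, Dir) pair.
running_sum = per_date.groupby(level=[0, 1]).cumsum()

print(running_sum)
# The Data column comes out as 16, 7, 16, 4, 12, 16,
# matching the table the question asks for.
```

Splitting the one-liner into per_date and running_sum is only for readability; the chained form in the answer does the same thing.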
The reason you cannot simply use cumsum on data3 has to do with how your data is structured. Grouping by Bool and Dir and applying an aggregation function (sum, mean, etc.) would produce a DataFrame smaller than the one you started with, because the function collapses the values in each group into a single row per group key. However, cumsum is not an aggregation function. It will return a DataFrame that is the same size as the one it is called on. So unless your input DataFrame is in a format where the output can be the same size after calling cumsum, it will throw an error. That's why I called sum first, which returns a DataFrame in the correct input format.
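The size difference described above can be checked on a tiny frame. This is just a sketch with made-up values (the frame df and the names totals and running are hypothetical, not from the question), assuming a reasonably recent pandas:

```python
import pandas as pd

# Hypothetical four-row frame with two (Bool, Dir) groups.
df = pd.DataFrame({
    "Bool": ["Y", "N", "Y", "N"],
    "Dir": ["E", "E", "E", "E"],
    "Data": [1, 2, 3, 4],
})

grouped = df.groupby(["Bool", "Dir"])["Data"]

totals = grouped.sum()      # aggregation: one value per group
running = grouped.cumsum()  # not an aggregation: one value per input row

print(len(df), len(totals), len(running))  # 4 2 4
print(running.tolist())                    # [1, 2, 4, 6]
```

On the pandas versions I have used recently, calling cumsum directly on a column-based groupby like this runs without raising, but it gives one running value per original row. That is why duplicate dates still need the sum step first if you want one row per date.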
Sorry if I haven't explained this well enough. Maybe someone else could help me out?