Question
I want to count, within each group of a groupby, how many times the value increases from one row to the next, and also compute the difference between the group's first and last elements. But I can't apply my functions to the groupby. After groupby, is the result a list? Also, what is the difference between "apply" and "agg"? Sorry, I have only been using Python for a few days.
def promotion(ls):
    pro = 0
    if len(ls) > 1:
        for j in range(1, len(ls)):
            if ls[j] > ls[j-1]:
                pro += 1
    return pro

def growth(ls):
    head = ls[0]
    tail = ls[len(ls)-1]
    gro = tail - head
    return gro
titlePromotion= JobData.groupby("candidate_id")["TitleLevel"].apply(promotion)
titleGrowth= JobData.groupby("candidate_id")["TitleLevel"].apply(growth)
The data is:
candidate_id TitleLevel othercols
1 2 foo
2 1 bar
2 2 goo
2 1 gar
The result should be
titlePromotion
candidate_id
1 0
2 1
titleGrowth
candidate_id
1 0
2 0
Recommended answer
import pandas as pd
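def promotion(ls):
    # Count the rows whose value is greater than the previous row's value.
    return (ls.diff() > 0).sum()

def growth(ls):
    # Last value minus first value, selected by position.
    return ls.iloc[-1] - ls.iloc[0]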
jobData = pd.DataFrame(
{'candidate_id': [1, 2, 2, 2],
'TitleLevel': [2, 1, 2, 1]})
grouped = jobData.groupby("candidate_id")
titlePromotion = grouped["TitleLevel"].agg(promotion)
print(titlePromotion)
# candidate_id
# 1 0
# 2 1
# dtype: int64
titleGrowth = grouped["TitleLevel"].agg(growth)
print(titleGrowth)
# candidate_id
# 1 0
# 2 0
# dtype: int64
Some tips:
If you define a generic function
def foo(ls):
    print(type(ls))
and call
jobData.groupby("candidate_id")["TitleLevel"].apply(foo)
Python will print
<class 'pandas.core.series.Series'>
This is a low-brow but effective way to discover that calling jobData.groupby(...)[...].apply(foo) passes a Series to foo.
The apply method calls foo once for every group. It returns a Series or a DataFrame with the resulting chunks glued together. It is possible to use apply when foo returns an object such as a numerical value or string, but in such cases I think using agg is preferred. A typical use case for apply is when you want to, say, square every value in a group and thus need to return a new group of the same shape.
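As a small illustration of the squaring use case, here is a minimal sketch on the question's sample data. The helper name square is made up for this example, and group_keys=False is an assumption added so that the result keeps the original row index across pandas versions.

```python
import pandas as pd

jobData = pd.DataFrame(
    {'candidate_id': [1, 2, 2, 2],
     'TitleLevel': [2, 1, 2, 1]})

def square(ls):
    # ls is the Series of TitleLevel values for one candidate_id.
    # Returning a Series of the same length keeps the group's shape.
    return ls ** 2

# group_keys=False (an assumption, see above) keeps the original index.
squared = jobData.groupby("candidate_id", group_keys=False)["TitleLevel"].apply(square)
print(squared)
```

The result has one row per input row rather than one row per group, which is the situation where apply (or transform) fits better than agg.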
The transform method is also useful in this situation, when you want to transform every value in the group and thus need to return something of the same shape. But the result can be different from that of apply, since a different object may be passed to foo. For example, each column of a grouped dataframe would be passed to foo when using transform, while the entire group would be passed to foo when using apply. The easiest way to understand this is to experiment with a simple dataframe and the generic foo.
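The same-shape behaviour of transform can be sketched as follows. This example only shows the shape and index of the result on a single grouped column; it does not try to show which object is passed to the function. The lambda is an illustrative choice, not part of the original answer.

```python
import pandas as pd

jobData = pd.DataFrame(
    {'candidate_id': [1, 2, 2, 2],
     'TitleLevel': [2, 1, 2, 1]})

# For every row, compute how far its TitleLevel is above the
# lowest TitleLevel of the same candidate.
above_min = jobData.groupby("candidate_id")["TitleLevel"].transform(
    lambda s: s - s.min())
print(above_min)
```

Because the result is aligned with the original index, it can be assigned straight back as a new column of jobData.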
The agg method calls foo once for every group, but unlike apply it should return a single value per group, so each group is aggregated into one value. A typical use case for agg is when you want to count the number of items in the group.
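Counting the items in each group might look like the sketch below. The helper count_items is a made-up name for this example; the built-in size method is shown alongside it as the more usual way to get the same counts.

```python
import pandas as pd

jobData = pd.DataFrame(
    {'candidate_id': [1, 2, 2, 2],
     'TitleLevel': [2, 1, 2, 1]})

def count_items(ls):
    # Collapse one group's Series into a single number.
    return len(ls)

grouped = jobData.groupby("candidate_id")["TitleLevel"]
counts = grouped.agg(count_items)   # one value per candidate_id
sizes = grouped.size()              # built-in equivalent
print(counts)
```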
You can debug and understand what went wrong with your original code by using a generic foo function (here a variant that prints each ls followed by a separator line):
In [30]: grouped['TitleLevel'].apply(foo)
0 2
Name: 1, dtype: int64
--------------------------------------------------------------------------------
1 1
2 2
3 1
Name: 2, dtype: int64
--------------------------------------------------------------------------------
Out[30]:
candidate_id
1 None
2 None
dtype: object
This shows you the Series that are being passed to foo. Notice that in the second Series the index values are 1, 2, and 3. So ls[0] raises a KeyError, since there is no label with value 0 in the second Series.
What you really want is the first item in the Series. That is what iloc is for.
So to summarize, use ls[label] to select the row of a Series whose index value is label. Use ls.iloc[n] to select the nth row of the Series (counting from 0).
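The difference between the two lookups can be checked on a Series that mimics the second group from the output above (values 1, 2, 1 with index labels 1, 2, 3). This is a minimal sketch; the variable names are chosen for this example only.

```python
import pandas as pd

# Same values and index labels as the second group shown above.
ls = pd.Series([1, 2, 1], index=[1, 2, 3], name="TitleLevel")

try:
    ls[0]                      # label lookup: there is no label 0
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

by_label = ls[2]               # row whose index label is 2
by_position = ls.iloc[2]       # third row, regardless of labels
first = ls.iloc[0]
last = ls.iloc[-1]
print(label_lookup_failed, by_label, by_position, first, last)
```

Here ls[2] and ls.iloc[2] pick different rows, which is why the positional version is the one needed for "first" and "last".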
Thus, to fix your code with the least amount of change, you could use
def promotion(ls):
    pro = 0
    if len(ls) > 1:
        for j in range(1, len(ls)):
            if ls.iloc[j] > ls.iloc[j-1]:
                pro += 1
    return pro

def growth(ls):
    head = ls.iloc[0]
    tail = ls.iloc[len(ls)-1]
    gro = tail - head
    return gro
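As a quick sanity check, the minimally changed functions can be run against the sample data from the question. This sketch repeats the two functions so that it runs on its own; the expected values are the ones listed in the question.

```python
import pandas as pd

def promotion(ls):
    pro = 0
    if len(ls) > 1:
        for j in range(1, len(ls)):
            if ls.iloc[j] > ls.iloc[j-1]:
                pro += 1
    return pro

def growth(ls):
    return ls.iloc[len(ls)-1] - ls.iloc[0]

jobData = pd.DataFrame(
    {'candidate_id': [1, 2, 2, 2],
     'TitleLevel': [2, 1, 2, 1]})

grouped = jobData.groupby("candidate_id")["TitleLevel"]
titlePromotion = grouped.agg(promotion)
titleGrowth = grouped.agg(growth)
print(titlePromotion)
print(titleGrowth)
```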