问题描述
我有一个像这样的数据集
I have a dataset like this
Firstnames = ['AA','BB','CC','AA','CC']
Lastnames = ['P', 'Q', 'R', 'P', 'R']
values = [10, 13, 3, 22, 45]
df = pd.DataFrame(data = list(zip(Firstnames,Lastnames,values)), \
columns=['Firstnames','Lastnames','values'])
df
Firstnames Lastnames values
0 AA P 10
1 BB Q 13
2 CC R 3
3 AA P 22
4 CC R 45
我有一个这样的元组数组
I have an array of tuple like this
lst = array([('AA', 'P'), ('BB', 'Q')])
我想对df进行子集化,这样Firstname == 'AA' & Lastnames == 'P'
或Firstname == 'BB' & Lastnames == 'Q'
I want to subset df, such that Firstname == 'AA' & Lastnames == 'P'
or Firstname == 'BB' & Lastnames == 'Q'
我可以手动执行此操作,但是我的数组很大,我想以编程方式进行操作
I can do this manually, however my array is quite large and I want to do it programmatically
我的预期输出将是
Firstnames Lastnames values
AA P 10
AA P 22
BB Q 13
推荐答案
agg
+ isin
由于元组是可哈希的,因此可以使用isin
并将汇总的值与last
进行比较.直接使用lst
和列表而不是np.array
有帮助.
agg
+isin
Since tuples are hashable, you can use isin
and compare the aggregated values to your last
. Using lst
and a list directly instead of np.array
helps.
>>> lst = [('AA', 'P'),
('BB', 'Q')]
>>> mask = df[['Firstnames', 'Lastnames']].agg(tuple, 1).isin(lst)
>>> df[mask]
Firstnames Lastnames values
0 AA P 10
1 BB Q 13
3 AA P 22
如果需要,可以按名称sort_values
If you want, you can sort_values
by the names
>>> df[mask].sort_values(by=['Firstnames', 'Lastnames'])
Firstnames Lastnames values
0 AA P 10
3 AA P 22
1 BB Q 13
pd.concat
对于较小的lst
s,您还可以使用列表理解和pd.concat
pd.concat
You can also use a list comprehension and pd.concat
for smaller lst
s
>>> pd.concat([df[df.Firstnames.eq(a) & df.Lastnames.eq(b)] for a,b in lst])
Firstnames Lastnames values
0 AA P 10
3 AA P 22
1 BB Q 13
时间:
Timings:
小lst
,大df
df = pd.concat([df]*10000).reset_index(drop=True)
%timeit mask = df[['Firstnames', 'Lastnames']].agg(tuple, 1).isin(lst); df[mask].sort_values(by=['Firstnames', 'Lastnames'])
942 ms ± 71.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.concat([df[df.Firstnames.eq(a) & df.Lastnames.eq(b)] for a,b in lst])
16.2 ms ± 355 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
对于大lst
和小df
c = list(map(''.join, itertools.product(string.ascii_uppercase, string.ascii_uppercase)))
lst = [(a,b) for a,b in zip(c, list(string.ascii_uppercase)*26)]
df = pd.DataFrame({'Firstnames': c, 'Lastnames': list(string.ascii_uppercase)*26, 'values': 10})
%timeit mask = df[['Firstnames', 'Lastnames']].agg(tuple, 1).isin(lst); df[mask].sort_values(by=['Firstnames', 'Lastnames'])
15.1 ms ± 301 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.concat([df[df.Firstnames.eq(a) & df.Lastnames.eq(b)] for a,b in lst])
781 ms ± 33.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
这篇关于基于元组的子集 pandas 数据帧的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!