Sorting/selecting the unique and most recent data

Problem description

I'm trying to pick the most relevant value for each column of my data. I worked out how to get the oldest and the most recent dates using the max and min functions in pandas, but I couldn't figure out the rest. I want one row per unique company and product, with the remaining columns filled in according to the rules below. If anyone could point me to the Python tools used for this kind of problem, or give some guidance on how it is usually approached, that would be very helpful.

for security_level, supersevere > severe > moderate > material > minor
for rating, take true if the same company and product has both true and false
for rating_level, critical > high > medium > low
for first_release, the oldest date, and for last_release, the most recent date
for score, the highest score among rows with the same product and company

company	product	security_level	rating	rating_level	first_release	last_release	score
google	mobile	minor	TRUE	critical	04/11/2020	03/17/2020	0.5
google	os	moderate	FALSE	medium	09/05/2019	03/20/2021	0.009
google	os	minor	FALSE	low	09/04/2019	05/11/2020	19
google	tv	severe	TRUE	high	08/11/2020	03/04/2021
google	mobile	supersevere	FALSE	medium	04/06/2015	08/19/2020	2.4
google	mobile	minor	FALSE	high	08/08/2019	08/19/2020	1.3
apple	iphone	minor	TRUE	low	02/03/2020	10/13/2020	3
apple	iphone	material	TRUE	medium	01/21/2018	03/04/2021	6
apple	iwatch	material	FALSE	low	04/11/2015	08/13/2020	8
apple	iphone	material	TRUE	medium	10/20/2020	03/19/2021	5
dell	laptop	minor	FALSE	low	01/05/2021	03/20/2021	1
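To experiment with the sample rows, they first need to be in a DataFrame. The following is a minimal sketch, assuming the data is available as CSV text and that the column names match the ones used in the answer's code (company, product, security_level, rating, rating_level, first_release, last_release, score). The `RAW` variable and the comma-separated layout are illustrative choices, not part of the original question.

```python
from io import StringIO

import pandas as pd

# Sample rows from the question, rewritten as CSV (assumed layout).
# The google/tv row has no score, so its last field is left empty.
RAW = """\
company,product,security_level,rating,rating_level,first_release,last_release,score
google,mobile,minor,TRUE,critical,04/11/2020,03/17/2020,0.5
google,os,moderate,FALSE,medium,09/05/2019,03/20/2021,0.009
google,os,minor,FALSE,low,09/04/2019,05/11/2020,19
google,tv,severe,TRUE,high,08/11/2020,03/04/2021,
google,mobile,supersevere,FALSE,medium,04/06/2015,08/19/2020,2.4
google,mobile,minor,FALSE,high,08/08/2019,08/19/2020,1.3
apple,iphone,minor,TRUE,low,02/03/2020,10/13/2020,3
apple,iphone,material,TRUE,medium,01/21/2018,03/04/2021,6
apple,iwatch,material,FALSE,low,04/11/2015,08/13/2020,8
apple,iphone,material,TRUE,medium,10/20/2020,03/19/2021,5
dell,laptop,minor,FALSE,low,01/05/2021,03/20/2021,1
"""

# read_csv turns TRUE/FALSE into booleans and the empty score into NaN;
# the two date columns are still plain text at this point.
df = pd.read_csv(StringIO(RAW))
print(df.shape)
print(df.dtypes)
```

Loaded this way, the missing score shows up as NaN rather than as an empty string, which matters later when taking the maximum score per group.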

Expected output:

company	product	security_level	rating	rating_level	first_release	last_release	score
google	mobile	supersevere	TRUE	critical	04/06/2015	08/19/2020	2.4
google	os	moderate	FALSE	medium	09/04/2019	03/20/2021	19
google	tv	severe	TRUE	high	08/11/2020	03/04/2021
apple	iphone	material	TRUE	medium	01/21/2018	03/19/2021	6
apple	iwatch	material	FALSE	low	04/11/2015	08/13/2020	8
dell	laptop	minor	FALSE	low	01/05/2021	03/20/2021	1

Solution

Change the dtype of first_release and last_release columns to datetime

import pandas as pd
# Parse the date strings so that min/max compare chronologically
df['last_release']  = pd.to_datetime(df['last_release'])
df['first_release'] = pd.to_datetime(df['first_release'])
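Why the conversion matters can be seen on a single group. This is a small sketch using the three last_release values of apple/iphone from the question. The explicit `format='%m/%d/%Y'` argument is an assumption about the date layout (month first), added here to avoid any ambiguity in parsing.

```python
import pandas as pd

# last_release values for apple/iphone, as they appear in the question.
raw = ['10/13/2020', '03/04/2021', '03/19/2021']

# Compared as text, the month digits are compared first,
# so the October 2020 date looks like the "largest" value.
latest_as_text = max(raw)

# Parsed as dates, the comparison is chronological.
parsed = pd.to_datetime(raw, format='%m/%d/%Y')
latest_as_date = parsed.max()

print(latest_as_text)   # 10/13/2020
print(latest_as_date)   # 2021-03-19 00:00:00
```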

Convert the columns security_level and rating_level to ordered categorical type

# Categories are listed from lowest to highest; ordered=True makes min/max follow that order
df['rating_level'] = pd.Categorical(df['rating_level'], ['low', 'medium', 'high', 'critical'], ordered=True)
df['security_level'] = pd.Categorical(df['security_level'], ['minor', 'material', 'moderate', 'severe', 'supersevere'], ordered=True)
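The effect of the ordered categorical type can be illustrated on the rating_level values of google/mobile. The sketch below compares the maximum of the plain strings with the maximum of the ordered categorical. The variable names (`levels`, `plain`, `ranked`, `typo`) are made up for this illustration.

```python
import pandas as pd

levels = ['low', 'medium', 'high', 'critical']  # lowest to highest

# rating_level values for google/mobile, as they appear in the question.
plain = pd.Series(['critical', 'medium', 'high'])
ranked = pd.Series(pd.Categorical(plain, categories=levels, ordered=True))

print(max(plain))     # medium   (alphabetical: 'm' sorts after 'c' and 'h')
print(ranked.max())   # critical (position in `levels` decides)

# A label that is not in the category list silently becomes NaN,
# so misspelled labels need to be fixed before converting.
typo = pd.Categorical(['superservere'],
                      categories=['severe', 'supersevere'], ordered=True)
print(typo.isna().all())   # True
```

The last three lines are worth noting for real data: a misspelling such as "superservere" would not raise an error, it would simply be treated as missing.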

Group the dataframe on columns company and product and aggregate the remaining columns with the corresponding aggregation functions specified in agg_dict

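# One aggregation per column: the highest level, True over False,
# the earliest first_release, the latest last_release, the highest score
agg_dict = {'security_level': 'max', 'rating': 'max', 'rating_level': 'max',
            'first_release': 'min', 'last_release': 'max', 'score': 'max'}

# sort=False keeps the groups in order of first appearance
out = df.groupby(['company', 'product'], as_index=False, sort=False).agg(agg_dict)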
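Putting the three steps together, here is a self-contained sketch that runs on four of the question's rows (the three google/mobile records and the google/tv record). The DataFrame is built inline only so that the example can run on its own, and `format='%m/%d/%Y'` is again an assumption about the date layout.

```python
import pandas as pd

# Four rows from the question: three google/mobile records and one google/tv.
df = pd.DataFrame({
    'company': ['google', 'google', 'google', 'google'],
    'product': ['mobile', 'tv', 'mobile', 'mobile'],
    'security_level': ['minor', 'severe', 'supersevere', 'minor'],
    'rating': [True, True, False, False],
    'rating_level': ['critical', 'high', 'medium', 'high'],
    'first_release': ['04/11/2020', '08/11/2020', '04/06/2015', '08/08/2019'],
    'last_release': ['03/17/2020', '03/04/2021', '08/19/2020', '08/19/2020'],
    'score': [0.5, None, 2.4, 1.3],  # the tv row has no score
})

# Step 1: parse the dates.
for col in ['first_release', 'last_release']:
    df[col] = pd.to_datetime(df[col], format='%m/%d/%Y')

# Step 2: ordered categoricals, listed from lowest to highest.
df['rating_level'] = pd.Categorical(
    df['rating_level'], ['low', 'medium', 'high', 'critical'], ordered=True)
df['security_level'] = pd.Categorical(
    df['security_level'],
    ['minor', 'material', 'moderate', 'severe', 'supersevere'], ordered=True)

# Step 3: one aggregation per column, grouped by company and product.
agg_dict = {'security_level': 'max', 'rating': 'max', 'rating_level': 'max',
            'first_release': 'min', 'last_release': 'max', 'score': 'max'}
out = df.groupby(['company', 'product'], as_index=False, sort=False).agg(agg_dict)

print(out)
```

Taking `'max'` of the boolean rating column implements the "true wins" rule because True compares greater than False, and the score maximum skips missing values, so only a group with no score at all ends up as NaN.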

Result

>>> out

  company product security_level  rating rating_level first_release last_release  score
0  google  mobile    supersevere    True     critical    2015-04-06   2020-08-19    2.4
1  google      os       moderate   False       medium    2019-09-04   2021-03-20   19.0
2  google      tv         severe    True         high    2020-08-11   2021-03-04    NaN
3   apple  iphone       material    True       medium    2018-01-21   2021-03-19    6.0
4   apple  iwatch       material   False          low    2015-04-11   2020-08-13    8.0
5    dell  laptop          minor   False          low    2021-01-05   2021-03-20    1.0
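The result shows the dates in ISO form, while the expected output in the question uses MM/DD/YYYY. If the original layout is wanted back, the datetime columns can be formatted as text after aggregating. A minimal sketch, using a two-row stand-in for `out` so that it runs on its own:

```python
import pandas as pd

# Stand-in for two rows of the aggregated result.
out = pd.DataFrame({
    'product': ['mobile', 'iphone'],
    'first_release': pd.to_datetime(['2015-04-06', '2018-01-21']),
    'last_release': pd.to_datetime(['2020-08-19', '2021-03-19']),
})

# Format the datetime columns back to MM/DD/YYYY text for display.
for col in ['first_release', 'last_release']:
    out[col] = out[col].dt.strftime('%m/%d/%Y')

print(out)
```

This is best done as the very last step, since the formatted columns are strings again and no longer sort chronologically.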
