Problem description
I am trying to remove the repeated/duplicate names that appear in the NAME column. I want to keep only the first occurrence of each repeated name, using a Python script.
This is my input Excel file:
And I need output like this:
Recommended answer
This isn't removing duplicates per se; you're just replacing repeated keys in one column with blanks. I would handle this as follows:
Create a mask that holds a True/False boolean for each row, depending on whether the row's name is equal to the name in the row above.
Assuming your dataframe is called df:
mask = df['NAME'].ne(df['NAME'].shift())
df.loc[~mask,'NAME'] = ''
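Note that the mask above only blanks a name when it repeats the row directly above it. If the same name could reappear later in a non-adjacent row, one possible alternative (not part of the original answer) is Series.duplicated(), which flags every occurrence after the first. A minimal sketch with made-up sample data:

```python
import pandas as pd

# Hypothetical sample: 'Rekha' reappears in a non-adjacent row.
df = pd.DataFrame({'NAME': ['Rekha', 'Rekha', 'Jaya', 'Rekha']})

# shift-based check: only catches a repeat of the row directly above
adjacent_repeat = df['NAME'].eq(df['NAME'].shift())

# duplicated(): catches every occurrence after the first, adjacent or not
any_repeat = df['NAME'].duplicated()

# Series.mask(cond, other) replaces values where cond is True
blanked_adjacent = df['NAME'].mask(adjacent_repeat, '')
blanked_all = df['NAME'].mask(any_repeat, '')

print(blanked_adjacent.tolist())  # ['Rekha', '', 'Jaya', 'Rekha']
print(blanked_all.tolist())       # ['Rekha', '', 'Jaya', '']
```

Which variant is appropriate depends on whether the sheet is already grouped by name, as it appears to be in the question.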
Explanation:
What we are doing above is the following.
First we select a single column (in pandas terminology, a Series), then apply .ne (not equal to), which in effect is !=.
Let's see that in action.
import pandas as pd
import numpy as np
# create data for dataframe
names = ['Rekha', 'Rekha','Jaya','Jaya','Sushma','Nita','Nita','Nita']
defaults = ['','','c-default','','','c-default','','']
classes = ['forth','third','foruth','fifth','fourth','third','fifth','fourth']
Now, let's create a dataframe similar to yours.
df = pd.DataFrame({'NAME': names,
                   'DEFAULT': defaults,
                   'CLASS': classes,
                   # being lazy with your age and group variables: random placeholder values
                   'AGE': [np.random.randint(1, 5) for _ in names],
                   'GROUP': [np.random.randint(1, 5) for _ in names]})
So, if we did df['NAME'].ne('Omar'), which is the same as df['NAME'] != 'Omar', we would get:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
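Since 'Omar' is not in the data, every row comes back True. To see a False, compare against a name that is actually present. A small sketch using the same list of names:

```python
import pandas as pd

names = pd.Series(['Rekha', 'Rekha', 'Jaya', 'Jaya', 'Sushma', 'Nita', 'Nita', 'Nita'],
                  name='NAME')

not_jaya = names.ne('Jaya')      # method form
not_jaya_op = names != 'Jaya'    # operator form, same result

print(not_jaya.tolist())
# [True, True, False, False, True, True, True, True]
```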
With that out of the way, we want to check whether the name in row 1 (remember Python is 0-indexed, so row 1 is actually the second physical row) is equal (.eq) to the name in the row above.
We do this by calling .shift (see the pandas documentation for more info).
What this does is shift the values along the index by a given number of rows; let's call this number n.
If we called df['NAME'].shift(1) we would get:
0 NaN
1 Rekha
2 Rekha
3 Jaya
4 Jaya
5 Sushma
6 Nita
7 Nita
We can see here that Rekha has moved down by one row.
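To make the role of n a little more concrete, here is a short sketch on a cut-down Series (the variable names are only for illustration). A positive n moves values down, a negative n moves them up, and the vacated rows are filled with missing values:

```python
import pandas as pd

s = pd.Series(['Rekha', 'Rekha', 'Jaya', 'Jaya'], name='NAME')

down_one = s.shift(1)    # each value moves down one row; row 0 becomes missing
down_two = s.shift(2)    # two rows down; rows 0 and 1 become missing
up_one = s.shift(-1)     # negative n moves values up; the last row becomes missing

print(down_one.isna().tolist())  # [True, False, False, False]
print(down_two.isna().tolist())  # [True, True, False, False]
print(up_one.isna().tolist())    # [False, False, False, True]
```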
Putting it all together:
df['NAME'].ne(df['NAME'].shift())
0 True
1 False
2 True
3 False
4 True
5 True
6 False
7 False
We assign this to a variable of our own called mask; you could call it whatever you want.
We then use .loc, which lets you access your dataframe by labels or by a boolean array (in this instance, a boolean array).
However, we only want to access the rows where the boolean is False, so we use ~, which inverts the logic of our array.
    NAME DEFAULT   CLASS  AGE  GROUP
1  Rekha           third    1      4
3   Jaya           fifth    1      1
6   Nita           fifth    1      2
7   Nita          fourth    1      4
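The selection step can be reproduced on its own with just the NAME column. This is a minimal sketch; the variable name repeats is only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'NAME': ['Rekha', 'Rekha', 'Jaya', 'Jaya',
                            'Sushma', 'Nita', 'Nita', 'Nita']})

mask = df['NAME'].ne(df['NAME'].shift())   # True on the first row of each run of names
repeats = df.loc[~mask]                    # ~ flips True/False, so this keeps only the repeated rows

print(repeats.index.tolist())
# [1, 3, 6, 7]
```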
All we need to do now is set NAME to a blank in these rows, as per your initial requirement, and we are left with:
     NAME    DEFAULT   CLASS  AGE  GROUP
0   Rekha              forth    2      2
1                      third    1      4
2    Jaya  c-default   forth    3      3
3                      fifth    1      1
4  Sushma             fourth    3      1
5    Nita  c-default   third    4      2
6                      fifth    1      2
7                     fourth    1      4
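To tie this back to the Excel file in the question, the steps can be wrapped in a small helper. This is only a sketch: the function name blank_repeated and the file names input.xlsx and output.xlsx are hypothetical, and the commented-out Excel calls assume an Excel engine such as openpyxl is installed.

```python
import pandas as pd

def blank_repeated(df, column='NAME'):
    """Return a copy of df with consecutive repeats in `column` replaced by ''."""
    out = df.copy()
    first_of_run = out[column].ne(out[column].shift())
    out.loc[~first_of_run, column] = ''
    return out

# Small in-memory example
sample = pd.DataFrame({'NAME': ['Rekha', 'Rekha', 'Jaya'],
                       'CLASS': ['forth', 'third', 'fifth']})
result = blank_repeated(sample)
print(result['NAME'].tolist())  # ['Rekha', '', 'Jaya']

# With a real workbook (file names are placeholders):
# df = pd.read_excel('input.xlsx')
# blank_repeated(df).to_excel('output.xlsx', index=False)
```

Working on a copy leaves the original dataframe untouched, which makes it easier to compare the data before and after.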
Hope that helps!