This article covers how to handle repeated/duplicate values in an Excel column using Python.

Problem description

I am trying to remove the repeated/duplicate names that appear under the NAME column. I want to keep only the first occurrence of each repeated name, using a Python script.

This is my input Excel file:

And I need output like this:

Recommended answer

This isn't removing duplicates per se; you are just replacing repeated keys in one column with blanks. I would handle it as follows:

Create a mask that holds a True/False boolean for each row: True if the name differs from the row above, False if it is the same.

Assuming your dataframe is called df:

mask = df['NAME'].ne(df['NAME'].shift())

df.loc[~mask,'NAME'] = ''
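As a minimal, self-contained sketch of the two lines above, the snippet below builds a small single-column dataframe (the sample names are taken from the example data used later in this answer) and applies the same mask. The variable name result is only there for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"NAME": ["Rekha", "Rekha", "Jaya", "Jaya", "Sushma", "Nita", "Nita", "Nita"]}
)

# True where the name differs from the row above.
# The first row is compared against a missing value, so it is always True.
mask = df["NAME"].ne(df["NAME"].shift())

# Blank out every row whose name repeats the one directly above it.
df.loc[~mask, "NAME"] = ""

result = df["NAME"].tolist()
print(result)
```

Only the first row of each consecutive run keeps its name; the rest become empty strings.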

Explanation:

Here is what the code above does.

First we select a single column (in pandas terminology, a Series), then apply .ne (not equal to), which in effect is !=.

Let's see it in action.

import pandas as pd
import numpy as np
# create data for dataframe
names = ['Rekha', 'Rekha','Jaya','Jaya','Sushma','Nita','Nita','Nita']
defaults = ['','','c-default','','','c-default','','']
classes = ['forth','third','foruth','fifth','fourth','third','fifth','fourth']

Now, let's create a dataframe similar to yours.

df = pd.DataFrame({'NAME': names,
                   'DEFAULT': defaults,
                   'CLASS': classes,
                   # random placeholder values stand in for your real AGE and GROUP columns
                   'AGE': [np.random.randint(1, 5) for _ in names],
                   'GROUP': [np.random.randint(1, 5) for _ in names]})

So, if we ran df['NAME'].ne('Omar'), which is the same as df['NAME'] != 'Omar', we would get:

0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
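To make the equivalence between .ne and != concrete, here is a small sketch on a three-row Series. The Series and the variable names (via_method, via_operator, same) are made up for illustration; both spellings should produce the same boolean Series.

```python
import pandas as pd

names = pd.Series(["Rekha", "Rekha", "Jaya"], name="NAME")

via_method = names.ne("Rekha")     # method form
via_operator = names != "Rekha"    # operator form

# Series.equals checks that both results hold the same values
same = via_method.equals(via_operator)
print(via_method.tolist(), same)
```

The method form is convenient when chaining calls, which is why it is used with .shift() in this answer.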

With that out of the way, we want to check whether the name in row 1 (remember that Python is 0-indexed, so row 1 is actually the 2nd physical row) is equal to the name in the row above.

We do this by calling .shift (see the pandas documentation for more details).

This shifts the values along the index by a given number of rows; let's call this number n.

If we called df['NAME'].shift(1), we would get:

0       NaN
1     Rekha
2     Rekha
3      Jaya
4      Jaya
5    Sushma
6      Nita
7      Nita

We can see here that Rekha has moved down by one row.
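The effect of n can be sketched on a shorter Series. The names down_one, up_one, and down_two are illustrative only; a positive n moves values down, a negative n moves them up, and the vacated positions are filled with missing values.

```python
import pandas as pd

names = pd.Series(["Rekha", "Rekha", "Jaya", "Jaya"])

down_one = names.shift(1)    # values move down one row; row 0 becomes missing
up_one = names.shift(-1)     # values move up one row; the last row becomes missing
down_two = names.shift(2)    # values move down two rows; rows 0 and 1 become missing

print(down_one.tolist())
print(up_one.tolist())
print(down_two.tolist())
```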

Putting it all together:

df['NAME'].ne(df['NAME'].shift())
0     True
1    False
2     True
3    False
4     True
5     True
6    False
7    False
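One thing worth noting is that comparing against .shift() only detects a name that repeats the row directly above it. If the same name could reappear further down after other names, Series.duplicated() is a different technique that flags every occurrence after the first, wherever it appears. The sketch below contrasts the two on made-up data; it is an alternative to consider, not part of the original answer.

```python
import pandas as pd

# "Rekha" appears in row 0, then again in rows 2 and 3 after a different name
names = pd.Series(["Rekha", "Jaya", "Rekha", "Rekha"])

# True only where the value equals the row directly above
consecutive_repeat = names.eq(names.shift())

# True for every occurrence after the first, regardless of position
any_repeat = names.duplicated()

print(consecutive_repeat.tolist())
print(any_repeat.tolist())
```

For data that is already grouped by name, as in the question, both approaches flag the same rows.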

We assign this to a variable named mask; you could call it whatever you want.

We then use .loc, which lets you access your dataframe by labels or by a boolean array; in this instance, a boolean array.

However, we only want to access the rows where the boolean is False, so we use ~, which inverts the logic of our array.

   NAME    DEFAULT    CLASS   AGE  GROUP
1  Rekha              third   1    4
3  Jaya               fifth   1    1
6  Nita               fifth   1    2
7  Nita               fourth  1    4
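A small sketch of the selection step on its own: ~ flips each boolean in the mask, and .loc with the flipped mask returns only the repeated rows. The three-row dataframe and the names inverted and repeats are assumptions made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"NAME": ["Jaya", "Jaya", "Sushma"],
                   "CLASS": ["fourth", "fifth", "fourth"]})

mask = df["NAME"].ne(df["NAME"].shift())   # True, False, True
inverted = ~mask                           # False, True, False

# Boolean indexing: keep only the rows where the inverted mask is True
repeats = df.loc[inverted]
print(repeats)
```

Only the row at index 1 is selected, because it is the only one whose name matches the row above it.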

All we need to do now is change the NAME value in these rows to a blank, as per your initial requirement, and we are left with:

   NAME    DEFAULT    CLASS   AGE  GROUP
0  Rekha              forth   2    2
1                     third   1    4
2  Jaya    c-default  forth   3    3
3                     fifth   1    1
4  Sushma             fourth  3    1
5  Nita    c-default  third   4    2
6                     fifth   1    2
7                     fourth  1    4
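Since the question starts from an Excel file, the steps can be wrapped into a helper and combined with pandas' Excel I/O. This is a rough sketch under stated assumptions: blank_repeats and process_workbook are hypothetical helper names, the file paths passed to process_workbook are whatever your workbook is actually called, and reading or writing .xlsx files typically requires an engine such as openpyxl to be installed.

```python
import pandas as pd


def blank_repeats(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df where consecutive repeats in `column` are blanked."""
    out = df.copy()
    mask = out[column].ne(out[column].shift())
    out.loc[~mask, column] = ""
    return out


def process_workbook(src: str, dst: str, column: str = "NAME") -> None:
    """Read src, blank repeated values in `column`, and write the result to dst."""
    df = pd.read_excel(src)                                 # first sheet by default
    blank_repeats(df, column).to_excel(dst, index=False)    # omit the index column


# In-memory demonstration that does not touch any file
demo = blank_repeats(
    pd.DataFrame({"NAME": ["Nita", "Nita", "Nita"],
                  "CLASS": ["third", "fifth", "fourth"]}),
    "NAME",
)
print(demo)
```

Working on a copy inside blank_repeats leaves the original dataframe untouched, which makes it easier to compare input and output while testing.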

Hope that helps!


08-12 12:33