Problem description
Normally I anonymize my data by using hashlib and the .apply(hash) function.
Now I'm trying a new approach. Imagine I have the following DataFrame df:
import pandas as pd

df = pd.DataFrame({'contributor': ['eric', 'frank', 'john', 'frank', 'barbara'],
                   'amount payed': [10, 28, 49, 77, 31]})
   contributor  amount payed
0         eric            10
1        frank            28
2         john            49
3        frank            77
4      barbara            31
I want to anonymize it by turning the names into person1, person2, etc., like this:
output = pd.DataFrame({'contributor': ['person1', 'person2', 'person3', 'person2', 'person4'],
                       'amount payed': [10, 28, 49, 77, 31]})
   contributor  amount payed
0      person1            10
1      person2            28
2      person3            49
3      person2            77
4      person4            31
So my first thought was to summarize the contributor column so that each name is attached to a unique index, and then use that index as the number after 'person'.
Recommended answer
I think a faster solution is to use factorize to get an integer code for each unique value, add 1 (the codes start at 0), convert the result to a Series, cast it to strings, and prepend the string 'Person':
df['contributor'] = 'Person' + pd.Series(pd.factorize(df['contributor'])[0] + 1).astype(str)
print(df)
   contributor  amount payed
0      Person1            10
1      Person2            28
2      Person3            49
3      Person2            77
4      Person4            31
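One caveat: the one-liner builds the new Series with a fresh 0..n-1 index, so it relies on df having the default index. If the frame has been filtered or re-indexed, the assignment aligns on index labels and can produce NaN. The sketch below breaks the same factorize approach into steps and builds the labels on df's own index. The non-default index and the mapping lookup table are assumptions added for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {"contributor": ["eric", "frank", "john", "frank", "barbara"],
     "amount payed": [10, 28, 49, 77, 31]},
    index=[10, 11, 12, 13, 14],  # hypothetical non-default index, for illustration
)

# factorize returns integer codes (numbered by first appearance, starting at 0)
# together with the unique values those codes refer to
codes, uniques = pd.factorize(df["contributor"])

# Build the labels on df's own index so the assignment lines up row by row
labels = "Person" + pd.Series(codes + 1, index=df.index).astype(str)

# Hypothetical lookup table from original name to label, in case the
# anonymization has to be reproduced or reversed later
mapping = {name: f"Person{i}" for i, name in enumerate(uniques, start=1)}

df["contributor"] = labels
print(df)
```

Because the labels are numbered by first appearance, anyone who can see the original row order can partly infer who is who, and the mapping table is itself sensitive, so it should be stored separately from the anonymized data or discarded.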