I have a Full_Name column in a Pandas DataFrame that contains people's names. For example:
Full_Name
Saumendra Nayak
Pawan Shinde
Arun Chopra
Neil Anderson
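I need to split these names into first, middle, and last names, and I decided to use the HumanName class from the nameparser library.
With my current approach, however, I have to loop over the column and split each name into its components: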
from nameparser import HumanName

# Add blank columns, one per name component
df["title"] = ""
df["first"] = ""
df["middle"] = ""
df["last"] = ""
df["suffix"] = ""
df["nickname"] = ""

# Split the name in each row and save the components in the DataFrame
for i in df.index:
    name = HumanName(df.loc[i, "Full_Name"])  # parse each name only once
    df.loc[i, "title"] = name.title
    df.loc[i, "first"] = name.first
    df.loc[i, "middle"] = name.middle
    df.loc[i, "last"] = name.last
    df.loc[i, "suffix"] = name.suffix
    df.loc[i, "nickname"] = name.nickname
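I'm fairly new to Python, and this loop seems like something I should avoid. Can anyone suggest whether the HumanName library can be used in a vectorized way, so that I can avoid the loop in the code above?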
Best answer
You could first build a name-decomposition function, then zip the components together before assigning the columns.
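from nameparser import HumanName

components = ('title', 'first', 'middle', 'last', 'suffix', 'nickname')

def name_decomp(n):
    # Parse the name once and return its components in a fixed order
    h_n = HumanName(n)
    return tuple(getattr(h_n, comp) for comp in components)

# Transpose the per-row tuples into one tuple per component
rslts = list(zip(*df.Full_Name.map(name_decomp)))
for i, comp in enumerate(components):
    df[comp] = rslts[i]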
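The `zip(*...)` step is what turns one tuple per row into one tuple per component. Here is a minimal sketch in plain Python, using hand-written (first, last) tuples built from the sample names in place of real HumanName output:

```python
# Hand-written per-row results standing in for what name_decomp would return;
# only two components (first, last) are used to keep the example short.
rows = [
    ("Saumendra", "Nayak"),
    ("Pawan", "Shinde"),
    ("Arun", "Chopra"),
]

# Unpacking the rows into zip() transposes them: one tuple per component.
columns = list(zip(*rows))

first_names, last_names = columns
print(first_names)  # ('Saumendra', 'Pawan', 'Arun')
print(last_names)   # ('Nayak', 'Shinde', 'Chopra')
```

Each tuple in `columns` has one entry per row, which is why it can be assigned directly as a DataFrame column.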
A similar demonstration (since I don't have that library):
>>> import pandas as pd
>>> df = pd.DataFrame(dict(strings=['calgary', 'vancouver', 'toronto']))
>>> df
     strings
0    calgary
1  vancouver
2    toronto
>>> class Decomp:
...     def __init__(self, s):
...         self.s = s
...         self.first = s[0]
...         self.last = s[-1]
...         self.len = len(s)
...
>>> components = ('first', 'last', 'len')
>>> def useless_decomp(s):
...     dec_s = Decomp(s)
...     return (getattr(dec_s, comp) for comp in components)
...
>>> rslts = list(zip(*df.strings.map(useless_decomp)))
>>> for i, comp in enumerate(components):
...     df[comp] = rslts[i]
...
>>> df
     strings first last  len
0    calgary     c    y    7
1  vancouver     v    r    9
2    toronto     t    o    7
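Note that `Series.map` still calls the function once per row in Python, so this is a tidier loop rather than true vectorization. If you would rather build all the columns in one step, the same idea can be wrapped in a helper that returns a DataFrame and is then joined back. This is only a sketch: `split_names`, `COMPONENTS`, and `SimpleName` are names made up for illustration, and `SimpleName` is a crude whitespace-splitting stand-in so the example runs without nameparser. In real use you would pass `HumanName` as the `parser`.

```python
import pandas as pd

COMPONENTS = ("title", "first", "middle", "last", "suffix", "nickname")


class SimpleName:
    """Hypothetical stand-in for nameparser.HumanName: whitespace split only."""

    def __init__(self, full_name):
        parts = full_name.split()
        self.title = ""
        self.first = parts[0] if parts else ""
        self.middle = " ".join(parts[1:-1])
        self.last = parts[-1] if len(parts) > 1 else ""
        self.suffix = ""
        self.nickname = ""


def split_names(names, parser=SimpleName, components=COMPONENTS):
    """Parse each name once and return one column per component."""
    rows = [
        [getattr(parsed, comp) for comp in components]
        for parsed in map(parser, names)
    ]
    # Reuse the original index so the result lines up when joined back.
    return pd.DataFrame(rows, columns=list(components), index=names.index)


df = pd.DataFrame(
    {"Full_Name": ["Saumendra Nayak", "Pawan Shinde", "Arun Chopra", "Neil Anderson"]}
)
df = df.join(split_names(df["Full_Name"]))
print(df[["Full_Name", "first", "last"]])
```

Building the result as a separate DataFrame keeps the parsing in one place and avoids creating the blank columns up front.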
Source: a similar question on Stack Overflow, "python - Vectorizing the HumanName library for a Pandas DataFrame column": https://stackoverflow.com/questions/42537150/