I have a Full_Name column in a Pandas DataFrame that contains people's names. For example:

Full_Name

Saumendra Nayak
Pawan Shinde
Arun Chopra
Neil Anderson

I need to split these names into first, middle, and last names. I decided to use the HumanName class from the nameparser library.

With my current approach, however, I have to loop over the column and split each name into its components.
from nameparser import HumanName

# add blank columns, one per name component

df["title"] = ""
df["first"] = ""
df["middle"] = ""
df["last"] = ""
df["suffix"] = ""
df["nickname"] = ""

# Split name for each row and save values in dataframe
# (assumes df has the default 0..n-1 index)

for i in range(df.shape[0]):
    df.loc[i, "title"] = HumanName(df.Full_Name.loc[i]).title
    df.loc[i, "first"] = HumanName(df.Full_Name.loc[i]).first
    df.loc[i, "middle"] = HumanName(df.Full_Name.loc[i]).middle
    df.loc[i, "last"] = HumanName(df.Full_Name.loc[i]).last
    df.loc[i, "suffix"] = HumanName(df.Full_Name.loc[i]).suffix
    df.loc[i, "nickname"] = HumanName(df.Full_Name.loc[i]).nickname

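I'm fairly new to Python, and this loop seems like something I would be better off avoiding. Can anyone suggest whether the HumanName library can be used in a vectorized way, so that I can avoid the loop in the code above?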

Best answer

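You could first build a name-decomposition function, map it over the column, and then zip the components together before assigning the new columns.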

components = ('title', 'first', 'middle', 'last', 'suffix', 'nickname')

def name_decomp(n):
    h_n = HumanName(n)
    return (getattr(h_n, comp) for comp in components)

rslts = list(zip(*df.Full_Name.map(name_decomp)))

for i, comp in enumerate(components):
    df[comp] = rslts[i]
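
The zip(*...) step is the part that tends to look cryptic, so here is a minimal pure-Python sketch of what it does. The tuples are written by hand for illustration (they are not HumanName output), and the names per_row, per_column and by_name are made up for this example.

```python
components = ("first", "last")

# One tuple per input row, in the same order as `components`.
# These values are hand-written placeholders, not parser output.
per_row = [
    ("Saumendra", "Nayak"),
    ("Pawan", "Shinde"),
    ("Arun", "Chopra"),
]

# zip(*per_row) regroups the values by position:
# one tuple per component instead of one tuple per row.
per_column = list(zip(*per_row))

# Pair each component name with its tuple of values, which is
# what the assignment loop does with df[comp] = rslts[i].
by_name = dict(zip(components, per_column))

print(per_column)
print(by_name["last"])
```

Each tuple in per_column has one entry per row, so it can be assigned directly as a DataFrame column.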

A similar demo (because I don't have that library)
>>> df = pd.DataFrame(dict(strings=['calgary', 'vancouver', 'toronto']))

>>> df
     strings
0    calgary
1  vancouver
2    toronto

>>> class Decomp:
        def __init__(self, s):
            self.s = s
            self.first = s[0]
            self.last = s[-1]
            self.len = len(s)

>>> components = ('first', 'last', 'len')

>>> def useless_decomp(s):
        dec_s = Decomp(s)
        return (getattr(dec_s, comp) for comp in components)

>>> rslts = list(zip(*df.strings.map(useless_decomp)))

>>> for i, comp in enumerate(components):
        df[comp] = rslts[i]

>>> df
     strings first last  len
0    calgary     c    y    7
1  vancouver     v    r    9
2    toronto     t    o    7
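
As a variation on the demo above, the parsed components can also be collected into a second DataFrame and joined back on, which avoids the transpose step. This is only a sketch: expand_column is a hypothetical helper name, and Decomp is the same toy class as in the demo, standing in for HumanName. Passing HumanName with the six component names from the question should work the same way, since the helper only relies on attribute access, but that is untested here.

```python
import pandas as pd


class Decomp:
    """Toy parser from the demo above; a stand-in for HumanName."""

    def __init__(self, s):
        self.first = s[0]
        self.last = s[-1]
        self.len = len(s)


def expand_column(df, column, parser, components):
    """Return a copy of df with one new column per parsed component.

    Hypothetical helper: `parser` is any callable whose result exposes
    the names in `components` as attributes.
    """
    # Parse each value exactly once, then read every component off it.
    rows = [
        tuple(getattr(parsed, comp) for comp in components)
        for parsed in map(parser, df[column])
    ]
    # Reuse df.index so the new columns line up with the original rows.
    parts = pd.DataFrame(rows, columns=list(components), index=df.index)
    return df.join(parts)


df = pd.DataFrame({"strings": ["calgary", "vancouver", "toronto"]})
result = expand_column(df, "strings", Decomp, ("first", "last", "len"))
print(result)
```

Note that this still calls the parser once per row in Python, just like map in the answer. It tidies up the loop rather than making the parsing itself vectorized.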

Regarding "python - Vectorize HumanName library for Pandas dataframe column", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42537150/
