python - How to find delimited strings in a pandas DataFrame and replace them with new rows

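I am trying to figure out how to solve the following problem:

I have a pandas data frame containing some strings separated by ','. My goal is to find these and replace them with new rows, so that no delimiters remain in the data frame. For example, a cell containing "hi,there" should become "hi" and "there", giving two rows instead of one.

This should be applied until no delimiters remain in the original data frame. If a single row contains two delimited cells in different columns ("hi,there" and "whats,up,there"), it should become 6 rows instead of the original one (the Cartesian product). The same approach should apply to every row in the data frame.

The code below builds the original data frame (a), followed by the result I would like to end up with: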

import pandas as pd
a = pd.DataFrame([['Hi,there', 'fv', 'whats,up,there'],['dfasd', 'vfgfh', 'kutfddx'],['fdfa', 'uyg', 'iutyfrd']], columns = ['a', 'b', 'c'])

Output: [image of the data frame above, not preserved]

Desired output: [image not preserved]

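So far I have managed to duplicate the rows the required number of times, but I cannot figure out how to replace the delimited words with the values I want: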

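# Duplicate each row once per combination of its comma-separated values
ndf = pd.DataFrame([])
for i in a.values:
    n = 1
    for j in i:
        if ',' in j:
            n = n*len(j.split(','))
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    ndf = pd.concat([ndf, pd.DataFrame([i]*n)], ignore_index=False)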

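This produces: [output not preserved]

Any idea how to proceed? I can only use pandas and numpy for this, but I am convinced that should be enough.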

Best answer

First I split on the comma, then use the stack() function:

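# Turn every cell into a list of its comma-separated parts
a_list = a.apply(lambda x : x.str.split(','))

for i in a_list:
    # One row per part, keeping the original row label as the index
    tmp = pd.DataFrame.from_records(a_list[i].tolist()).stack().reset_index(level=1, drop=True).rename('new_{}'.format(i))
    # Joining on the repeated index labels yields the Cartesian product
    a = a.drop(i, axis=1).join(tmp)

a = a.reset_index(drop=True)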

Result:

>>> a
   new_a    new_c  new_b
0     Hi    whats     fv
1     Hi       up     fv
2     Hi    there     fv
3  there    whats     fv
4  there       up     fv
5  there    there     fv
6  dfasd  kutfddx  vfgfh
7   fdfa  iutyfrd    uyg
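For comparison, here is a minimal sketch of a different technique than the one in the answer: `DataFrame.explode`. It assumes pandas 0.25 or newer (where `explode` is available) and that every column holds strings. The names `df` and `out` are illustrative only, and the original column names are kept instead of the `new_` prefix.

```python
import pandas as pd

df = pd.DataFrame(
    [['Hi,there', 'fv', 'whats,up,there'],
     ['dfasd', 'vfgfh', 'kutfddx'],
     ['fdfa', 'uyg', 'iutyfrd']],
    columns=['a', 'b', 'c'],
)

# Turn every cell into a list of its comma-separated parts.
out = df.apply(lambda s: s.str.split(','))

# Exploding one column at a time repeats the other columns' values,
# so each original row ends up as the Cartesian product of its lists.
for col in out.columns:
    out = out.explode(col)

out = out.reset_index(drop=True)
print(out)
```

Because each `explode` call multiplies the rows produced by the previous one, the first row expands to 2 x 1 x 3 = 6 rows, and the other two rows are left as they are.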

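Update

To handle missing values (np.nan and None), first convert them to a placeholder string, run the same steps as for ordinary data, and then replace the placeholder string 'NaN' with np.nan.

Let's insert some missing values: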

import numpy as np
# Assign through a single .loc call to avoid chained assignment
a.loc[0, 'a'] = np.nan
a.loc[1, 'b'] = None

#        a     b               c
# 0    NaN    fv  whats,up,there
# 1  dfasd  None         kutfddx
# 2   fdfa   uyg         iutyfrd

a.fillna('NaN', inplace=True) # some string

#
# insert the code above (with for loop)
#

a.replace('NaN', np.nan, inplace=True)

#    new_a new_b    new_c
# 0    NaN    fv    whats
# 1    NaN    fv       up
# 2    NaN    fv    there
# 3  dfasd   NaN  kutfddx
# 4   fdfa   uyg  iutyfrd
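As a hedged alternative to the placeholder-string round trip, the sketch below uses `DataFrame.explode` (pandas 0.25 or newer), which is not the answer's method. `.str.split` leaves missing cells missing and `explode` passes scalar missing values through unchanged, so no `fillna`/`replace` step is needed. The helper name `expand_delimited` and its `sep` parameter are hypothetical, and every column is assumed to hold strings or missing values.

```python
import numpy as np
import pandas as pd


def expand_delimited(df, sep=','):
    """Hypothetical helper: one row per combination of delimited values."""
    # Missing cells stay missing after .str.split; other cells become lists.
    out = df.apply(lambda s: s.str.split(sep))
    for col in out.columns:
        # explode leaves scalar missing values as a single row.
        out = out.explode(col)
    return out.reset_index(drop=True)


df = pd.DataFrame({
    'a': [np.nan, 'dfasd', 'fdfa'],
    'b': ['fv', None, 'uyg'],
    'c': ['whats,up,there', 'kutfddx', 'iutyfrd'],
})

result = expand_delimited(df)
print(result)
```

This sidesteps the risk that a real cell containing the literal text 'NaN' gets converted to a missing value by the final `replace` call.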

Regarding "python - How to find delimited strings in a pandas DataFrame and replace them with new rows", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51809661/