I have the following DataFrame:

ipdb> csv_data
  country_edited sale_edited  date_edited  transformation_edited
0          India      403171     20090101                     10
1         Bhutan      394096     20090101                     20
2          Nepal    Set Null     20090101                     30
3         madhya      355883     20090101                     40
4          sudan    Set Null     20090101                     50


I want to replace every value equal to 'Set Null' with NaN, so I tried the following:

import numpy

def set_NaN(element):
    if element == 'Set Null':
        return numpy.nan
    else:
        return element

csv_data = csv_data.applymap(lambda element: set_NaN(element))


But this did not change anything:

ipdb> print csv_data
  country_edited sale_edited  date_edited  transformation_edited
0          India      403171     20090101                     10
1         Bhutan      394096     20090101                     20
2          Nepal    Set Null     20090101                     30
3         madhya      355883     20090101                     40
4          sudan    Set Null     20090101                     50
ipdb>


However, when I only print csv_data.applymap(lambda element: set_NaN(element)) as shown below, I can see the expected output. When I assign it back, I do not get the data I want.

ipdb> csv_data.applymap(lambda element: set_NaN(element))
  country_edited sale_edited  date_edited  transformation_edited
0          India      403171     20090101                     10
1         Bhutan      394096     20090101                     20
2          Nepal         NaN     20090101                     30
3         madhya      355883     20090101                     40
4          sudan         NaN     20090101                     50


So how can I replace column values with NaN based on some string?

Best answer

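You need DataFrame.mask, which replaces values with NaN wherever the boolean mask is True. Some of the columns are numeric, so cast the DataFrame to string before comparing it with 'Set Null':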

print (csv_data.astype(str) == 'Set Null')
  country_edited sale_edited date_edited transformation_edited
0          False       False       False                 False
1          False       False       False                 False
2          False        True       False                 False
3          False       False       False                 False
4          False        True       False                 False


csv_data = csv_data.mask(csv_data.astype(str) == 'Set Null')
print (csv_data)
  country_edited sale_edited  date_edited  transformation_edited
0          India      403171     20090101                     10
1         Bhutan      394096     20090101                     20
2          Nepal         NaN     20090101                     30
3         madhya      355883     20090101                     40
4          sudan         NaN     20090101                     50
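
For readers who want to reproduce this without the original CSV, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the printed output in the question; the assumption that sale_edited holds strings while the last two columns hold integers is mine, and the names is_null_marker and cleaned are only illustrative.

```python
import pandas as pd

# Rebuilt from the question's printed output (dtypes are an assumption):
# sale_edited is a string column, the last two columns are integers.
csv_data = pd.DataFrame({
    "country_edited": ["India", "Bhutan", "Nepal", "madhya", "sudan"],
    "sale_edited": ["403171", "394096", "Set Null", "355883", "Set Null"],
    "date_edited": [20090101] * 5,
    "transformation_edited": [10, 20, 30, 40, 50],
})

# Boolean DataFrame: True wherever the cell, viewed as a string, is the marker.
is_null_marker = csv_data.astype(str) == "Set Null"

# mask() returns a new DataFrame with NaN wherever the condition is True.
cleaned = csv_data.mask(is_null_marker)
print(cleaned)
```

Because mask returns a new object, csv_data itself is unchanged here until the result is assigned to a name.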


Another solution uses a numpy boolean mask, built by comparing the numpy array returned by DataFrame.values:

print (csv_data.values == 'Set Null')
[[False False False False]
 [False False False False]
 [False  True False False]
 [False False False False]
 [False  True False False]]

csv_data = csv_data.mask(csv_data.values == 'Set Null')
print (csv_data)
  country_edited sale_edited  date_edited  transformation_edited
0          India      403171     20090101                     10
1         Bhutan      394096     20090101                     20
2          Nepal         NaN     20090101                     30
3         madhya      355883     20090101                     40
4          sudan         NaN     20090101                     50


In your own solution, the result must be assigned back to csv_data:

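import numpy

def set_NaN(element):
    # Map the 'Set Null' marker to NaN; leave every other value unchanged
    if element == 'Set Null':
        return numpy.nan
    else:
        return element

# applymap returns a new DataFrame, so rebind the name to keep the result
csv_data = csv_data.applymap(set_NaN)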
print (csv_data)
  country_edited sale_edited  date_edited  transformation_edited
0          India      403171     20090101                     10
1         Bhutan      394096     20090101                     20
2          Nepal         NaN     20090101                     30
3         madhya      355883     20090101                     40
4          sudan         NaN     20090101                     50
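
As a further option that the answer above does not cover, DataFrame.replace can swap a marker value for NaN directly. The sketch below is only an illustration: the DataFrame is rebuilt by hand from the question's output, with sale_edited assumed to hold strings, and the final pd.to_numeric step is an optional extra for turning that column into numbers.

```python
import numpy as np
import pandas as pd

# Rebuilt from the question's printed output (dtypes are an assumption).
csv_data = pd.DataFrame({
    "country_edited": ["India", "Bhutan", "Nepal", "madhya", "sudan"],
    "sale_edited": ["403171", "394096", "Set Null", "355883", "Set Null"],
    "date_edited": [20090101] * 5,
    "transformation_edited": [10, 20, 30, 40, 50],
})

# replace() also returns a new DataFrame, so the result must be assigned.
cleaned = csv_data.replace("Set Null", np.nan)

# Optional: the remaining entries are numeric strings, so convert the column.
cleaned["sale_edited"] = pd.to_numeric(cleaned["sale_edited"])
print(cleaned)
```

Note that recent pandas releases (2.1 and later) deprecate DataFrame.applymap in favour of DataFrame.map, so the applymap version above may emit a warning there.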

This question, "python - Replace column values with NaN based on some string in pandas", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/41851460/
