I have a Python RDD:
rddstats = rddstats.filter(lambda x : len(x) == NB_LINE or len(x) == NB2_LINE)
I created a DataFrame from this RDD:
logsDF = sqlContext.createDataFrame(rddstats,schema=["column1","column2","column3","column4","column5","column6","column7"])
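I want to run a test on two columns, column6 and column7: if column6 exists in the DataFrame and is not null, I should get back a DataFrame containing the column6 values; otherwise I should get back a DataFrame containing the column7 values. Here is my short piece of code: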
logsDF = sqlContext.createDataFrame(rddstats,schema=["column1","column2","column3","column4","column5","column6","column7"])
if (logsDF['column6'] in rddstats and logsDF['column6'].isNotNull):
    logsDF.select("column1","column2","column3","column4","column5","column6")
else:
    logsz84statsDF.select("column1","column2","column3","column4","column5","column7")
Is this syntax correct, and am I allowed to write it this way in Python?
Best answer
if (logsDF['column6'] in rddstats and logsDF['column6'].isNotNull)
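I'm fairly sure that if column6 does not exist, this will throw an error (a KeyError in pandas; PySpark raises an AnalysisException when it cannot resolve the column name).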
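A separate problem with the quoted condition is worth illustrating: isNotNull is written without parentheses, so the expression refers to the method itself instead of calling it, and a method object is always truthy in Python. The sketch below uses a hypothetical stand-in class, FakeColumn, rather than a real Spark column, purely to show the language behaviour.

```python
class FakeColumn:
    """Hypothetical stand-in for a column object; not part of any library."""

    def __init__(self, values):
        self.values = values

    def isNotNull(self):
        # Per-value null check, standing in for a real column expression.
        return [v is not None for v in self.values]


col = FakeColumn([None, None])

# Without parentheses this is the bound method object, which is always
# truthy, even though every value in the column is None.
uncalled_is_truthy = bool(col.isNotNull)

# With parentheses the method actually runs and reports on the values.
called_result = col.isNotNull()
```

In PySpark the called form, isNotNull(), returns a column expression rather than a Python bool, so it belongs inside filter() or where() and not directly in an if statement.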
You could do the following:
if 'column6' in logsDF.columns:
    # Is there at least one row with a non-null column6?
    # (pandas spells this notnull().any(); a Spark column has no such method)
    if logsDF.filter(logsDF['column6'].isNotNull()).limit(1).count() > 0:
        logsDF.select("column1","column2","column3","column4","column5","column6")
    else:
        logsz84statsDF.select("column1","column2","column3","column4","column5","column7")
else:
    logsz84statsDF.select("column1","column2","column3","column4","column5","column7")
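First, check whether column6 is among the columns of logsDF.
If it is, check whether any of its values is not null.
If column6 does not exist, or it exists but all of its values are null, use column7.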
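The decision described above can be sketched as a small helper that does not depend on Spark. The function name choose_columns and the has_non_null callback are assumptions made for this illustration; the callback is only invoked when column6 is present, which mirrors the nested check.

```python
BASE_COLUMNS = ["column1", "column2", "column3", "column4", "column5"]


def choose_columns(available_columns, has_non_null):
    """Return the column names to select.

    available_columns: names present in the DataFrame (e.g. logsDF.columns).
    has_non_null: zero-argument callable, assumed to return True when
        column6 holds at least one non-null value.
    """
    if "column6" in available_columns:
        if has_non_null():
            return BASE_COLUMNS + ["column6"]
    # column6 is missing, or present but entirely null: fall back to column7.
    return BASE_COLUMNS + ["column7"]


# Plain-Python demonstration, no Spark required.
with_values = choose_columns(BASE_COLUMNS + ["column6", "column7"], lambda: True)
all_null = choose_columns(BASE_COLUMNS + ["column6", "column7"], lambda: False)
missing = choose_columns(BASE_COLUMNS + ["column7"], lambda: True)
```

With a Spark DataFrame, one plausible callback is lambda: logsDF.filter(logsDF['column6'].isNotNull()).limit(1).count() > 0, and the returned names can be passed on as logsDF.select(*names).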
Edit to my own comment:
Since Python does not evaluate the second condition when the first one is False, you can write:
if 'column6' in logsDF.columns and logsDF.filter(logsDF['column6'].isNotNull()).limit(1).count() > 0:
    logsDF.select("column1","column2","column3","column4","column5","column6")
else:
    logsz84statsDF.select("column1","column2","column3","column4","column5","column7")
As long as 'column6' in logsDF.columns comes first, logsDF['column6'] is never evaluated when column6 does not exist, so the missing-column error described above is never thrown.
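The short-circuit behaviour can be checked in plain Python. This is a minimal sketch that uses a dictionary as a stand-in for the DataFrame (so the error is a KeyError); the names row, lookup and lookups are hypothetical and exist only for this illustration.

```python
row = {"column7": "fallback"}  # hypothetical record without column6
lookups = []  # records every key that is actually looked up


def lookup(key):
    lookups.append(key)
    return row[key]  # raises KeyError for a missing key


# Membership test first: lookup() is skipped because the key is absent.
safe = "column6" in row and lookup("column6") is not None
lookups_after_safe = list(lookups)

# Lookup first: the KeyError is raised before the membership test can help.
try:
    unsafe = lookup("column6") is not None and "column6" in row
    raised = False
except KeyError:
    raised = True
```

After the first expression, lookups_after_safe is still empty, which shows that the right-hand operand of and was never evaluated; only the reversed order reaches the failing lookup.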