I have a Python RDD:

rddstats = rddstats.filter(lambda x : len(x) == NB_LINE or len(x) == NB2_LINE)


I created a DataFrame from this RDD:

logsDF = sqlContext.createDataFrame(rddstats,schema=["column1","column2","column3","column4","column5","column6","column7"])


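I want to test two columns, column6 and column7:
If column6 exists in the DataFrame and is not null, I should return a DataFrame containing the column6 values; otherwise I should return a DataFrame containing the column7 values.
Here is my short piece of code: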

logsDF = sqlContext.createDataFrame(rddstats,schema=["column1","column2","column3","column4","column5","column6","column7"])
if (logsDF['column6'] in rddstats and logsDF['column6'].isNotNull):
    logsDF.select("column1","column2","column3","column4","column5","column6")
else:
    logsz84statsDF.select("column1","column2","column3","column4","column5","column7")


Is the syntax correct, and am I allowed to write it this way in Python?

Best answer

if (logsDF['column6'] in rddstats and logsDF['column6'].isNotNull)


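I'm fairly sure this raises an error if column6 doesn't exist: a KeyError with a pandas DataFrame, and typically an AnalysisException with a PySpark DataFrame.

You can do the following instead: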

if 'column6' in logsDF.columns:
    # PySpark: keep rows where column6 is non-null; head(1) is an empty list if there are none
    if logsDF.filter(logsDF['column6'].isNotNull()).head(1):
        logsDF.select("column1","column2","column3","column4","column5","column6")
    else:
        logsz84statsDF.select("column1","column2","column3","column4","column5","column7")
else:
    logsz84statsDF.select("column1","column2","column3","column4","column5","column7")


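First check whether column6 is among the columns of logsDF.
If it is, check whether any of its values is non-null.

If column6 doesn't exist, or it exists but all of its values are null, use column7.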
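The two checks described above can be wrapped in a small helper. This is only a sketch under stated assumptions: `df` is taken to be a PySpark DataFrame (the question builds one with `sqlContext.createDataFrame`), and the function name `select_value_column` and its parameters are hypothetical, chosen for illustration. It relies on `DataFrame.columns`, `DataFrame.filter`, `Column.isNotNull()`, `DataFrame.head(n)` and `DataFrame.select`.

```python
def select_value_column(df, base_cols, preferred="column6", fallback="column7"):
    """Return df restricted to base_cols plus one value column.

    Assumes df is a pyspark.sql.DataFrame. The helper name and its
    parameters are hypothetical, not part of any library.
    """
    if preferred in df.columns:
        # filter() keeps the rows where `preferred` is non-null; head(1)
        # returns a list that is empty (falsy) when no such row exists.
        if df.filter(df[preferred].isNotNull()).head(1):
            return df.select(*base_cols, preferred)
    # Column missing, or present but entirely null: fall back.
    return df.select(*base_cols, fallback)


# Usage, assuming logsDF is the DataFrame from the question:
# result = select_value_column(
#     logsDF, ["column1", "column2", "column3", "column4", "column5"])
```

Unlike the snippets in the thread, the helper returns the selected DataFrame, so the caller can assign the result instead of discarding it.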



Edit, following up on my own comment:
Since Python does not evaluate the second condition of an `and` when the first one is False, you can write:

# PySpark: head(1) is an empty (falsy) list when column6 has no non-null value
if 'column6' in logsDF.columns and logsDF.filter(logsDF['column6'].isNotNull()).head(1):
    logsDF.select("column1","column2","column3","column4","column5","column6")
else:
    logsz84statsDF.select("column1","column2","column3","column4","column5","column7")


As long as 'column6' in logsDF.columns comes first, logsDF['column6'] is never evaluated when column6 doesn't exist, so the missing-column error is never raised.
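The short-circuit behaviour is plain Python and can be seen without any DataFrame at all. The sketch below uses an ordinary dict as a stand-in for a row; the helper name `has_value` is hypothetical.

```python
def has_value(table, key):
    """Return True only if `key` exists in `table` and its value is not None."""
    # `key in table` is evaluated first. When it is False, `and` stops
    # there, so table[key] is never executed and cannot raise KeyError.
    return key in table and table[key] is not None


row = {"column6": None, "column7": "fallback"}

print(has_value(row, "column7"))  # True: key present, value not None
print(has_value(row, "column6"))  # False: key present, value is None
print(has_value(row, "column8"))  # False: key absent, no KeyError

# Reversing the order evaluates the lookup first and fails on a missing key.
try:
    row["column8"] is not None and "column8" in row
except KeyError as exc:
    print("KeyError:", exc)
```

The same ordering rule is what makes the membership test safe to combine with the column lookup in a single `if`.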
