Problem description
Essentially, I need to compare two dataframes. I can already compare their column names with:
def diff(first, second):
    second = set(second)
    return [item for item in first if item not in second]
But I want to compare not only the names but also the data types.
The sample schemas are shown below:
>>> pDF1.schema
StructType(
    List(
        StructField(Scen_Id,IntegerType,true),
        StructField(Flow_Direction,StringType,true),
        StructField(Dataset_Type,StringType,true),
        StructField(Flag_Extrapolation_Percent_Change_Stay,IntegerType,true)
    )
)
>>> pDF2.schema
StructType(
    List(
        StructField(Scen_Id,StringType,true),
        StructField(Flow_Direction,StringType,true),
        StructField(Dataset_Type,StringType,true),
        StructField(Flag_Extrapolation_Percent_Change_Stay,IntegerType,true)
    )
)
As this simplified example shows (our real dataframes often contain over 100 fields), pDF2 has the same names and data types as pDF1, except for the first field, Scen_Id, which is IntegerType in pDF1 and StringType in pDF2.
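To make the gap concrete, here is a small sketch of the name-only comparison applied to the two sample schemas. The name lists are typed in by hand from the schemas above (with PySpark they would presumably come from `pDF1.columns` and `pDF2.columns`), so it runs without Spark:

```python
def diff(first, second):
    # Items of `first` that do not appear in `second`, in original order.
    second = set(second)
    return [item for item in first if item not in second]

# Column names copied from the sample schemas; both dataframes use the same names.
names1 = ["Scen_Id", "Flow_Direction", "Dataset_Type",
          "Flag_Extrapolation_Percent_Change_Stay"]
names2 = ["Scen_Id", "Flow_Direction", "Dataset_Type",
          "Flag_Extrapolation_Percent_Change_Stay"]

only_in_first = diff(names1, names2)
only_in_second = diff(names2, names1)
print(only_in_first)   # []
print(only_in_second)  # []
```

Both lists come back empty, so the type change on Scen_Id goes unnoticed when only names are compared.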
Thank you very much.
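OK, the answer turns out to be straightforward; here it is for future readers. Running the same diff on schema.fields instead of on column names compares whole StructField objects, so a column whose data type differs is reported as a difference: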
import pprint

def diff(first, second):
    # Return the items of `first` that are not in `second`, preserving order.
    second = set(second)
    return [item for item in first if item not in second]

# Compare StructField objects rather than bare column names.
dl1_fields = list(pDF1.schema.fields)
dl2_fields = list(pDF2.schema.fields)

print("=========================================================")
print("schema comparison result:")
print("=========================================================")
dl1Notdl2 = diff(dl1_fields, dl2_fields)
print(str(len(dl1Notdl2)) + " columns in first df but not in second")
pprint.pprint(dl1Notdl2)
print("=========================================================")
dl2Notdl1 = diff(dl2_fields, dl1_fields)
print(str(len(dl2Notdl1)) + " columns in second df but not in first")
pprint.pprint(dl2Notdl1)
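With the approach above, a column whose type changed appears in both result lists. If it is more convenient to see missing columns and type mismatches separately, one possible variation is sketched below. It is not part of the original answer: `compare_schemas` is a hypothetical helper name, and it works on plain (name, type) pairs of the kind PySpark's `DataFrame.dtypes` returns, so the sample lists here are typed in by hand from the schemas in the question and the sketch runs without Spark.

```python
def compare_schemas(first, second):
    """Compare two lists of (column_name, type_name) pairs.

    Hypothetical helper for illustration. Returns the names found only in
    `first`, the names found only in `second`, and a dict that maps each
    shared name with differing types to (type_in_first, type_in_second).
    """
    first_types = dict(first)
    second_types = dict(second)
    only_in_first = [name for name in first_types if name not in second_types]
    only_in_second = [name for name in second_types if name not in first_types]
    type_mismatches = {
        name: (first_types[name], second_types[name])
        for name in first_types
        if name in second_types and first_types[name] != second_types[name]
    }
    return only_in_first, only_in_second, type_mismatches


# Hand-written stand-ins for pDF1.dtypes and pDF2.dtypes (assumed inputs).
schema1 = [("Scen_Id", "int"), ("Flow_Direction", "string"),
           ("Dataset_Type", "string"),
           ("Flag_Extrapolation_Percent_Change_Stay", "int")]
schema2 = [("Scen_Id", "string"), ("Flow_Direction", "string"),
           ("Dataset_Type", "string"),
           ("Flag_Extrapolation_Percent_Change_Stay", "int")]

only1, only2, mismatches = compare_schemas(schema1, schema2)
print(only1)       # []
print(only2)       # []
print(mismatches)  # {'Scen_Id': ('int', 'string')}
```

Unlike comparing StructField objects, this sketch ignores nullability and metadata; whether that matters depends on what counts as "the same schema" for your data.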