How to compare two DataFrames (StructType) in Python

Problem description

Essentially, I want to compare two DataFrames. I can already compare their column names with:

def diff(first, second):
    second = set(second)
    return [item for item in first if item not in second]

However, I want to compare not only the column names but also the data types.

Sample DataFrame schemas are shown below:

>>> pDF1.schema
StructType(
List(
StructField(Scen_Id,IntegerType,true),
StructField(Flow_Direction,StringType,true),
StructField(Dataset_Type,StringType,true),
StructField(Flag_Extrapolation_Percent_Change_Stay,IntegerType,true)
)
)

>>> pDF2.schema
StructType(
List(
StructField(Scen_Id,StringType,true),
StructField(Flow_Direction,StringType,true),
StructField(Dataset_Type,StringType,true),
StructField(Flag_Extrapolation_Percent_Change_Stay,IntegerType,true)
)
)

As this simplified example shows (in practice our DataFrames often contain over 100 fields), pDF2 has the same names and data types as pDF1, except for the first field, Scen_Id, which has a different data type (StringType instead of IntegerType).

Thank you very much.

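Solution

OK, the answer turns out to be straightforward. Applying the same diff function to the schemas' StructField objects, rather than to the column names, compares the data types as well, because two StructField objects are equal only if their name, data type, and nullability all match. For future readers' reference: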

import pprint

def diff(first, second):
    # Return the items of first that do not appear in second.
    second = set(second)
    return [item for item in first if item not in second]

# Each schema field is a StructField carrying the column name and data type.
dl1_fields = list(pDF1.schema.fields)
dl2_fields = list(pDF2.schema.fields)

print("=========================================================")
print("schema comparison result:")
print("=========================================================")
# Fields present in the first schema but missing or different in the second.
dl1Notdl2 = diff(dl1_fields, dl2_fields)
print(str(len(dl1Notdl2)) + " columns in first df but not in second")
pprint.pprint(dl1Notdl2)
print("=========================================================")
# Fields present in the second schema but missing or different in the first.
dl2Notdl1 = diff(dl2_fields, dl1_fields)
print(str(len(dl2Notdl1)) + " columns in second df but not in first")
pprint.pprint(dl2Notdl1)
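
Note that with this approach a column whose type changed shows up in both lists, once with each type. If a per-column report is more convenient, something like the following sketch might help. It is not part of the original answer: compare_dtypes is a hypothetical helper, and it assumes its inputs have the shape returned by DataFrame.dtypes, that is, a list of (column_name, type_string) pairs. The literal lists below stand in for pDF1.dtypes and pDF2.dtypes so the example runs without a Spark session.

```python
def compare_dtypes(first, second):
    """Compare two lists of (column_name, type_string) pairs.

    Hypothetical helper, not a PySpark API. Returns the names found only
    in first, the names found only in second, and a dict mapping each
    shared name whose type differs to its (first_type, second_type).
    """
    first_types = dict(first)
    second_types = dict(second)
    only_first = [name for name in first_types if name not in second_types]
    only_second = [name for name in second_types if name not in first_types]
    changed = {
        name: (first_types[name], second_types[name])
        for name in first_types
        if name in second_types and first_types[name] != second_types[name]
    }
    return only_first, only_second, changed


# Assumed stand-ins for pDF1.dtypes and pDF2.dtypes from the question.
dtypes1 = [
    ("Scen_Id", "int"),
    ("Flow_Direction", "string"),
    ("Dataset_Type", "string"),
    ("Flag_Extrapolation_Percent_Change_Stay", "int"),
]
dtypes2 = [("Scen_Id", "string")] + dtypes1[1:]

only_first, only_second, changed = compare_dtypes(dtypes1, dtypes2)
print("only in first: ", only_first)
print("only in second:", only_second)
print("type changed:  ", changed)
```

With real DataFrames the call would presumably be compare_dtypes(pDF1.dtypes, pDF2.dtypes). Unlike the StructField comparison, this sketch ignores nullability.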
