python - How to union Spark SQL DataFrames in Python

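Here are several ways of creating a union of DataFrames. Which one (if any) is best or recommended when we are talking about big DataFrames? Should I create an empty DataFrame first, or continuously union onto the first DataFrame created?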

Empty DataFrame creation

from pyspark.sql.types import StructType, StructField, StringType

# Three non-nullable string columns shared by every prediction DataFrame
schema = StructType([
    StructField("A", StringType(), False),
    StructField("B", StringType(), False),
    StructField("C", StringType(), False)
])

# Empty seed DataFrame; RDD.toDF is only available once a SparkSession exists
pred_union_df = spark_context.parallelize([]).toDF(schema)

Approach 1 - union as you go:

for ind in indications:
    fitted_model = get_fitted_model(pipeline, train_balanced_df, ind)
    pred = get_predictions(fitted_model, pred_output_df, ind)
    pred_union_df = pred_union_df.union(pred[['A', 'B', 'C']])

Approach 2 - union at the end:

all_pred = []
for ind in indications:
    fitted_model = get_fitted_model(pipeline, train_balanced_df, ind)
    pred = get_predictions(fitted_model, pred_output_df, ind)
    all_pred.append(pred)
pred_union_df = pred_union_df.union(all_pred)  # fails: DataFrame.union takes one DataFrame, not a list

Or am I getting this all wrong?

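Edit:
Approach 2 is not possible the way I thought it would be, as can be learned from the answer. I have to loop through the list and union each DataFrame.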
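
A minimal sketch of that loop, assuming every DataFrame in the list already has the same columns in the same order. `union_all` is a hypothetical helper name, not part of PySpark; it relies only on `DataFrame.union`, which takes exactly one other DataFrame.

```python
from functools import reduce


def union_all(dfs):
    """Union an iterable of DataFrames pairwise with DataFrame.union.

    Hypothetical helper: it only assumes each item has a .union(other)
    method, as pyspark.sql.DataFrame does.
    """
    dfs = list(dfs)
    if not dfs:
        raise ValueError("union_all needs at least one DataFrame")
    # reduce folds the list left to right: ((df0 U df1) U df2) U ...
    return reduce(lambda left, right: left.union(right), dfs)
```

With this, the empty seed DataFrame is no longer needed: `pred_union_df = union_all(p.select('A', 'B', 'C') for p in all_pred)`. `DataFrame.union` matches columns by position, so selecting them in a fixed order first matters.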

Best answer

Approach 2 is always preferred, since it avoids the long lineage issue.

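Although DataFrame.union only takes a single DataFrame as its argument, SparkContext.union does take a list of RDDs. Given your sample code, you could try to union them before calling toDF.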
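
One way this could look, as a sketch rather than a definitive implementation: take the underlying RDD of each DataFrame, union them all in one `SparkContext.union` call, and build a single DataFrame from the result. `union_via_rdds` is a hypothetical helper name, and `spark` is assumed to be an active `SparkSession`.

```python
def union_via_rdds(spark, dfs, schema):
    """Union many DataFrames through a single SparkContext.union call.

    Hypothetical helper. Assumes `spark` is a SparkSession and that every
    DataFrame in `dfs` has rows matching `schema`.
    """
    dfs = list(dfs)
    if not dfs:
        raise ValueError("union_via_rdds needs at least one DataFrame")
    # DataFrame.rdd exposes the rows as an RDD
    rdds = [df.rdd for df in dfs]
    # SparkContext.union accepts the whole list at once
    combined = spark.sparkContext.union(rdds)
    # Convert back to a DataFrame only once, at the end
    return spark.createDataFrame(combined, schema)
```

Going through `.rdd` in PySpark turns the rows into Python objects, so this trades a flatter lineage for extra serialization work; whether that pays off depends on the data and is worth measuring.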

If your data is on disk, you could also try to load it all at once to achieve the union, e.g.,

dataframe = spark.read.csv([path1, path2, path3])
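
A slightly fuller sketch of the same idea, assuming all files share one layout. `read_many_csv` is a hypothetical wrapper; it passes a list of paths plus an explicit schema to `spark.read.csv`, so the schema does not have to be inferred from the files.

```python
def read_many_csv(spark, paths, schema=None, header=None):
    """Load several CSV files into one DataFrame in a single read.

    Hypothetical wrapper around spark.read.csv, which accepts either a
    single path or a list of paths. Assumes `spark` is a SparkSession.
    """
    paths = list(paths)
    if not paths:
        raise ValueError("read_many_csv needs at least one path")
    # One read over all paths yields one DataFrame, so no union is needed
    return spark.read.csv(paths, schema=schema, header=header)
```

For example, `read_many_csv(spark, [path1, path2, path3], schema=schema)` would return the rows of all three files together.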

Regarding python - How to union Spark SQL DataFrames in Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45551524/