Question
In real life, df is a massive DataFrame that cannot be loaded into driver memory. Can this be done using a regular or pandas UDF?
# Code to generate a sample dataframe
from pyspark.sql import functions as F
from pyspark.sql.types import *
import pandas as pd
import numpy as np
sample = [
    ['123', [[0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1], [0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1]]],
    ['345', [[1,0,0,0,0,1,1,1,0,1,1,0,1,0,0,0,1,1,1,1,1,1], [0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1]]],
    ['425', [[1,1,0,0,0,1,0,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1], [0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1]]],
]
df = spark.createDataFrame(sample,["id", "data"])
Here's the logic that needs to be parallelized without relying on driver memory.
Input: Spark DataFrame
Output: numpy array to be fed into Horovod (something like this: https://docs.databricks.com/applications/deep-learning/distributed-training/mnist-tensorflow-keras.html)
pandas_df = df.toPandas() # Not possible in real life
data_array = np.asarray(list(pandas_df.data.values))
data_array = data_array.reshape(data_array.shape[0], data_array.shape[1], -1, 1, order='F')
data_array = data_array.reshape(data_array.shape[0],data_array.shape[1],-1,1,1,order="F").transpose(0,1,3,2,-1)
# Some more numpy specific transformations ..
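For reference, the reshape chain above can be checked on a small array with plain numpy (a minimal sketch, assuming each row holds two 22-element arrays as in the sample; the random input is stand-in data):

```python
import numpy as np

# Mimics the collected sample: 3 rows, each with two 22-element arrays
data_array = np.random.randint(0, 2, size=(3, 2, 22))

step1 = data_array.reshape(data_array.shape[0], data_array.shape[1], -1, 1, order='F')
step2 = step1.reshape(step1.shape[0], step1.shape[1], -1, 1, 1, order='F').transpose(0, 1, 3, 2, -1)
print(step1.shape)  # (3, 2, 22, 1)
print(step2.shape)  # (3, 2, 1, 22, 1)
```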
Here's what doesn't work:
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(ArrayType(IntegerType()), PandasUDFType.SCALAR)
def generate_feature(x):
    data_array = np.asarray(x)
    data_array = data_array.reshape(data_array.shape[0], ..
    ...
    return pd.Series(data_array)

df = df.withColumn("data_array", generate_feature(df.data))
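One possible workaround (a sketch, not something I have run against the real data) is to move the numpy work into mapInPandas (Spark >= 3.0), which streams Arrow batches through a generator on the executors, so the full dataframe never lands on the driver. The transformation function itself needs no Spark and can be exercised with plain pandas; the column names and shapes below are assumed from the sample:

```python
import numpy as np
import pandas as pd

def transform_batches(iterator):
    # Runs on the executors; each `batch` is a pandas DataFrame built
    # from one Arrow record batch, never the whole dataframe.
    for batch in iterator:
        arr = np.asarray(list(batch["data"]))  # (rows, 2, 22) for the sample
        arr = arr.reshape(arr.shape[0], arr.shape[1], -1, 1, order='F')
        # ... more numpy-specific transformations ...
        batch["data_array"] = list(arr.reshape(arr.shape[0], -1))
        yield batch

# Hypothetical usage on a cluster (schema assumed from the sample data):
# out = df.mapInPandas(
#     transform_batches,
#     schema="id string, data array<array<int>>, data_array array<int>")
```

Because `transform_batches` is an ordinary generator over pandas DataFrames, it can be unit-tested locally before being handed to Spark.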
Answer
I am trying to work on a similar case, though using images. I am looking towards Petastorm for doing this. You can save your data from an RDD to Parquet format and then use it in Horovod.
- I have yet to test this.
- How to fetch the dataset in parts using ranks in Horovod also needs to be tested.
Just a tip that could help.
Thanks.