Problem description

Is it possible to add extra metadata to DataFrames?
I have Spark DataFrames for which I need to keep extra information. Example: a DataFrame for which I want to "remember" the highest used index in an Integer id column.
I use a separate DataFrame to store this information. Of course, keeping this information separately is tedious and error-prone.
Is there a better solution to store such extra information on DataFrames?
Recommended answer
To expand on and Scala-fy nealmcb's answer (the question was tagged scala, not python, so I don't think this answer will be off-topic or redundant), suppose you have a DataFrame:
import org.apache.spark.sql
val df = sc.parallelize(Seq.fill(100) { scala.util.Random.nextInt() }).toDF("randInt")
And some way to get the max or whatever you want to memoize on the DataFrame:
val randIntMax = df.rdd.map { case sql.Row(randInt: Int) => randInt }.reduce(math.max)
sql.types.Metadata can only hold strings, booleans, some types of numbers, and other metadata structures. So we have to use a Long:
val metadata = new sql.types.MetadataBuilder().putLong("columnMax", randIntMax).build()
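For reference, MetadataBuilder has put methods for the other supported value types as well; a quick illustrative sketch (the keys "description", "validated", "sampleFraction", "nested" and "range" are made up for this example, not anything Spark expects):

val richerMetadata = new sql.types.MetadataBuilder()
  .putLong("columnMax", randIntMax)
  .putString("description", "random integers")            // strings
  .putBoolean("validated", true)                           // booleans
  .putDouble("sampleFraction", 1.0)                        // doubles
  .putMetadata("nested", new sql.types.MetadataBuilder()   // nested metadata structures
    .putLongArray("range", Array(Int.MinValue.toLong, Int.MaxValue.toLong))
    .build())
  .build()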
DataFrame.withColumn() actually has an overload that permits supplying a metadata argument at the end, but it's inexplicably marked [private], so we just do what it does and use Column.as(alias, metadata):
val newColumn = df.col("randInt").as("randInt_withMax", metadata)
val dfWithMax = df.withColumn("randInt_withMax", newColumn)
dfWithMax now has (a column with) the metadata you want!
dfWithMax.schema.foreach(field => println(s"${field.name}: metadata=${field.metadata}"))
> randInt: metadata={}
> randInt_withMax: metadata={"columnMax":2094414111}
Or programmatically and type-safely (sort of; Metadata.getLong() and others do not return Option and may throw a "key not found" exception):
dfWithMax.schema("randInt_withMax").metadata.getLong("columnMax")
> res29: Long = 2094414111
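If you would rather not risk that exception, Metadata exposes a contains() check you can wrap yourself; a minimal sketch, assuming the same dfWithMax and "columnMax" key as above:

// Guard against a missing key before calling getLong, which would otherwise throw.
val fieldMetadata = dfWithMax.schema("randInt_withMax").metadata
val columnMax: Option[Long] =
  if (fieldMetadata.contains("columnMax")) Some(fieldMetadata.getLong("columnMax"))
  else None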
Attaching the max to a column makes sense in your case, but in the general case of attaching metadata to a DataFrame and not a column in particular, it appears you'd have to take the wrapper route described by the other answers.
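For illustration, one minimal, hypothetical shape such a wrapper could take (DataFrameWithMeta and its keys are made up here, not a Spark API) is simply pairing the DataFrame with an ordinary Map:

case class DataFrameWithMeta(df: sql.DataFrame, meta: Map[String, Any])

val wrapped = DataFrameWithMeta(df, Map("highestUsedId" -> randIntMax))
// Transformations return plain DataFrames, so the metadata has to be carried forward by hand:
val filtered = wrapped.copy(df = wrapped.df.filter(wrapped.df("randInt") > 0))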