Problem description

Is it possible to add extra metadata to DataFrames?
I have Spark DataFrames for which I need to keep extra information. Example: a DataFrame for which I want to "remember" the highest used index in an Integer id column.
I use a separate DataFrame to store this information. Of course, keeping this information separately is tedious and error-prone.
Is there a better solution to store such extra information on DataFrames?
Recommended answer
To expand on and Scala-fy nealmcb's answer (the question was tagged scala, not python, so I don't think this answer will be off-topic or redundant), suppose you have a DataFrame:
import org.apache.spark.sql
val df = sc.parallelize(Seq.fill(100) { scala.util.Random.nextInt() }).toDF("randInt")
And some way to get the max or whatever you want to memoize on the DataFrame:
val randIntMax = df.rdd.map { case sql.Row(randInt: Int) => randInt }.reduce(math.max)
sql.types.Metadata can only hold strings, booleans, some types of numbers, and other metadata structures. So we have to use a Long:
val metadata = new sql.types.MetadataBuilder().putLong("columnMax", randIntMax).build()
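For reference, MetadataBuilder has put methods for the other supported value types as well; a quick illustrative sketch (the keys "description", "validated", "sampleFraction", "nested" and "range" are made up for this example, not anything Spark expects):

val richerMetadata = new sql.types.MetadataBuilder()
  .putLong("columnMax", randIntMax)
  .putString("description", "random integers")            // strings
  .putBoolean("validated", true)                           // booleans
  .putDouble("sampleFraction", 1.0)                        // doubles
  .putMetadata("nested", new sql.types.MetadataBuilder()   // nested metadata structures
    .putLongArray("range", Array(Int.MinValue.toLong, Int.MaxValue.toLong))
    .build())
  .build()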
DataFrame.withColumn() actually has an overload that permits supplying a metadata argument at the end, but it's inexplicably marked [private], so we just do what it does and use Column.as(alias, metadata):
val newColumn = df.col("randInt").as("randInt_withMax", metadata)
val dfWithMax = df.withColumn("randInt_withMax", newColumn)
dfWithMax now has (a column with) the metadata you want!
dfWithMax.schema.foreach(field => println(s"${field.name}: metadata=${field.metadata}"))
> randInt: metadata={}
> randInt_withMax: metadata={"columnMax":2094414111}
Or programmatically and type-safely (sort of; Metadata.getLong() and others do not return Option and may throw a "key not found" exception):
dfWithMax.schema("randInt_withMax").metadata.getLong("columnMax")
> res29: Long = 2094414111
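If you would rather not risk that exception, Metadata exposes a contains() check you can wrap yourself; a minimal sketch, assuming the same dfWithMax and "columnMax" key as above:

// Guard against a missing key before calling getLong, which would otherwise throw.
val fieldMetadata = dfWithMax.schema("randInt_withMax").metadata
val columnMax: Option[Long] =
  if (fieldMetadata.contains("columnMax")) Some(fieldMetadata.getLong("columnMax"))
  else None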
Attaching the max to a column makes sense in your case, but in the general case of attaching metadata to a DataFrame and not a column in particular, it appears you'd have to take the wrapper route described by the other answers.
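For illustration, one minimal, hypothetical shape such a wrapper could take (DataFrameWithMeta and its keys are made up here, not a Spark API) is simply pairing the DataFrame with an ordinary Map:

case class DataFrameWithMeta(df: sql.DataFrame, meta: Map[String, Any])

val wrapped = DataFrameWithMeta(df, Map("highestUsedId" -> randIntMax))
// Transformations return plain DataFrames, so the metadata has to be carried forward by hand:
val filtered = wrapped.copy(df = wrapped.df.filter(wrapped.df("randInt") > 0))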