Problem Description
I want to persist to BigTable a very wide Spark DataFrame (>100,000 columns) that is sparsely populated (>99% of values are null), while keeping only the non-null values (to avoid storage costs).
Is there a way to tell Spark to ignore nulls when writing?
Thanks!
Recommended Answer
Probably (I haven't tested it): before writing a Spark DataFrame to HBase/BigTable, you can transform it with a custom function that filters out the columns that are null in each row, as suggested here for a pandas example: https://stackoverflow.com/a/59641595/3227693. To the best of my knowledge, however, no built-in connector supports this out of the box.
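For illustration, here is a minimal PySpark sketch of that idea (untested; the toy schema, column names, and the map-based row layout are my own assumptions, not part of the original answer). It collapses each row into a map that keeps only the non-null cells, which you could then hand to whatever HBase/BigTable writer you use:

```python
# Hypothetical sketch: keep only non-null cells per row as a map<column_name, value>.
# The schema, data, and column names below are placeholders for illustration.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("sparse-write-sketch").getOrCreate()

schema = StructType([
    StructField("row_key", StringType()),
    StructField("c1", LongType()),
    StructField("c2", LongType()),
    StructField("c3", LongType()),
])
df = spark.createDataFrame(
    [("row1", 1, None, None), ("row2", None, None, 7)],
    schema,
)

value_cols = [c for c in df.columns if c != "row_key"]

# Build a map of column name -> value per row, then drop the null entries
# (map_filter requires Spark 3.1+).
sparse_df = df.select(
    "row_key",
    F.map_filter(
        F.create_map(*[part for c in value_cols for part in (F.lit(c), F.col(c))]),
        lambda k, v: v.isNotNull(),
    ).alias("cells"),
)

sparse_df.show(truncate=False)
# row1 -> {c1 -> 1}
# row2 -> {c3 -> 7}
```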
Alternatively, you can try storing the data in a columnar file format such as Parquet, since these formats handle the persistence of sparse columnar data efficiently (at least in terms of output size in bytes). However, to avoid writing many small files (a consequence of the data's sparse nature), which can reduce write throughput, you will probably need to decrease the number of output partitions before writing (i.e., write more rows per Parquet file: Spark parquet partitioning : Large number of files).
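A short sketch of that partition-reduction step, assuming `df` is the wide, sparse DataFrame from the previous sketch; the partition count and output path are placeholders to tune for your data volume:

```python
# Reduce the number of output partitions so each Parquet file holds more rows;
# 8 partitions and the output path are illustrative values only.
(
    df.coalesce(8)
      .write
      .mode("overwrite")
      .parquet("/tmp/sparse_wide_df_parquet")
)
```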