Question
Wanted to take something like this https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java and create a Hive UDAF, i.e. an aggregate function that returns a data type guess.
Does Spark have something like this already built in? It would be very useful for exploring new, wide datasets. It would be helpful for ML too, e.g. to decide between categorical and numerical variables.
How do you normally determine data types in Spark?
P.S. Frameworks like h2o automatically determine data types by scanning a sample of the data, or the whole dataset. One can then decide, for example, whether a variable should be categorical or numerical.
P.P.S. Another use case is when you get an arbitrary dataset (we get them quite often) and want to save it as a Parquet table. Providing correct data types makes Parquet more space-efficient (and probably more performant at query time, e.g. better Parquet bloom filters than just storing everything as string/varchar).
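To make the Parquet use case concrete, here is a minimal sketch of that workflow, assuming an active SparkSession named `spark`; the paths and column names are hypothetical. Everything arrives as strings and is cast to explicit types before writing:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DateType, DecimalType, IntegerType}

// All columns are read as strings; cast them to proper types before
// writing, so Parquet can use typed encodings instead of raw strings.
val raw = spark.read.option("header", "true").csv("/data/input.csv")

val typed = raw
  .withColumn("user_id", col("user_id").cast(IntegerType))
  .withColumn("amount", col("amount").cast(DecimalType(18, 2)))
  .withColumn("event_date", col("event_date").cast(DateType))

typed.write.parquet("/data/output.parquet")
```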
Answer
Partially. There are some tools in the Spark ecosystem which perform schema inference, like spark-csv or pyspark-csv, and category inference (categorical vs. numerical), like VectorIndexer.
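For instance, VectorIndexer flags a feature as categorical when it has at most `maxCategories` distinct values. A minimal sketch, assuming a DataFrame `df` whose numeric columns (hypothetical names here) are first assembled into a vector:

```scala
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}

// Assemble raw numeric columns into a single feature vector.
val assembled = new VectorAssembler()
  .setInputCols(Array("age", "zip_code", "income"))
  .setOutputCol("features")
  .transform(df)

// VectorIndexer treats any feature with <= maxCategories distinct
// values as categorical and indexes it; the rest stay numerical.
val indexerModel = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(10)
  .fit(assembled)

// categoryMaps lists the feature indices the model decided were
// categorical, with their value-to-index mappings.
println(indexerModel.categoryMaps.keys.toSeq.sorted)
```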
So far so good. The problem is that schema inference has limited applicability, is not an easy task in general, can introduce hard-to-diagnose problems, and can be quite expensive:
- There are not that many formats which can be used with Spark and may require schema inference. In practice it is limited to different variants of CSV and fixed-width formatted data.
- Depending on the data representation, it can be impossible to determine the correct data type, or the inferred type can lead to information loss:
  - interpreting numeric data as floats or doubles can lead to unacceptable loss of precision, especially when working with financial data (see the schema sketch after this list)
  - date or number formats can differ between locales
  - some common identifiers can look like numbers while having an internal structure that can be lost in conversion
- Automatic schema inference can mask different problems with the input data, and if it is not supported by additional tools which can highlight possible issues, it can be dangerous. Moreover, any mistakes made during data loading and cleaning can propagate through the complete data processing pipeline.
- Arguably, we should develop a good understanding of the input data before we even start to think about possible representations and encodings.
- Schema inference and/or category inference may require a full data scan and/or large lookup tables. Both can be expensive or even infeasible on large datasets.
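Given these pitfalls, a common alternative is to skip inference entirely and declare the schema up front. A minimal sketch, assuming an active SparkSession `spark`; the column names and path are hypothetical:

```scala
import org.apache.spark.sql.types._

// Declaring the schema avoids the inference pass and the pitfalls above:
// amounts stay exact decimals, and identifiers stay strings.
val schema = StructType(Seq(
  StructField("account_id", StringType, nullable = false), // looks numeric, but is an identifier
  StructField("amount", DecimalType(18, 2)),               // exact decimal, no float rounding
  StructField("booked_at", TimestampType)
))

val df = spark.read
  .option("header", "true")
  .schema(schema) // no inference pass over the data
  .csv("/data/transactions.csv")
```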
Edit:
It looks like schema inference capabilities for CSV files have been added directly to Spark SQL. See CSVInferSchema.
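In the built-in reader this is exposed via the `inferSchema` option, at the cost of an extra pass over the input. A minimal sketch, assuming an active SparkSession `spark` and a hypothetical path:

```scala
// inferSchema makes Spark read the input once just to guess the
// column types, then again to load the data.
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/input.csv")

inferred.printSchema() // prints the guessed types
```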