Problem description
I'm trying to get Spark to read uncompressed Thrift files from S3. So far it has not worked.
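- The data is loaded into S3 as uncompressed Thrift files. The source is AWS Kinesis Firehose.
- I have a tool that deserializes the files with no problem, so I know the Thrift serialization/deserialization works.
- In Spark, I'm using newAPIHadoopFile.
- Using Elephant Bird's LzoThriftBlockInputFormat, I can successfully read LZO-compressed Thrift files.
- I can't figure out which InputFormat I should use to read uncompressed Thrift files.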
Is that possible with any of the InputFormats out there? Do I have to implement my own?
Recommended answer
I ended up writing my own custom Thrift deserializer.
I needed to implement a custom InputFormat and a custom RecordReader. I'm still surprised that such classes don't already exist in some library. The two classes have been tested and work, but since I stopped working on the project soon after I solved this, the code is not cleaned up.
https://github.com/mklosi/thrift-deserializer
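For anyone wondering what such a RecordReader has to do at its core, here is a minimal sketch of the record-boundary logic. It is not the code from the repository above. It assumes the files are structs encoded with Thrift's binary protocol (TBinaryProtocol) and written back to back with no delimiter or length prefix between them, which the question does not confirm. The class name `ThriftRecordSplitter` and its methods are hypothetical, and it uses only the JDK rather than libthrift so it can run on its own.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not from the linked repo): splits concatenated TBinaryProtocol structs. */
final class ThriftRecordSplitter {
    // Wire type ids used by Thrift's binary protocol.
    static final byte STOP = 0, BOOL = 2, BYTE = 3, DOUBLE = 4, I16 = 6, I32 = 8,
            I64 = 10, STRING = 11, STRUCT = 12, MAP = 13, SET = 14, LIST = 15;

    /** Returns one byte[] per top-level struct found in the buffer. */
    static List<byte[]> split(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data); // big-endian by default, matching the binary protocol
        List<byte[]> records = new ArrayList<>();
        while (buf.hasRemaining()) {
            int start = buf.position();
            skipStruct(buf); // advance to just past this struct's STOP byte
            byte[] rec = new byte[buf.position() - start];
            System.arraycopy(data, start, rec, 0, rec.length);
            records.add(rec);
        }
        return records;
    }

    /** A struct is a sequence of (type, field id, value) entries ended by a STOP byte. */
    static void skipStruct(ByteBuffer buf) {
        while (true) {
            byte type = buf.get();
            if (type == STOP) return;
            buf.getShort(); // 16-bit field id, not needed for finding boundaries
            skipValue(buf, type);
        }
    }

    static void skipValue(ByteBuffer buf, byte type) {
        switch (type) {
            case BOOL: case BYTE: advance(buf, 1); break;
            case I16: advance(buf, 2); break;
            case I32: advance(buf, 4); break;
            case I64: case DOUBLE: advance(buf, 8); break;
            case STRING: advance(buf, buf.getInt()); break; // 32-bit length, then the bytes
            case STRUCT: skipStruct(buf); break;
            case MAP: {
                byte keyType = buf.get(), valueType = buf.get();
                int size = buf.getInt();
                for (int i = 0; i < size; i++) { skipValue(buf, keyType); skipValue(buf, valueType); }
                break;
            }
            case SET: case LIST: {
                byte elemType = buf.get();
                int size = buf.getInt();
                for (int i = 0; i < size; i++) skipValue(buf, elemType);
                break;
            }
            default: throw new IllegalArgumentException("Unknown Thrift type id: " + type);
        }
    }

    static void advance(ByteBuffer buf, int n) {
        if (n < 0 || n > buf.remaining()) throw new IllegalArgumentException("Truncated or corrupt record");
        buf.position(buf.position() + n);
    }
}
```

Calling `ThriftRecordSplitter.split(bytes)` on the contents of one file gives back the raw bytes of each record, which could then be handed to a regular Thrift deserializer. In an actual RecordReader the same walk would presumably run against the input stream inside `nextKeyValue()`. Since a layout like this has no sync markers, the InputFormat would likely also have to report files as non-splittable, so that each file is read from the start by a single task.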