Problem description
I am new to ORC files. I went through many blogs but did not get a clear understanding. Please help clarify the questions below.
1. Can I fetch the schema from an ORC file? I know that in Avro the schema can be fetched.
2. How does it actually provide schema evolution? I know that a few columns can be added, but how is that done? The only way I know to create an ORC file is by loading data into a Hive table that stores its data in ORC format.
3. How does the ORC file index work? What I know is that an index is maintained for every stripe. But since the file is not sorted, how does it help when looking up data in a list of stripes? How does it help skip stripes during a lookup?
4. Is an index maintained for every column? If yes, won't that consume more memory?
5. How can a columnar format like ORC, where the values of each column are stored together, fit a Hive table, which is built to fetch data record by record? How do the two fit together?
1. and 2. Use Hive and/or HCatalog to create, read, and update the ORC table structure in the Hive metastore (HCatalog is just a side door that enables Pig/Sqoop/Spark/whatever to access the metastore directly).
2. The ALTER TABLE command lets you add or drop columns whatever the storage type, ORC included. But beware of a nasty bug that may crash vectorized reads afterwards (at least in V0.13 and V0.14).
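To make the add-column case concrete, here is a minimal Python sketch of the idea. It is a toy model, not Hive's actual reader code: files written before the ALTER TABLE simply lack the new column, and the reader fills it with NULL (None here). The names read_with_table_schema, old_file, and new_file are made up for illustration, and matching columns by name is a simplifying assumption (a real reader may match by position instead).

```python
def read_with_table_schema(file_columns, file_rows, table_columns):
    """Project rows from a file onto the current table schema (toy model).

    Columns present in the table schema but absent from the file
    come back as None, mimicking NULL for columns added later.
    """
    position = {name: i for i, name in enumerate(file_columns)}
    result = []
    for row in file_rows:
        result.append(tuple(
            row[position[name]] if name in position else None
            for name in table_columns
        ))
    return result


# Hypothetical file written before the "city" column was added.
old_file = (["id", "name"], [(1, "ann"), (2, "bob")])
# Hypothetical file written after the column was added.
new_file = (["id", "name", "city"], [(3, "cat", "Oslo")])

# Current table schema, as the metastore would report it after ALTER TABLE.
table_columns = ["id", "name", "city"]

rows = []
for columns, data in (old_file, new_file):
    rows.extend(read_with_table_schema(columns, data, table_columns))

print(rows)
```

The point is that the existing files are never rewritten: only the table definition in the metastore changes, and the mismatch is resolved at read time.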
3. and 4. The term "index" is rather inappropriate. Basically it's just min/max information persisted in the stripe footer at write time, then used at read time to skip all stripes that clearly do not meet the WHERE requirements, drastically reducing I/O in some cases (a trick that has become popular in column stores, e.g. InfoBright on MySQL, but also in Oracle Exadata appliances [dubbed "smart scan" by Oracle marketing]).
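The min/max trick can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Stripe and scan_equal are hypothetical names, a stripe is reduced to a single integer column, and the predicate is a plain equality test. It does not reflect the real ORC file layout or reader API.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    """Toy stand-in for an ORC stripe: one column's values plus min/max."""
    values: list
    min_value: int = 0
    max_value: int = 0

    def __post_init__(self):
        # Statistics are computed once, when the stripe is "written".
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_equal(stripes, target):
    """Return matching values and how many stripes were actually read."""
    matches, stripes_read = [], 0
    for stripe in stripes:
        # Skip the stripe if target cannot lie inside [min, max].
        if target < stripe.min_value or target > stripe.max_value:
            continue
        stripes_read += 1
        matches.extend(v for v in stripe.values if v == target)
    return matches, stripes_read


stripes = [Stripe([3, 8, 5]), Stripe([40, 12, 33]), Stripe([7, 90, 41])]
matches, stripes_read = scan_equal(stripes, 33)
print(matches, stripes_read)
```

Here the first stripe is skipped without reading its values, but the third is still read because its range 7..90 covers 33. On unsorted data the ranges of different stripes overlap, so how much I/O is saved depends on how clustered the values happen to be.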
5. Hive works with "row store" formats (Text, SequenceFile, Avro) and "column store" formats (ORC, Parquet) alike. The optimizer just uses specific strategies and shortcuts in the initial Map phase, e.g. stripe elimination and vectorized operators, and of course the serialization/deserialization phases are a bit more elaborate with column stores.
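As a rough illustration of why a column layout can still serve record-by-record consumers, the sketch below stores each column as its own list, reads only the columns a query asks for, and stitches the values back into rows by position. It is a toy model with made-up data and a hypothetical read_rows helper, not Hive's SerDe.

```python
# Toy column store: each column's values are kept together.
columns = {
    "id": [1, 2, 3],
    "name": ["ann", "bob", "cat"],
    "city": ["Oslo", "Rome", "Lima"],
}


def read_rows(columns, wanted):
    """Yield one tuple per record, touching only the requested columns."""
    selected = [columns[name] for name in wanted]
    # The i-th entries of every column belong to the same record.
    yield from zip(*selected)


# Equivalent in spirit to "SELECT id, city FROM t": "name" is never read.
rows = list(read_rows(columns, ["id", "city"]))
print(rows)
```

The deserialization step does the extra work of reassembling records, while everything downstream still sees ordinary rows.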