Where is the Avro schema stored when I create a Hive table with the "STORED AS AVRO" clause?

Question

There are at least two different ways of creating a Hive table backed by Avro data:

1) Creating a table based on an Avro schema (in this example stored in HDFS):

CREATE TABLE users_from_avro_schema
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/root/avro/schema/user.avsc');

2) Creating a table by specifying the Hive columns explicitly with the STORED AS AVRO clause:

CREATE TABLE users_stored_as_avro(
  id INT,
  name STRING
) STORED AS AVRO;

Am I correct that in the first case the metadata of the users_from_avro_schema table is not stored in the Hive Metastore, but inferred from the SerDe class reading the Avro schema file? Or maybe the table metadata is stored in the Metastore and added on the table's creation, but what is then the policy for synchronising Hive metadata with the Avro schema? I mean both cases: updating table metadata (adding/removing columns) and updating the Avro schema by changing the avro.schema.url property.

In the second case, when I call DESCRIBE FORMATTED users_stored_as_avro, there is no avro.schema.* property defined, so I don't know which Avro schema is used to read/write data. Is it generated dynamically based on the table's metadata stored in the Metastore?

This fragment of the Programming Hive book talks about inferring information about columns from the SerDe class, but on the other hand HIVE-4703 removes this "from deserializer" info from column comments. How can I then check what the source of column types is for a given table (the Metastore or the Avro schema)?

Answer

I decided to publish an answer complementary to those given by @DuduMarkovitz.

To make the code examples more concise, let's clarify that the STORED AS AVRO clause is equivalent to these three lines:

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
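If you want to verify this on your own cluster, DESCRIBE FORMATTED on a table created with STORED AS AVRO should report the same SerDe and input/output format classes (the output below is abbreviated and its exact layout may differ between Hive versions):

hive> DESCRIBE FORMATTED users_stored_as_avro;
...
SerDe Library:       org.apache.hadoop.hive.serde2.avro.AvroSerDe
InputFormat:         org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
OutputFormat:        org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat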

Let's take a look then at what happens when we create a table by giving it a reference to an Avro schema stored in HDFS. Here is the schema:

{
  "namespace": "io.sqooba",
  "name": "user",
  "type": "record",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"}
  ]
}
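The table definition below assumes this schema file is already available in HDFS. A minimal sketch for uploading it from the Hive CLI, assuming the schema has been saved locally as user.avsc and using the same target directory as this example (adjust the paths to your environment):

hive> dfs -mkdir -p /user/tulinski;
hive> dfs -put user.avsc /user/tulinski/user.avsc;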

We create our table with the following command:

CREATE TABLE users_from_avro_schema
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/tulinski/user.avsc');

Hive has inferred the schema properly, which we can see by calling:

hive> DESCRIBE users_from_avro_schema;
OK
id                      int
name                    string

The Hive Metastore shows us the same (I use @DuduMarkovitz's query):

+------------------------+-------------+-------------+-----------+
| tbl_name               | column_name | integer_idx | type_name |
+------------------------+-------------+-------------+-----------+
| users_from_avro_schema | id          |           0 | int       |
| users_from_avro_schema | name        |           1 | string    |
+------------------------+-------------+-------------+-----------+
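The query itself is not repeated in this post; a rough sketch of what such a query looks like, run directly against the Metastore database and assuming its standard relational schema (exact table and column names can vary with the Hive version and the backing RDBMS):

SELECT t.TBL_NAME, c.COLUMN_NAME, c.INTEGER_IDX, c.TYPE_NAME
FROM TBLS t
JOIN SDS s ON t.SD_ID = s.SD_ID
JOIN COLUMNS_V2 c ON s.CD_ID = c.CD_ID
WHERE t.TBL_NAME = 'users_from_avro_schema'
ORDER BY c.INTEGER_IDX;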

So far, so good, everything works as we expect. But let's see what happens when we update the avro.schema.url property to point to the next version of our schema (user_v2.avsc), which is as follows:

{
  "namespace": "io.sqooba",
  "name": "user",
  "type": "record",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default":null}
  ]
}

We simply added another field called email.
Now we update the table property pointing to the Avro schema in HDFS:

ALTER TABLE users_from_avro_schema SET TBLPROPERTIES('avro.schema.url'='hdfs:///user/tulinski/user_v2.avsc');

Has the table metadata been changed?

hive> DESCRIBE users_from_avro_schema;
OK
id                      int
name                    string
email                   string

Yeah, cool! But do you expect that the Hive Metastore contains this additional column?
Unfortunately, nothing changed in the Metastore:

+------------------------+-------------+-------------+-----------+
| tbl_name               | column_name | integer_idx | type_name |
+------------------------+-------------+-------------+-----------+
| users_from_avro_schema | id          |           0 | int       |
| users_from_avro_schema | name        |           1 | string    |
+------------------------+-------------+-------------+-----------+

I suspect that Hive has the following strategy of inferring the schema: it tries to get it from the SerDe class specified for a given table. When the SerDe cannot provide the schema, Hive looks into the Metastore.
Let's check that by removing the avro.schema.url property:

hive> ALTER TABLE users_from_avro_schema UNSET TBLPROPERTIES ('avro.schema.url');
OK
Time taken: 0.33 seconds
hive> DESCRIBE users_from_avro_schema;
OK
id                      int
name                    string
Time taken: 0.363 seconds, Fetched: 2 row(s)

DESCRIBE now shows us the data stored in the Metastore. Let's modify it by adding a column:

ALTER TABLE users_from_avro_schema ADD COLUMNS (phone string);

This, of course, changes the Hive Metastore:

+------------------------+-------------+-------------+-----------+
| tbl_name               | column_name | integer_idx | type_name |
+------------------------+-------------+-------------+-----------+
| users_from_avro_schema | id          |           0 | int       |
| users_from_avro_schema | name        |           1 | string    |
| users_from_avro_schema | phone       |           2 | string    |
+------------------------+-------------+-------------+-----------+

But when we set avro.schema.url back to user_v2.avsc again, whatever is stored in the Hive Metastore doesn't matter any more:

hive> ALTER TABLE users_from_avro_schema SET TBLPROPERTIES('avro.schema.url'='hdfs:///user/tulinski/user_v2.avsc');
OK
Time taken: 0.268 seconds
hive> DESCRIBE users_from_avro_schema;
OK
id                      int
name                    string
email                   string

The Avro schema takes precedence over the Metastore.

The above example shows that we should rather avoid mixing Hive schema changes with Avro schema evolution, because otherwise we can easily end up with a big mess and inconsistency between the Hive Metastore and the actual schema used while reading and writing data. The first inconsistency occurs when we change our Avro schema definition by updating the avro.schema.url property, but we can live with that if we are aware of Hive's strategy for inferring the schema. I haven't checked in Hive's source code whether my suspicions about the schema logic are correct, but the above example convinces me of what happens underneath.
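If the Metastore columns do drift out of sync like this, one pragmatic way back to a single source of truth is to keep the data in an external location and simply recreate the table definition from the Avro schema instead of ALTER-ing columns. A hedged sketch, assuming an external table whose LOCATION (hypothetical path below) holds the existing Avro files:

-- Dropping an EXTERNAL table removes only its definition, not the data files.
DROP TABLE IF EXISTS users_from_avro_schema;
CREATE EXTERNAL TABLE users_from_avro_schema
STORED AS AVRO
LOCATION 'hdfs:///user/tulinski/users_data'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/tulinski/user_v2.avsc');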

I extended my answer to show that even when there is a conflict between the Avro schema and the Hive Metastore, data that complies with the Avro schema can still be read. Please have a look again at my example above. Our table definition points to an Avro schema with three fields:

id    int
name  string
email string

whereas in the Hive Metastore there are the following columns:

id    int
name  string
phone string

email vs phone
Let's create an Avro file containing a single user record complying with the user_v2.avsc schema. This is its JSON representation:

{
  "id": 123,
  "name": "Tomek",
  "email": {"string": "tomek@tomek"}
}

To create the Avro file we call:

java -jar avro-tools-1.8.2.jar fromjson --schema-file user_v2.avsc user_tomek_v2.json > user_tomek_v2.avro
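The post does not show how this file ends up under the table's data location; one simple way to get it there for this test is to load it through Hive (a sketch, assuming the generated file sits in the local directory the Hive CLI was started from):

hive> LOAD DATA LOCAL INPATH 'user_tomek_v2.avro' INTO TABLE users_from_avro_schema;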

We are able to query our table despite the fact that the Hive Metastore doesn't contain an email column and contains a phone column instead:

hive> set hive.cli.print.header=true;
hive> select * from users_from_avro_schema;
OK
users_from_avro_schema.id   users_from_avro_schema.name users_from_avro_schema.email
123 Tomek   tomek@tomek

