Apache Avro™ 1.8.2 Specification

3 Data Serialization

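Avro data is always serialized with its schema. Files that store Avro data should always include the schema for that data in the same file. Avro-based RPC systems must guarantee that the remote recipient of data has a copy of the schema used to write that data.

Because the schema used to write data is always available when the data is read, Avro data itself carries no type information. The schema is required to parse the data.

In general, both serialization and deserialization proceed as a depth-first, left-to-right traversal of the schema, serializing primitive types as they are encountered.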

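3.1 Encodings

Avro specifies two serialization encodings: binary and JSON. Most applications will use the binary encoding, because it is smaller and faster. For debugging and web-based applications, however, the JSON encoding is sometimes more appropriate.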

3.2 Binary Encoding

3.2.1 Primitive Types

Primitive types are encoded in binary as follows:

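null is written as zero bytes.
boolean is written as a single byte whose value is either 0 (false) or 1 (true).
int and long values are written using variable-length zig-zag coding. Some examples:

value	hex
0	00
-1	01
1	02
-2	03
2	04
...
-64	7f
64	80 01
...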

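float is written as 4 bytes. The float is converted into a 32-bit integer using a method equivalent to Java's floatToIntBits and then encoded in little-endian format.

double is written as 8 bytes. The double is converted into a 64-bit integer using a method equivalent to Java's doubleToLongBits and then encoded in little-endian format.

bytes are encoded as a long followed by that many bytes of data.
string is encoded as a long followed by that many bytes of UTF-8 encoded character data.

For example, the three-character string "foo" would be encoded as the long value 3 (encoded as hex 06) followed by the UTF-8 encoding of 'f', 'o', and 'o' (the hex bytes 66 6f 6f).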
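The primitive rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the reference implementation: the function names are made up for this example, and Python's struct module stands in for Java's floatToIntBits and doubleToLongBits conversions.

```python
import struct

def encode_long(n: int) -> bytes:
    """Encode an Avro int or long: zig-zag, then a base-128 varint."""
    # Zig-zag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ... so that values of
    # small magnitude need few bytes regardless of sign.
    z = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_float(x: float) -> bytes:
    return struct.pack("<f", x)  # 4 bytes, IEEE 754, little-endian

def encode_double(x: float) -> bytes:
    return struct.pack("<d", x)  # 8 bytes, IEEE 754, little-endian

def encode_bytes(data: bytes) -> bytes:
    return encode_long(len(data)) + data  # length prefix, then raw bytes

def encode_string(s: str) -> bytes:
    return encode_bytes(s.encode("utf-8"))  # length counts bytes, not characters

print(encode_long(-64).hex(" "))      # 7f
print(encode_long(64).hex(" "))       # 80 01
print(encode_string("foo").hex(" "))  # 06 66 6f 6f
```

Note that the length prefix of a string counts UTF-8 bytes, so a string containing non-ASCII characters has a prefix larger than its character count.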

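3.2.2 Complex Types

Complex types are encoded in binary as follows:

3.2.2.1 Records

A record is encoded by encoding the values of its fields in the order in which they are declared. In other words, the encoding of a record is just the concatenation of the encodings of its fields. Field values are encoded per their own schemas.

For example, consider the record schema: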

{
  "type": "record",
  "name": "test",
  "fields" : [
    {"name": "a", "type": "long"},
    {"name": "b", "type": "string"}
  ]
}

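An instance of this record whose a field has the value 27 (encoded as hex 36) and whose b field has the value "foo" (encoded as hex bytes 06 66 6f 6f) is encoded simply as the concatenation of these, namely the hex byte sequence:

36 06 66 6f 6f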
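As a rough sketch of the record rule, the Python below encodes the example record by concatenating field encodings in declaration order. The ENCODERS table and encode_record are hypothetical names for this illustration, and only the two primitive types used by the example schema are handled.

```python
def encode_long(n: int) -> bytes:
    """Zig-zag plus varint encoding used for Avro int and long values."""
    z = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

# Hypothetical dispatch table: covers only the types in the example schema.
ENCODERS = {"long": encode_long, "string": encode_string}

def encode_record(schema: dict, datum: dict) -> bytes:
    # Walk the fields in declaration order; no field names or type tags
    # are written, only the field values back to back.
    return b"".join(
        ENCODERS[field["type"]](datum[field["name"]])
        for field in schema["fields"]
    )

schema = {
    "type": "record",
    "name": "test",
    "fields": [
        {"name": "a", "type": "long"},
        {"name": "b", "type": "string"},
    ],
}
print(encode_record(schema, {"a": 27, "b": "foo"}).hex(" "))  # 36 06 66 6f 6f
```

Because nothing but the values is written, a reader that lacks the writer's schema cannot tell where one field ends and the next begins.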

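3.2.2.2 Enums

An enum is encoded by an int representing the zero-based position of the symbol in the schema.

For example, consider the enum:

{"type": "enum", "name": "Foo", "symbols": ["A", "B", "C", "D"] }

This would be encoded by an int between zero and three, with zero indicating "A" and three indicating "D".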
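A small sketch of the enum rule, assuming the same zig-zag varint coding used for every Avro int. The helper names are illustrative only.

```python
def encode_int(n: int) -> bytes:
    """Zig-zag plus varint encoding (same wire format as long)."""
    z = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_enum(schema: dict, symbol: str) -> bytes:
    # The symbol text is never written, only its zero-based position.
    return encode_int(schema["symbols"].index(symbol))

foo = {"type": "enum", "name": "Foo", "symbols": ["A", "B", "C", "D"]}
print(encode_enum(foo, "A").hex())  # 00
print(encode_enum(foo, "D").hex())  # 06 (position 3 after zig-zag)
```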

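3.2.2.3 Arrays

Arrays are encoded as a series of blocks. Each block consists of a long count value followed by that many array items. A block with a count of zero indicates the end of the array. Each item is encoded per the array's item schema.

If a block's count is negative, its absolute value is used, and the count is followed immediately by a long block size indicating the number of bytes in the block. This block size permits fast skipping through data, for example when projecting a record to a subset of its fields.

For example, consider the array schema:

{"type": "array", "items": "long"}

An array containing the items 3 and 27 could be encoded as the long value 2 (encoded as hex 04), followed by the long values 3 and 27 (encoded as hex 06 36), terminated by zero:

04 06 36 00

The blocked representation permits reading and writing arrays larger than can be buffered in memory, since items can be written without knowing the full length of the array.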
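The block layout described above can be sketched as follows. This is an illustrative encoder and decoder for arrays of long only. It writes the whole array as a single block, which is a simplification (a real writer may emit many blocks), and the with_block_size flag is a name invented for this example.

```python
def encode_long(n: int) -> bytes:
    z = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def decode_long(buf: bytes, pos: int):
    """Return (value, next_position) for the varint starting at pos."""
    shift = result = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1), pos  # undo zig-zag

def encode_long_array(items, with_block_size=False) -> bytes:
    out = bytearray()
    if items:
        body = b"".join(encode_long(x) for x in items)
        if with_block_size:
            # Negative count, then the byte length of the block's items.
            out += encode_long(-len(items)) + encode_long(len(body))
        else:
            out += encode_long(len(items))
        out += body
    out += encode_long(0)  # a zero count terminates the array
    return bytes(out)

def decode_long_array(buf: bytes, pos: int = 0):
    items = []
    while True:
        count, pos = decode_long(buf, pos)
        if count == 0:
            return items, pos
        if count < 0:
            count = -count
            _size, pos = decode_long(buf, pos)  # a skipping reader would jump by this
        for _ in range(count):
            value, pos = decode_long(buf, pos)
            items.append(value)

print(encode_long_array([3, 27]).hex(" "))                        # 04 06 36 00
print(encode_long_array([3, 27], with_block_size=True).hex(" "))  # 03 04 06 36 00
```

A reader that only wants to skip the array can, for blocks with a negative count, advance by the block size instead of decoding each item.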

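3.2.2.4 Maps

Maps are encoded as a series of blocks. Each block consists of a long count value followed by that many key/value pairs. A block with a count of zero indicates the end of the map. Each item is encoded per the map's value schema.

If a block's count is negative, its absolute value is used, and the count is followed immediately by a long block size indicating the number of bytes in the block. This block size permits fast skipping through data, for example when projecting a record to a subset of its fields.

The blocked representation permits reading and writing maps larger than can be buffered in memory, since items can be written without knowing the full length of the map.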
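A map encoder looks almost identical to an array encoder. The sketch below assumes that map keys are Avro strings (as the schema part of the Avro specification states) and writes the whole map as one block. The function names are hypothetical.

```python
def encode_long(n: int) -> bytes:
    z = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

def encode_map(mapping: dict, encode_value) -> bytes:
    out = bytearray()
    if mapping:
        out += encode_long(len(mapping))  # one block holding every pair
        for key, value in mapping.items():
            # Each pair is the string key followed by the value,
            # the value being encoded per the map's value schema.
            out += encode_string(key) + encode_value(value)
    out += encode_long(0)  # a zero count terminates the map
    return bytes(out)

# A map of long values with the single entry "a" -> 1.
print(encode_map({"a": 1}, encode_long).hex(" "))  # 02 02 61 02 00
```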

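3.2.2.5 Unions

A union is encoded by first writing a long value indicating the zero-based position, within the union's schema, of the schema of its value. The value is then encoded per the schema at that position within the union.

For example, the union schema ["null","string"] would encode:

null as zero (the index of "null" in the union):

00

the string "a" as one (the index of "string" in the union, encoded as hex 02), followed by the serialized string:

02 02 61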
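The union rule can be illustrated with a short sketch restricted to the ["null","string"] example. The way a branch is picked from the Python value's type is an assumption made for this example; real libraries resolve branches against full schemas.

```python
def encode_long(n: int) -> bytes:
    z = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

def encode_union(branches: list, value) -> bytes:
    """Write the branch index as a long, then the value per that branch."""
    if value is None:
        # null itself is zero bytes, so only the index is written.
        return encode_long(branches.index("null"))
    if isinstance(value, str):
        return encode_long(branches.index("string")) + encode_string(value)
    raise TypeError("this sketch only handles null and string branches")

union = ["null", "string"]
print(encode_union(union, None).hex(" "))  # 00
print(encode_union(union, "a").hex(" "))   # 02 02 61
```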

3.2.2.6 Fixed

Fixed instances are encoded using the number of bytes declared in the schema.

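3.3 JSON Encoding

Except for unions, the JSON encoding is the same as the encoding used for field default values.

The value of a union is encoded in JSON as follows:

If its type is null, it is encoded as a JSON null.

Otherwise, it is encoded as a JSON object with one name/value pair whose name is the type's name and whose value is the recursively encoded value. For Avro's named types (record, fixed, or enum) the user-specified name is used; for other types the type name is used.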

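For example, the union schema ["null","string","Foo"], where Foo is a record name, would encode:

null as null;
the string "a" as {"string": "a"};
a Foo instance as {"Foo": {...}}, where {...} indicates the JSON encoding of a Foo instance.

Note that a schema is still required to correctly process JSON-encoded data. For example, the JSON encoding does not distinguish between int and long, float and double, records and maps, enums and strings, and so on.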
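The JSON rule for unions can be sketched as a tiny wrapper. The function name and the sample Foo field x are hypothetical, and the caller is assumed to have already JSON-encoded the inner value and to know which branch it belongs to.

```python
import json

def json_encode_union(branch_name, encoded_value):
    """Wrap an already JSON-encoded value according to the union rule."""
    if branch_name == "null":
        return None  # becomes the bare JSON literal null
    # One-entry object: the type name (or the user-specified name for
    # record, fixed, and enum) mapped to the encoded value.
    return {branch_name: encoded_value}

print(json.dumps(json_encode_union("null", None)))       # null
print(json.dumps(json_encode_union("string", "a")))      # {"string": "a"}
print(json.dumps(json_encode_union("Foo", {"x": 1})))    # {"Foo": {"x": 1}}
```

The wrapper object is what lets a reader pick the right branch; without it, a bare value such as 1 could match several branches (int, long, float, or double).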

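3.4 Single-object encoding

In some situations a single Avro-serialized object must be stored for a long period of time. A common example is storing Avro records for several weeks in an Apache Kafka topic.

For a period of time after a schema changes, such a persistence system will contain records encoded with different schemas. To support schema evolution, it must therefore be possible to tell which schema was used to encode a given record. In most cases the schema itself is too large to include in the message, so this binary wrapper format supports the use case more efficiently.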

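3.4.1 Single-object encoding specification

A single Avro object is encoded as follows:

A two-byte marker, C3 01, showing that the message is Avro and uses this single-record format (version 1).

The 8-byte little-endian CRC-64-AVRO fingerprint of the object's schema.

The Avro object, encoded using Avro's binary encoding.

Implementations use the 2-byte marker to determine whether a payload is Avro. This check helps avoid an expensive lookup that resolves the schema from its fingerprint when the message is not an encoded Avro payload.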
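The framing above can be sketched in Python as follows. The CRC-64-AVRO routine is a port of the table-driven fingerprint algorithm given in the schema fingerprint section of the Avro specification. The helper names are invented for this example, and the caller is assumed to pass the schema already reduced to its Parsing Canonical Form, which is what the fingerprint is computed over.

```python
import struct

EMPTY = 0xC15D213AA4D7A795  # CRC-64-AVRO seed, also the fingerprint of empty input
MARKER = b"\xC3\x01"        # single-object format, version 1

def _build_table():
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            # -(fp & 1) is 0 or -1, so the mask selects either 0 or EMPTY.
            fp = (fp >> 1) ^ (EMPTY & -(fp & 1))
        table.append(fp)
    return table

FP_TABLE = _build_table()

def fingerprint64(data: bytes) -> int:
    fp = EMPTY
    for byte in data:
        fp = (fp >> 8) ^ FP_TABLE[(fp ^ byte) & 0xFF]
    return fp

def single_object_encode(canonical_schema: str, avro_binary: bytes) -> bytes:
    fp = fingerprint64(canonical_schema.encode("utf-8"))
    # Marker, then the fingerprint as 8 little-endian bytes, then the payload.
    return MARKER + struct.pack("<Q", fp) + avro_binary

def looks_like_avro(message: bytes) -> bool:
    # Cheap check to run before any schema lookup by fingerprint.
    return len(message) >= 10 and message[:2] == MARKER

message = single_object_encode('"long"', b"\x36")  # the long value 27
print(looks_like_avro(message), len(message))      # True 11
```

A consumer would first call a check like looks_like_avro, then read bytes 2 to 9 as the fingerprint, and only then look up the writer's schema to decode the remainder.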
