I'm using binary serialization (BinaryFormatter) as a temporary mechanism to store state information in a file for a relatively complex (game) object structure. The files are coming out much larger than I expect, and my data structure includes recursive references, so I'm wondering whether the BinaryFormatter is actually storing multiple copies of the same objects, whether my basic "number of objects and values I should have" arithmetic is way off-base, or where else the excessive size is coming from.
Searching on Stack Overflow I was able to find the specification for Microsoft's binary remoting format: http://msdn.microsoft.com/en-us/library/cc236844(PROT.10).aspx
What I can't find is any existing viewer that lets you "peek" into the contents of a BinaryFormatter output file: get object counts and total bytes for the different object types in the file, etc.
I feel like this must be my "google-fu" failing me (what little I have). Can anyone help? This must have been done before, right?
UPDATE: I could not find it and got no answers so I put something relatively quick together (link to downloadable project below); I can confirm the BinaryFormatter does not store multiple copies of the same object but it does print quite a lot of metadata to the stream. If you need efficient storage, build your own custom serialization methods.
Because it may be of interest to someone, I decided to write this post about what the binary format of serialized .NET objects looks like and how we can interpret it correctly.
I have based all my research on the .NET Remoting: Binary Format Data Structure specification.
Example class:
To have a working example, I have created a simple class called A which contains 2 properties, one string and one integer value. They are called SomeString and SomeValue.
Class A looks like this:
    [Serializable()]
    public class A
    {
        public string SomeString
        {
            get;
            set;
        }
        public int SomeValue
        {
            get;
            set;
        }
    }
For the serialization I used the BinaryFormatter, of course:
    BinaryFormatter bf = new BinaryFormatter();
    StreamWriter sw = new StreamWriter("test.txt");
    bf.Serialize(sw.BaseStream, new A() { SomeString = "abc", SomeValue = 123 });
    sw.Close();
As can be seen, I passed a new instance of class A containing abc and 123 as values.
Example result data:
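If we look at the serialized result in a hex editor, we see the raw bytes that the rest of this post interprets record by record.
Let us interpret the example result data:
According to the above-mentioned specification (available as a PDF: [MS-NRBF].pdf), every record within the stream is identified by the RecordTypeEnumeration. Section 2.1.2.1 RecordTypeEnumeration states:
"This enumeration identifies the type of the record. Each record (except for MemberPrimitiveUnTyped) starts with a record type enumeration. The size of the enumeration is one BYTE."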
SerializationHeaderRecord:
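So if we look back at the data we got, we can start interpreting the first byte, 00.
As stated in 2.1.2.1 RecordTypeEnumeration, a value of 0 identifies the SerializationHeaderRecord, which is specified in 2.6.1 SerializationHeaderRecord:
"The SerializationHeaderRecord record MUST be the first record in a binary serialization. This record has the major and minor version of the format and the IDs of the top object and the headers."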
It consists of:
- RecordTypeEnum (1 byte)
- RootId (4 bytes)
- HeaderId (4 bytes)
- MajorVersion (4 bytes)
- MinorVersion (4 bytes)
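With that knowledge we can interpret the record containing 17 bytes:
00 01 00 00 00 FF FF FF FF 01 00 00 00 00 00 00 00
00 represents the RecordTypeEnumeration, which is SerializationHeaderRecord in our case.
01 00 00 00 represents the RootId. The specification says about this field:
"If neither the BinaryMethodCall nor BinaryMethodReturn record is present in the serialization stream, the value of this field MUST contain the ObjectId of a Class, Array, or BinaryObjectString record contained in the serialization stream."
So in our case this should be the ObjectId with the value 1 (because the data is serialized using little-endian), which we will hopefully see again ;-)
FF FF FF FF represents the HeaderId.
01 00 00 00 represents the MajorVersion.
00 00 00 00 represents the MinorVersion.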
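The field layout above can be checked mechanically. The following is a minimal sketch in Python (used here only as a convenient scratchpad for poking at bytes, not as part of the original post); the function name parse_header is hypothetical, and the 17 bytes are the ones quoted in the walkthrough.

```python
import struct

# The 17 header bytes quoted in the walkthrough.
HEADER = bytes.fromhex("00 01000000 FFFFFFFF 01000000 00000000")

def parse_header(data: bytes) -> dict:
    """Decode a SerializationHeaderRecord: one type byte plus four little-endian INT32s."""
    # "<" selects little-endian with no padding, "B" is one unsigned byte, "i" a signed INT32.
    record_type, root_id, header_id, major, minor = struct.unpack_from("<Biiii", data, 0)
    if record_type != 0x00:
        raise ValueError(f"not a SerializationHeaderRecord: {record_type:#04x}")
    return {"RootId": root_id, "HeaderId": header_id,
            "MajorVersion": major, "MinorVersion": minor}

print(parse_header(HEADER))
```

Read as a signed INT32, FF FF FF FF comes out as -1, which is how the HeaderId appears in the result.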
BinaryLibrary:
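As specified, each record must begin with the RecordTypeEnumeration. As the last record is complete, we must assume that a new one begins.
Let us interpret the next byte, 0C.
As we can see, in our example the SerializationHeaderRecord is followed by the BinaryLibrary record:
"The BinaryLibrary record associates an INT32 ID (as specified in [MS-DTYP] section 2.2.22) with a Library name. This allows other records to reference the Library name by using the ID. This approach reduces the wire size when there are multiple records that reference the same Library name."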
It consists of:
- RecordTypeEnum (1 byte)
- LibraryId (4 bytes)
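- LibraryName (variable number of bytes, which is a LengthPrefixedString)
As stated in 2.1.1.6 LengthPrefixedString:
"The LengthPrefixedString represents a string value. The string is prefixed by the length of the UTF-8 encoded string in bytes. The length is encoded in a variable-length field with a minimum of 1 byte and a maximum of 5 bytes. To minimize the wire size, length is encoded as a variable-length field."
In our simple example the length is always encoded using 1 byte. With that knowledge we can continue the interpretation of the bytes in the stream: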
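For strings longer than 127 bytes the prefix grows beyond one byte. A minimal Python sketch of the decoding follows, assuming the usual 7-bits-per-byte scheme in which the high bit of each prefix byte signals that another prefix byte follows (check this against section 2.1.1.6 before relying on it); the function name read_length_prefixed_string is hypothetical.

```python
def read_length_prefixed_string(data: bytes, offset: int = 0):
    """Return (text, next_offset) for a LengthPrefixedString starting at offset."""
    length = 0
    for i in range(5):                       # the prefix is at most 5 bytes long
        byte = data[offset]
        offset += 1
        length |= (byte & 0x7F) << (7 * i)   # the low 7 bits carry part of the length
        if (byte & 0x80) == 0:               # high bit clear: this was the last prefix byte
            break
    else:
        raise ValueError("length prefix longer than 5 bytes")
    end = offset + length
    if end > len(data):
        raise ValueError("string runs past the end of the data")
    return data[offset:end].decode("utf-8"), end

# One-byte prefix, as in the example stream: 03 followed by "abc".
print(read_length_prefixed_string(bytes.fromhex("03616263")))
```

With a one-byte prefix the function simply reads that many bytes, which is the case for every string in this example.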
0C represents the RecordTypeEnumeration, which identifies the BinaryLibrary record.
02 00 00 00 represents the LibraryId, which is 2 in our case.
Now the LengthPrefixedString follows:
42 represents the length information of the LengthPrefixedString which contains the LibraryName.
In our case the length information of 42 (decimal 66) tells us that we need to read the next 66 bytes and interpret them as the LibraryName.
As already stated, the string is UTF-8 encoded, so the result of those 66 bytes would be something like: _WorkSpace_, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
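As a cross-check of the numbers above, here is a small Python sketch that rebuilds the BinaryLibrary record from the fields described and parses it again. The names RECORD and parse_binary_library are hypothetical, and the sketch assumes the length fits in a single prefix byte, as it does in this example.

```python
import struct

NAME = "_WorkSpace_, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null"

# RecordTypeEnum 0C, LibraryId 2, a one-byte length prefix, then the UTF-8 name.
RECORD = bytes([0x0C]) + struct.pack("<i", 2) + bytes([len(NAME)]) + NAME.encode("utf-8")

def parse_binary_library(data: bytes) -> dict:
    """Decode a BinaryLibrary record whose name is shorter than 128 bytes."""
    record_type, library_id = struct.unpack_from("<Bi", data, 0)
    if record_type != 0x0C:
        raise ValueError(f"not a BinaryLibrary record: {record_type:#04x}")
    length = data[5]
    if length & 0x80:
        raise ValueError("multi-byte length prefix not handled in this sketch")
    name = data[6:6 + length].decode("utf-8")
    # 1 type byte + 4 id bytes + 1 prefix byte + the name itself
    return {"LibraryId": library_id, "LibraryName": name, "Size": 6 + length}

print(parse_binary_library(RECORD))
```

The library name is exactly 66 characters long, which matches the 42 (hex) length byte, and the whole record occupies 72 bytes.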
ClassWithMembersAndTypes:
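Again, the record is complete, so we interpret the RecordTypeEnumeration of the next one:
05 identifies a ClassWithMembersAndTypes record. Section 2.3.2.1 ClassWithMembersAndTypes states:
"The ClassWithMembersAndTypes record is the most verbose of the Class records. It contains metadata about Members, including the names and Remoting Types of the Members. It also contains a Library ID that references the Library Name of the Class."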
It consists of:
- RecordTypeEnum (1 byte)
- ClassInfo (variable number of bytes)
- MemberTypeInfo (variable number of bytes)
- LibraryId (4 bytes)
ClassInfo:
As stated in 2.3.1.1 ClassInfo, the record consists of:
- ObjectId (4 bytes)
- Name (variable number of bytes, which is again a LengthPrefixedString)
- MemberCount (4 bytes)
- MemberNames (a sequence of LengthPrefixedStrings, where the number of items MUST be equal to the value specified in the MemberCount field)
Back to the raw data, step by step:
01 00 00 00 represents the ObjectId. We've already seen this one; it was specified as the RootId in the SerializationHeaderRecord.
0F 53 74 61 63 6B 4F 76 65 72 46 6C 6F 77 2E 41 represents the Name of the class, which is represented using a LengthPrefixedString. As mentioned, in our example the length of the string is defined with 1 byte, so the first byte 0F specifies that 15 bytes must be read and decoded using UTF-8. The result looks something like this: StackOverFlow.A, so obviously I used StackOverFlow as the name of the namespace.
02 00 00 00 represents the MemberCount. It tells us that 2 members, both represented as LengthPrefixedStrings, will follow.
Name of the first member:
1B 3C 53 6F 6D 65 53 74 72 69 6E 67 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64 represents the first MemberName. 1B is again the length of the string, which is 27 bytes, and results in something like this: <SomeString>k__BackingField.
Name of the second member:
1A 3C 53 6F 6D 65 56 61 6C 75 65 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64 represents the second MemberName. 1A specifies that the string is 26 bytes long. It results in something like this: <SomeValue>k__BackingField.
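The ClassInfo bytes quoted above can be decoded in a few lines. This Python sketch uses the hex values from the walkthrough verbatim; read_string and parse_class_info are hypothetical helper names, and read_string only handles the one-byte length prefixes that occur in this example.

```python
import struct

CLASS_INFO = bytes.fromhex(
    "01 00 00 00 "                                       # ObjectId
    "0F 53 74 61 63 6B 4F 76 65 72 46 6C 6F 77 2E 41 "   # Name
    "02 00 00 00 "                                       # MemberCount
    "1B 3C 53 6F 6D 65 53 74 72 69 6E 67 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64 "
    "1A 3C 53 6F 6D 65 56 61 6C 75 65 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64"
)

def read_string(data: bytes, offset: int):
    """Read a LengthPrefixedString whose length fits in one prefix byte."""
    length = data[offset]
    if length & 0x80:
        raise ValueError("multi-byte length prefix not handled in this sketch")
    start = offset + 1
    return data[start:start + length].decode("utf-8"), start + length

def parse_class_info(data: bytes, offset: int = 0):
    """Return (fields, next_offset) for a ClassInfo structure."""
    (object_id,) = struct.unpack_from("<i", data, offset)
    name, offset = read_string(data, offset + 4)
    (member_count,) = struct.unpack_from("<i", data, offset)
    offset += 4
    members = []
    for _ in range(member_count):            # exactly MemberCount names follow
        member, offset = read_string(data, offset)
        members.append(member)
    return {"ObjectId": object_id, "Name": name, "MemberNames": members}, offset

info, end = parse_class_info(CLASS_INFO)
print(info, end)
```

The returned offset points at the first byte after the last member name, which is where the MemberTypeInfo begins in the real stream.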
MemberTypeInfo:
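After the ClassInfo the MemberTypeInfo follows.
Section 2.3.1.2 MemberTypeInfo states that the structure contains:
- BinaryTypeEnums (variable in length): a sequence of BinaryTypeEnumeration values that represents the Member Types that are being transferred. The array MUST have the same number of items as the MemberNames field of the ClassInfo structure, and MUST be ordered such that each BinaryTypeEnumeration corresponds to the Member name in the MemberNames field of the ClassInfo structure.
- AdditionalInfos (variable in length): depending on the BinaryTypeEnum, additional info may or may not be present.
| BinaryTypeEnum | AdditionalInfos |
| Primitive | PrimitiveTypeEnumeration |
| String | None |
So taking that into consideration we are almost there. We expect 2 BinaryTypeEnumeration values (because we had 2 members in the MemberNames).
Again, back to the raw data of the complete MemberTypeInfo record:
01 00 08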
01 represents the BinaryTypeEnumeration of the first member. According to 2.1.2.2 BinaryTypeEnumeration we can expect a String, and it is represented using a LengthPrefixedString.
00 represents the BinaryTypeEnumeration of the second member, and again, according to the specification, it is a Primitive. As stated above, Primitives are followed by additional information, in this case a PrimitiveTypeEnumeration. That's why we need to read the next byte, which is 08, match it with the table in 2.1.2.3 PrimitiveTypeEnumeration, and find that we can expect an Int32, which is represented by 4 bytes, as stated in the separate documentation of the basic data types.
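One detail that is easy to miss is that all BinaryTypeEnumeration bytes come first and the additional infos follow afterwards, which is why the stream reads 01 00 08 rather than 01 followed by 00 08 as a pair. A minimal Python sketch of that two-pass read follows; the lookup tables only contain the values that occur in this example, and parse_member_type_info is a hypothetical name.

```python
# Only the enumeration values that appear in this walkthrough.
BINARY_TYPES = {0x00: "Primitive", 0x01: "String"}
PRIMITIVE_TYPES = {0x08: "Int32"}

def parse_member_type_info(data: bytes, member_count: int, offset: int = 0):
    """Return ([(binary_type, additional_info), ...], next_offset)."""
    # First pass: one BinaryTypeEnumeration byte per member.
    kinds = [BINARY_TYPES[b] for b in data[offset:offset + member_count]]
    offset += member_count
    # Second pass: additional info only for the kinds that require it.
    result = []
    for kind in kinds:
        if kind == "Primitive":
            result.append((kind, PRIMITIVE_TYPES[data[offset]]))
            offset += 1
        else:                                # String carries no additional info
            result.append((kind, None))
    return result, offset

print(parse_member_type_info(bytes.fromhex("01 00 08"), member_count=2))
```

A fuller reader would need entries for the remaining enumeration values from sections 2.1.2.2 and 2.1.2.3; unknown values raise a KeyError here.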
LibraryId:
After the MemberTypeInfo the LibraryId follows. It is represented by 4 bytes:
02 00 00 00 represents the LibraryId, which is 2.
The values:
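As specified in 2.3 Class Records:
"The values of the Members of the Class MUST be serialized as records that follow this record, as specified in section 2.7. The order of the records MUST match the order of MemberNames as specified in the ClassInfo (section 2.3.1.1) structure."
That's why we can now expect the values of the members.
Let us look at the last few bytes: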
06 identifies a BinaryObjectString. It represents the value of our SomeString property (the <SomeString>k__BackingField, to be exact).
According to 2.5.7 BinaryObjectString it contains:
- RecordTypeEnum (1 byte)
- ObjectId (4 bytes)
- Value (variable length, represented as a LengthPrefixedString)
So knowing that, we can clearly identify that
03 00 00 00 represents the ObjectId.
03 61 62 63 represents the Value, where 03 is the length of the string itself and 61 62 63 are the content bytes that translate to abc.
Hopefully you can remember that there was a second member, an Int32. Knowing that the Int32 is represented using 4 bytes, we can conclude that 7B 00 00 00 must be the Value of our second member. 7B hexadecimal equals 123 decimal, which fits our example code.
That completes the ClassWithMembersAndTypes record together with the values of its members.
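The two member values can be read back with the same tools. In this Python sketch the bytes are the ones identified above (the BinaryObjectString record followed by the raw Int32), and parse_values is a hypothetical name; note that the Int32 has no record type byte of its own, because its type was already declared in the MemberTypeInfo.

```python
import struct

VALUES = bytes.fromhex("06 03 00 00 00 03 61 62 63 7B 00 00 00")

def parse_values(data: bytes):
    """Return (string_object_id, some_string, some_value) for the example class."""
    record_type, object_id = struct.unpack_from("<Bi", data, 0)
    if record_type != 0x06:
        raise ValueError(f"not a BinaryObjectString record: {record_type:#04x}")
    length = data[5]                          # one-byte length prefix in this example
    text = data[6:6 + length].decode("utf-8")
    # The Int32 value follows directly, little-endian, without a record header.
    (number,) = struct.unpack_from("<i", data, 6 + length)
    return object_id, text, number

print(parse_values(VALUES))
```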
MessageEnd:
Finally, the last byte 0B represents the MessageEnd record.
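To relate this back to the original question about file size, the records can be rebuilt from the fields described in this post and their sizes added up. This is a Python sketch under the assumption that the walkthrough covers every byte of the example stream; the helper lps and the record labels are hypothetical names chosen for illustration.

```python
import struct

LIBRARY = "_WorkSpace_, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null"

def lps(text: str) -> bytes:
    """Encode a LengthPrefixedString; this sketch only handles lengths below 128."""
    raw = text.encode("utf-8")
    if len(raw) > 0x7F:
        raise ValueError("multi-byte length prefix not handled in this sketch")
    return bytes([len(raw)]) + raw

# Each record of the example stream, rebuilt from the fields described in the post.
records = [
    ("SerializationHeaderRecord", b"\x00" + struct.pack("<iiii", 1, -1, 1, 0)),
    ("BinaryLibrary", b"\x0C" + struct.pack("<i", 2) + lps(LIBRARY)),
    ("ClassWithMembersAndTypes",
     b"\x05" + struct.pack("<i", 1) + lps("StackOverFlow.A")
     + struct.pack("<i", 2)                       # MemberCount
     + lps("<SomeString>k__BackingField") + lps("<SomeValue>k__BackingField")
     + bytes([0x01, 0x00, 0x08])                  # String, Primitive, then Int32
     + struct.pack("<i", 2)),                     # LibraryId
    ("BinaryObjectString", b"\x06" + struct.pack("<i", 3) + lps("abc")),
    ("Int32 member value", struct.pack("<i", 123)),
    ("MessageEnd", b"\x0B"),
]

sizes = {name: len(data) for name, data in records}
total = sum(sizes.values())
payload = len("abc".encode("utf-8")) + 4          # the string contents plus the Int32

for name, size in sizes.items():
    print(f"{name:26} {size:4} bytes")
print(f"total {total} bytes, of which {payload} are the actual member values")
```

Under that assumption the stream comes to 190 bytes for 7 bytes of member values, which illustrates the observation in the question's update that the BinaryFormatter writes a lot of metadata. The library and class metadata appear once per stream, so the ratio improves as more objects of the same type are serialized together.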