I want to load a YAML file, possibly edit the data, and then dump it again. How can I preserve formatting?

Problem description

This question tries to collect information spread over questions about different languages and YAML implementations in a mostly language-agnostic manner.

Suppose I have a YAML file like this:

first:
  - foo: {a: "b"}
  - "bar": [1, 2, 3]
second: |   # some comment
  some long block scalar value

I want to load this file into a native data structure, possibly change or add some values, and dump it again. However, when I dump it, the original formatting is not preserved:

  • The scalars are formatted differently, e.g. "b" loses its quotation marks, the value of second is not a literal block scalar anymore, etc.
  • The collections are formatted differently, e.g. the mapping value of foo is written in block style instead of the given flow style, similarly the sequence value of "bar" is written in block style
  • The order of mapping keys (e.g. first/second) changes
  • The comment is gone
  • The indentation level differs, e.g. the items in first are not indented anymore.

How can I preserve the formatting of the original file?

Solution

Preface: Throughout this answer, I mention some popular YAML implementations. Those mentions are never exhaustive since I do not know all YAML implementations out there.

I will use YAML terms for data structures: Atomic text content (even numbers) is a scalar. Item sequences, known elsewhere as arrays or lists, are sequences. A collection of key-value pairs, known elsewhere as dictionary or hash, is a mapping.

If you are using Python, consider using ruamel (possibly switching from PyYAML) since it implements round-tripping up to native structures and so much of this answer does not apply to it.

Background

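The process of loading YAML is also a process of losing information. Let's have a look at the process of loading/dumping YAML, as given in the spec, which chains four stages that are traversed left to right when dumping and right to left when loading:

Native (Data Structure) <-> Representation (Node Graph) <-> Serialization (Event Tree) <-> Presentation (Character Stream)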

When you are loading a YAML file, you are executing some or all of the steps in the Load direction, starting at the Presentation (Character Stream). YAML implementations usually promote their most high-level APIs, which load the YAML file all the way to Native (Data Structure). This is true for most common YAML implementations, e.g. PyYAML/ruamel, SnakeYAML, go-yaml, and Ruby's YAML module. Other implementations, such as libyaml and yaml-cpp, only provide deserialization up to the Representation (Node Graph) due to restrictions of their implementation languages.

The important information for us is what is contained in those boxes. Each box mentions information which is no longer available in the box to its left. So this means that styles and comments, according to the YAML specification, are only present in the actual YAML file content, but are discarded as soon as the YAML file is parsed. For you, this means that once you have loaded a YAML file to a native data structure, all information about how it originally looked in the input file is gone. Consequently, when you dump the data, the YAML implementation chooses a representation it deems useful for your data. Some implementations let you give general hints/options, e.g. that all scalars should be quoted, but that doesn't help you restore the original formatting.
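To make that loss concrete, here is a minimal sketch using PyYAML (one of the implementations mentioned above; that it is installed is an assumption). It loads the example file from the question into native structures and dumps it again with default options. The variable names are only illustrative.

```python
import yaml

# The example file from the question.
SOURCE = """\
first:
  - foo: {a: "b"}
  - "bar": [1, 2, 3]
second: |   # some comment
  some long block scalar value
"""

# Load all the way to native Python structures (dict / list / str / int) ...
data = yaml.safe_load(SOURCE)

# ... and dump again. The dumper only sees the native data, so it has to
# pick styles on its own; the comment and the original styles are gone.
dumped = yaml.safe_dump(data)
print(dumped)
```

The data itself survives the round trip, but the comment, the double quotes and the literal block scalar do not.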

Thankfully, this diagram only describes the logical process of loading YAML; a conforming YAML implementation does not need to slavishly conform to it. Most implementations actually preserve data longer than they need to. This is true for PyYAML/ruamel, SnakeYAML, go-yaml, yaml-cpp, libyaml and others. In all these implementations, the style of scalars, sequences and mappings is remembered up until the Representation (Node Graph) level.

On the other hand, comments are discarded rather quickly since they do not belong to an event or node (the exception here is ruamel, which links comments to the following event). Some YAML implementations (libyaml, SnakeYAML) provide access to a token stream which is even more low-level than the Event Tree. This token stream does contain comments; however, it is only usable for things like syntax highlighting, since the APIs do not contain methods for consuming the token stream again.

So what to do?

Loading & Dumping

If you only need to load your YAML file and then dump it again, use one of the lower-level APIs of your implementation to load the YAML only up to the Representation (Node Graph) or Serialization (Event Tree) level. The API functions to search for are compose/parse and serialize/present respectively.
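As a sketch of what this looks like in practice, PyYAML exposes the event level through yaml.parse and yaml.emit. The snippet below assumes PyYAML is installed and simply pipes the events of the example file back into the emitter without touching them.

```python
import yaml

SOURCE = """\
first:
  - foo: {a: "b"}
  - "bar": [1, 2, 3]
second: |   # some comment
  some long block scalar value
"""

# yaml.parse yields events (Serialization level); yaml.emit presents them
# as a character stream again. No native data structure is ever built.
out = yaml.emit(yaml.parse(SOURCE))
print(out)
```

Scalar and collection styles come out as they went in, while the comment and the original indentation of the sequence items are not part of any event and are therefore not reproduced.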

It is preferable to use the Event Tree instead of the Node Graph as some implementations already forget the original order of mapping keys (due to internally using hashmaps) when composing. This question, for example, details loading / dumping events with SnakeYAML.

Since comments are scrapped early on, you don't have much of a possibility to preserve those, unless you want to fork an existing YAML implementation and patch it to preserve comments (like ruamel did with PyYAML). go-yaml (v3) associates comments with nodes in the node graph, so you have the opportunity to access and preserve them there to some degree.

Also note that keeping style is not perfect and cannot really be. For example, take this scalar:

"1 \x2B 1"

This loads as the string "1 + 1" after resolving the escape sequence. Even in the event stream, the information about the escape sequence has already been lost in all implementations I know. The event only remembers that it was a double-quoted scalar, so writing it back will result in:

"1 + 1"

Similarly, a folded block scalar (starting with >) will usually not remember where line breaks in the original input have been folded into space characters.
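The escape-sequence case can be checked at the event level with PyYAML (assumed to be installed; other implementations may differ in detail). The sketch parses the scalar above into events, inspects the scalar event and emits the events again.

```python
import yaml

# A raw string so that Python itself does not interpret the \x2B escape.
original = r'"1 \x2B 1"' + "\n"

events = list(yaml.parse(original))

# The single scalar event already holds the resolved content and only
# remembers that the scalar was double-quoted.
scalar = next(e for e in events if isinstance(e, yaml.ScalarEvent))
print(repr(scalar.value), repr(scalar.style))

round_tripped = yaml.emit(events)
print(round_tripped)
```

The quoting style is kept, but the emitter has no way of knowing that the plus sign was originally written as an escape sequence.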

So, to sum up, loading to the Event Tree and dumping again will usually preserve:

  • Style: unquoted/quoted/block scalars, flow/block collections (sequences & mappings)
  • Order of keys in mappings
  • YAML tags

You will usually lose:

  • Information about escape sequences and line breaks in flow scalars
  • Indentation and non-content spacing
  • Comments

If you use the Node Graph instead of the Event Tree, you might additionally lose the key order in mappings. Some APIs, like go-yaml, don't provide access to the Event Tree, so you have no choice but to use the Node Graph instead.

Modifying Data

If you want to modify data and still preserve what you can of the original formatting, you need to manipulate your data without loading it to a native structure. This usually means that you operate on scalars, sequences and mappings rather than on the strings, numbers, lists or whatever other structures the target programming language provides.

You have the option to either process the Event Tree or the Node Graph (assuming your API gives you access to it). Which one is better usually depends on what you want to do:

  • The Event Tree is usually provided as a stream of events. It may be better for large data since you do not need to load the complete data in memory; instead you inspect each event, track your position in the input structure, and place your modifications accordingly. The answer to this question shows how to append items giving a path and a value to a given YAML file with PyYAML's event API.
  • The Node Graph is better for highly structured data, and also if you use anchors and aliases in your YAML because they are resolved there. Unlike with events, where you need to track the current position yourself, the data is presented as a complete graph here, and you can just descend into the relevant sections (with events, you possibly need to pipe through large substructures you are not interested in at all).
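A minimal sketch of the Node Graph approach, assuming PyYAML: yaml.compose builds the node graph, the helper child (a hypothetical name introduced only for this example) descends into a mapping by key, and yaml.serialize writes the modified graph back out.

```python
import yaml

SOURCE = """\
first:
  - foo: {a: "b"}
  - "bar": [1, 2, 3]
second: |   # some comment
  some long block scalar value
"""


def child(mapping_node, key):
    """Return the value node stored under a scalar key of a MappingNode."""
    for key_node, value_node in mapping_node.value:
        if isinstance(key_node, yaml.ScalarNode) and key_node.value == key:
            return value_node
    raise KeyError(key)


root = yaml.compose(SOURCE)            # MappingNode for the whole document
first = child(root, "first")           # SequenceNode
foo = child(first.value[0], "foo")     # MappingNode written in flow style

# Change only the content of the existing node; its double-quoted style
# and the flow style of the surrounding mapping stay attached to the nodes.
child(foo, "a").value = "c"

result = yaml.serialize(root)
print(result)
```

Because only the scalar's content is replaced, the rest of the graph, including the styles recorded on each node, is serialized as it was composed.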

In any case, you need to know a bit about YAML type resolution to work with the given data correctly. When you load a YAML file into a declared native structure (typical in languages with a static type system, e.g. Java or Go), the YAML processor will map the YAML structure to it if possible. However, if no target type is given (typical in scripting languages like Python or Ruby, but also possible in Java), types are deduced from node content and style.

Since we are not working with native loading because we need to preserve formatting information, this type resolution will not be executed. However, you need to know how it works in two cases:

  • When you need to decide on the type of a scalar node or event, e.g. you have a scalar with content 42 and need to know whether that is a string or integer.
  • When you need to create a new event or node that should later be loaded as a specific type. E.g. if you append the string "42", you must ensure that it is not loaded as integer 42 later.

I won't discuss all the details here; in most cases, it suffices to know that if a string is encoded as a scalar but looks like something else (e.g. a number), you should use a quoted scalar.
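One way to apply this rule, sketched with PyYAML (assumed installed): ask the implementation's own resolver what a plain scalar with the given content would be loaded as, and pick a quoted style if the answer is not a string. The helpers needs_quotes and string_node are hypothetical names for this example, and the result reflects PyYAML's resolution rules only.

```python
import yaml
from yaml.resolver import Resolver

STR_TAG = "tag:yaml.org,2002:str"
_resolver = Resolver()


def needs_quotes(text):
    """True if a plain (unquoted) scalar with this content would not load as a string."""
    # implicit=(True, False) asks: "what would this be if written plain?"
    return _resolver.resolve(yaml.ScalarNode, text, (True, False)) != STR_TAG


def string_node(text):
    """Create a scalar node that is loaded back as a string."""
    style = '"' if needs_quotes(text) else None
    return yaml.ScalarNode(STR_TAG, text, style=style)


print(yaml.serialize(string_node("42")))     # written with quotes
print(yaml.serialize(string_node("hello")))  # written as a plain scalar
```

Going the other way (deciding whether an existing scalar node with content 42 is a string or an integer) amounts to looking at the node's style and tag rather than only at its content.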

Depending on your implementation, you may come into contact with YAML tags. Seldom used in YAML files (they look like e.g. !!str, !!map, !!int and so on), they contain type information about a node which can be used in collections with heterogeneous data. More importantly, YAML defines that all nodes without an explicit tag will be assigned one as part of type resolution. This may or may not have already happened at the Node Graph level. So in your node data, you may see a node's tag even when the original node does not have one.

Tags starting with two exclamation marks are actually shorthands, e.g. !!str is a shorthand for tag:yaml.org,2002:str. You may see either in your data, since implementations handle them quite differently.
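As an illustration, PyYAML (assumed installed) is one implementation where tags are already resolved at the Node Graph level and are stored in their long form. The helper expand is a hypothetical name and assumes the default meaning of the !! handle, i.e. no %TAG directive redefining it.

```python
import yaml

YAML_PREFIX = "tag:yaml.org,2002:"


def expand(tag):
    """Expand a '!!name' shorthand to its full form (default '!!' handle assumed)."""
    if tag.startswith("!!"):
        return YAML_PREFIX + tag[2:]
    return tag


explicit = yaml.compose("!!str 42")       # tag written in the file
implicit_plain = yaml.compose("42")       # no tag in the file, plain scalar
implicit_quoted = yaml.compose('"42"')    # no tag in the file, quoted scalar

for node in (explicit, implicit_plain, implicit_quoted):
    print(node.tag, repr(node.value), repr(node.style))
```

All three nodes carry a tag after composing, although only the first one had an explicit tag in the input.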

What matters for you is that when you create a node or event, you may be able to, and may also need to, assign a tag. If you don't want the output to contain an explicit tag, use the non-specific tags ! for non-plain scalars and ? for everything else on event level. On node level, consult your implementation's documentation about whether you need to supply resolved tags. If not, the same rule for the non-specific tags applies. If the documentation does not mention it (few do), try it out.
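How this looks depends heavily on the implementation. As far as I understand PyYAML (assumed installed), its events do not take ! or ? literally; instead a scalar event carries no tag plus an implicit pair that says whether the tag may be omitted when the scalar is written plain or non-plain, which plays the same role. The following sketch builds a small event stream by hand under that assumption.

```python
import yaml

events = [
    yaml.StreamStartEvent(),
    yaml.DocumentStartEvent(explicit=False),
    yaml.SequenceStartEvent(anchor=None, tag=None, implicit=True, flow_style=False),
    # Plain scalar without an explicit tag (the "?" case).
    yaml.ScalarEvent(anchor=None, tag=None, implicit=(True, False), value="plain"),
    # Non-plain scalar without an explicit tag (the "!" case); the quotes
    # make sure the content 42 is loaded as a string later.
    yaml.ScalarEvent(anchor=None, tag=None, implicit=(False, True), value="42", style='"'),
    yaml.SequenceEndEvent(),
    yaml.DocumentEndEvent(explicit=False),
    yaml.StreamEndEvent(),
]

text = yaml.emit(events)
print(text)
```

Neither scalar is written with an explicit tag, and loading the result yields two strings.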

So to sum up: You modify data by loading either the Event Tree or the Node Graph, you add, delete or modify events or nodes in the data you get, and then you present the modified data as YAML again. Depending on what you want to do, it may help you to create the data you want to add to your YAML file as a native structure, serialize it to YAML and then load it again as a Node Graph or Event Tree. From there, you can include it in the structure of the YAML file you want to modify.
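The last suggestion can be sketched with PyYAML (assumed installed) as follows: the new item is written as a native Python structure, dumped to YAML, composed into a node graph of its own and then appended to the sequence under first in the original document.

```python
import yaml

SOURCE = """\
first:
  - foo: {a: "b"}
  - "bar": [1, 2, 3]
second: |   # some comment
  some long block scalar value
"""

root = yaml.compose(SOURCE)

# Native data -> YAML text -> node graph for the part to be added.
new_item = {"baz": [4, 5]}
new_node = yaml.compose(yaml.safe_dump(new_item))

# Splice the new node into the existing graph.
for key_node, value_node in root.value:
    if key_node.value == "first":
        value_node.value.append(new_node)

result = yaml.serialize(root)
print(result)
```

The existing nodes keep their recorded styles, while the style of the appended part is whatever the dumper chose for it.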

Conclusion / TL;DR

YAML has not been designed for this task. In fact, it has been defined as a serialization language, assuming that your data is authored as native data structures in some programming language and from there dumped to YAML. However, in reality, YAML is used a lot for configuration, meaning that you typically write YAML by hand and then load it into native data structures.

This contrast is the reason why it is so difficult to modify YAML files while preserving formatting: The YAML format has been designed as a transient data format, to be written by one application, and then to be loaded by another (or the same) application. In that process, preserving formatting does not matter. It does matter, however, for data that is checked in to version control (you want your diff to only contain the line(s) with data you actually changed), and in other situations where you write your YAML by hand, because you want to keep style consistent.

There is no perfect solution for changing exactly one data item in a given YAML file and leaving everything else intact. Loading a YAML file does not give you a view of the YAML file, it gives you the content it describes. Therefore, everything that is not part of the described content – most importantly, comments and whitespace – is extremely hard to preserve.

If format preservation is important to you and you can't live with the compromises made by the suggestions in this answer, YAML is not the right tool for you.

