问题描述
我一直在查看罗斯林CTP ,尽管它解决了与表达式树API 都是不可变的,但罗斯林却以完全不同的方式做到了:
-
Expression
节点没有对父节点的引用,而是使用ExpressionVisitor
进行了修改,这就是可以重用大部分部件的原因. 另一方面, -
Roslyn的
SyntaxNode
引用了其父节点,因此所有节点实际上都是无法重复使用的块.提供了诸如Update
,ReplaceNode
等方法来进行修改.
这在哪里结束? Document
? Project
? ISolution
?该API促进了树的逐步更改(而不是按下按钮),但是是否每个步骤都进行了完整复制?
为什么他们会做出这样的选择?我缺少一些有趣的把戏吗?
更新:此问题为 2012年6月8日我的博客主题.谢谢你提出的好问题!
很好的问题.我们对您提出的问题进行了很长时间的辩论.
我们希望有一个具有以下特征的数据结构:
- 不可变.
- 一棵树的形式.
- 便宜地从子节点访问父节点.
- 可能从树中的节点映射到文本中的字符偏移量.
- 持久.
通过持久性是指对文本缓冲区进行编辑时重用树中大多数现有节点的能力.由于节点是不可变的,因此重用它们没有障碍.我们需要这样做来提高性能;每次您敲键时,我们都无法重新解析文件的巨大杂物.我们只需要重新词法和重新分析仅受编辑影响的树的部分.
现在,当您尝试将所有这五种内容放到一个数据结构中时,您会立即遇到问题:
- 首先如何构建节点?父母和孩子都互相指称,并且是不可变的,那么哪个首先被构建?
- 假设您设法解决了这个问题:如何使其持久化?您不能在其他父节点中重复使用子节点,因为这将告诉子节点其具有新的父节点.但是孩子是一成不变的.
- 假设您设法解决了这个问题:当在编辑缓冲区中插入新字符时,映射到该点之后某个位置的每个节点的绝对位置都会发生变化.这使得建立持久的数据结构非常困难,因为任何编辑都可以更改大多数节点的跨度!
但是在罗斯林团队中,我们通常会做不可能的事情.实际上,我们通过保持两个树来解析不可能的事. 绿色"标志指的是绿色"标志.树是不可变的,持久的,没有父代引用,是自下而上"构建的,每个节点都跟踪其 width 而不是其绝对位置.当进行编辑时,我们仅重建受该编辑影响的绿树部分,通常约占该树中所有解析节点的O(log n).
红色"树是围绕绿色树构建的不变的 facade ;它是自上而下"构建的按需,并在每次编辑时都将其丢弃.当您从顶部向下穿过树时,它会根据需要按需制造父引用来计算它们.当您下降时,它还会根据宽度计算绝对位置,从而制造出绝对位置.
您,用户,只见过红色的树;绿树是一个实现细节.如果您查看解析节点的内部状态,您实际上会发现其中有一个引用另一个解析节点的类型.那是绿树节点.
顺便提及,这些被称为红/绿树".因为这些是我们在设计会议中用于绘制数据结构的白板笔颜色.颜色没有其他含义.这种策略的好处是我们获得了所有这些伟大的东西:不变性,持久性,父引用等等.代价是该系统很复杂,并且如果红色"被删除,则会消耗大量的内存.外墙变大.目前,我们正在做实验,看看是否可以在不损失收益的情况下降低部分成本.
I've been taking a look to Roslyn CTP and, while it solves a similar problem to the Expression tree API, both are immutable but Roslyn does so in a quite different way:
Expression
nodes have no reference to the parent node, are modified using aExpressionVisitor
and that's why big parts can be reused.Roslyn's
SyntaxNode
, on the other side, has a reference to its parent, so all the nodes effectively become a block that's impossible to re-use. Methods likeUpdate
,ReplaceNode
, etc, are provided to make modifications.
Where does this end? Document
? Project
? ISolution
? The API promotes a step-by-step change of the tree (instead of a button up), but does each step makes a full copy?
Why they did they make such a choice? Is there some interesting trick I'm missing?
UPDATE: This question was the subject of my blog on June 8th, 2012. Thanks for the great question!
Great question. We debated the issues you raise for a long, long time.
We would like to have a data structure that has the following characteristics:
- Immutable.
- The form of a tree.
- Cheap access to parent nodes from child nodes.
- Possible to map from a node in the tree to a character offset in the text.
- Persistent.
By persistence I mean the ability to reuse most of the existing nodes in the tree when an edit is made to the text buffer. Since the nodes are immutable, there's no barrier to reusing them. We need this for performance; we cannot be re-parsing huge wodges of the file every time you hit a key. We need to re-lex and re-parse only the portions of the tree that were affected by the edit.
Now when you try to put all five of those things into one data structure you immediately run into problems:
- How do you build a node in the first place? The parent and the child both refer to each other, and are immutable, so which one gets built first?
- Supposing you manage to solve that problem: how do you make it persistent? You cannot re-use a child node in a different parent because that would involve telling the child that it has a new parent. But the child is immutable.
- Supposing you manage to solve that problem: when you insert a new character into the edit buffer, the absolute position of every node that is mapped to a position after that point changes. This makes it very difficult to make a persistent data structure, because any edit can change the spans of most of the nodes!
But on the Roslyn team we routinely do impossible things. We actually do the impossible by keeping two parse trees. The "green" tree is immutable, persistent, has no parent references, is built "bottom-up", and every node tracks its width but not its absolute position. When an edit happens we rebuild only the portions of the green tree that were affected by the edit, which is typically about O(log n) of the total parse nodes in the tree.
The "red" tree is an immutable facade that is built around the green tree; it is built "top-down" on demand and thrown away on every edit. It computes parent references by manufacturing them on demand as you descend through the tree from the top. It manufactures absolute positions by computing them from the widths, again, as you descend.
You, the user, only ever see the red tree; the green tree is an implementation detail. If you peer into the internal state of a parse node you'll in fact see that there is a reference to another parse node in there of a different type; that's the green tree node.
Incidentally, these are called "red/green trees" because those were the whiteboard marker colours we used to draw the data structure in the design meeting. There's no other meaning to the colours.
The benefit of this strategy is that we get all those great things: immutability, persistence, parent references, and so on. The cost is that this system is complex and can consume a lot of memory if the "red" facades get large. We are at present doing experiments to see if we can reduce some of the costs without losing the benefits.
这篇关于Roslyn SyntaxNodes是否可以重用?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!