python - Python 3: How to convert plain HTML into a nested dictionary based on the level of the h tags?

I have HTML that looks like this:

<h1>Sanctuary Verses</h1>
    <h2>Purpose and Importance of the Sanctuary</h2>
       <p>Ps 73:17\nUntil I went into the sanctuary of God; [then] understood I their end.</p>
       <p>...</p>
    <h2>Some other title</h2>
        <p>...</p>
         <h3>sub-sub-title</h3>
             <p>sub-sub-content</p>
    <h2>Some different title</h2>
        <p>...</p>...

There are no div or section tags grouping the p tags. That works fine for display purposes, but I want to extract the data into a structured form.

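Requirements for the output:

h tags should appear as titles, nested according to their level
p tags should be added to the contents under the title given by the preceding h tag

Desired output: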

{
  "title": "Sanctuary Verses",
  "contents": [
    {"title": "Purpose and Importance of the Sanctuary",
     "contents": ["Ps 73:17\nUntil I went into the sanctuary of God; [then] understood I their end.",
                  "...."
                 ]
    },
    {"title": "Some other title",
     "contents": ["...",
                  {"title": "sub-sub-title",
                   "contents": ["sub-sub-content"]
                  }
                 ]
    },
    {"title": "Some different title",
     "contents": ["...", "..."]
    }
  ]
}

I have written some workaround code that gets me the desired output. I would like to know the simplest way to get there.

Best answer

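This is something of a stack problem or graph problem. Let's call the structure a tree (or a document, etc.).

I think your initial tuples could be improved: (text, depth, type).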
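One way to produce such tuples straight from the HTML is sketched below with the standard library's html.parser. The names HeadingTupleParser and html_to_tuples, and the "h"/"p" values used for the type field, are assumptions made for illustration; they are not taken from the question's code.

```python
from html.parser import HTMLParser


class HeadingTupleParser(HTMLParser):
    """Collects (text, depth, kind) tuples from h1-h6 and p tags."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.tuples = []
        self._open_tag = None  # tag whose text is currently being collected
        self._chunks = []
        self._depth = 0        # level of the most recent heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS or tag == "p":
            self._open_tag = tag
            self._chunks = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag != self._open_tag:
            return  # inline tags such as <b> inside a paragraph are ignored
        text = "".join(self._chunks).strip()
        if tag == "p":
            # Assumption: a paragraph takes the depth of the heading before it.
            self.tuples.append((text, self._depth, "p"))
        else:
            self._depth = int(tag[1])
            self.tuples.append((text, self._depth, "h"))
        self._open_tag = None


def html_to_tuples(html):
    parser = HeadingTupleParser()
    parser.feed(html)
    parser.close()
    return parser.tuples


html = "<h1>Sanctuary Verses</h1><h2>Some other title</h2><p>...</p>"
list_of_tuples = html_to_tuples(html)
```

Each paragraph carries the depth of the most recent heading, which is the shape the loop over list_of_tuples relies on.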

# list_of_tuples holds (text, depth, type) tuples in document order.
# Assumption: type is "p" for paragraphs and anything else for headings,
# and a paragraph has the same depth as the heading it belongs to.
stack = []
depth = 0
root = {"title": "root", "contents": []}
current = root
for text, item_depth, kind in list_of_tuples:
    if item_depth > depth:
        # Deeper: current stays open on the stack and a child is started.
        child = {"title": text, "contents": []}
        current["contents"].append(child)
        stack.append(current)
        current = child
        depth = item_depth
    elif item_depth < depth:
        # Shallower: close elements until the new heading's parent is on top.
        while depth > item_depth:
            stack.pop()
            depth -= 1
        current = {"title": text, "contents": []}
        stack[-1]["contents"].append(current)
    else:
        # Same depth.
        if kind == "p":
            # A <p> element is added to the current level.
            current["contents"].append(text)
        else:
            # An <h> element is added to the parent of current.
            current = {"title": text, "contents": []}
            stack[-1]["contents"].append(current)

This builds a tree of arbitrary depth. It assumes the depth increases by only 1 at a time, but it can decrease by any amount.

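It would be better to track the depth inside each dictionary, so that you can move more than one level at a time. Instead of just "title" and "contents", each node could hold "title", "depth", and "contents".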
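A minimal sketch of that idea, assuming the same (text, depth, type) tuples with "h" and "p" as the type values and heading depths of 1 or more. The function name build_tree and the sample data are hypothetical. Because every node stores its own depth, a heading can be several levels deeper or shallower than the one before it.

```python
def build_tree(tuples):
    """Nest (text, depth, kind) tuples; kind is "h" or "p" (assumed encoding)."""
    root = {"title": "root", "depth": 0, "contents": []}
    stack = [root]  # open elements, outermost first
    for text, depth, kind in tuples:
        if kind == "p":
            # Paragraphs go to the innermost open heading; their depth is unused.
            stack[-1]["contents"].append(text)
            continue
        # Close every open heading at the same or a deeper level.
        # Assumes depth >= 1, so the root (depth 0) is never popped.
        while stack[-1]["depth"] >= depth:
            stack.pop()
        node = {"title": text, "depth": depth, "contents": []}
        stack[-1]["contents"].append(node)
        stack.append(node)
    return root


# Hypothetical sample: the third heading jumps from level 2 to level 4.
sample = [
    ("Sanctuary Verses", 1, "h"),
    ("Some other title", 2, "h"),
    ("...", 2, "p"),
    ("sub-sub-title", 4, "h"),
    ("sub-sub-content", 4, "p"),
    ("Some different title", 2, "h"),
]
tree = build_tree(sample)
```

Here the stack always ends with the heading that is currently open, so no separate current or depth variable is needed.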

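Explanation
The stack keeps track of the open elements, and current is the element we are building.

If we find a depth greater than the current depth, we put the current element on the stack (it is still open) and start working on the next-level element.

If the depth is less than that of the current element, we close the current element and its parents until we are back at the same depth.

Finally, if the depth is the same, we decide whether the item is a "p" element that is simply added, or another "h" that closes the current element and starts a new one.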

Regarding python - Python 3: How to convert plain HTML into a nested dictionary based on the level of the h tags?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59929011/