This question already has answers here:
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms (9 answers)
Closed 10 months ago.

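I'm parsing an XML file with Node.js and RegExp, but I can't find a way to extract all the children of a given parent. For example, I need every FormalName="(.+)" value under the parent PARENT1: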

<TopicSet FormalName="PARENT1">
    <Topic>
      <TopicType FormalName="Child1" />
    </Topic>
    <Topic>
      <TopicType FormalName="Child2" />
    </Topic>
    <Topic>
      <TopicType FormalName="Child3" />
    </Topic>
</TopicSet>
<TopicSet FormalName="PARENT2">
    <Topic>
      <TopicType FormalName="Child1" />
    </Topic>
    <Topic>
      <TopicType FormalName="Child2" />
    </Topic>
    <Topic>
      <TopicType FormalName="Child3" />
    </Topic>
</TopicSet>


I tried this:

<TopicSet FormalName="PARENT1">(?:(?:\s|\S)*?)TopicType FormalName="(.+)"(?:(?:\s|\S)*?)<\/TopicSet>

But it returns only the first occurrence under PARENT1 (Child1), not Child1, Child2 and Child3.

https://regex101.com/r/3ESH29/2/

Best answer

Parsing XML with a regular expression is not advisable.
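To see why the attempt in the question stops at Child1, here is a minimal sketch. The `xml` string is a shortened, hypothetical copy of the sample input, and the `g` flag is added only so that `matchAll` can be used. The pattern matches the whole PARENT1 block once, and a capture group holds a single value per match, so only one child name comes back.

```javascript
// Shortened, hypothetical copy of the sample input from the question.
const xml = `<TopicSet FormalName="PARENT1">
    <Topic>
      <TopicType FormalName="Child1" />
    </Topic>
    <Topic>
      <TopicType FormalName="Child2" />
    </Topic>
</TopicSet>`;

// The pattern from the question, with the g flag added so matchAll accepts it.
const attempt = /<TopicSet FormalName="PARENT1">(?:(?:\s|\S)*?)TopicType FormalName="(.+)"(?:(?:\s|\S)*?)<\/TopicSet>/g;

// The single match consumes everything up to </TopicSet>,
// so the capture group is filled exactly once.
const captured = [...xml.matchAll(attempt)].map(m => m[1]);
console.log(captured); // [ 'Child1' ]
```

Repeating the capture group inside the same pattern does not help either, because a repeated group only keeps its last iteration.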

Instead of a regular expression you can use DOMParser, for example with querySelectorAll to get the FormalName values inside PARENT1:

Example using jsdom:



let xml = `<TopicSet FormalName="PARENT1">
    <Topic>
      <TopicType FormalName="Child1" />
    </Topic>
    <Topic>
      <TopicType FormalName="Child2" />
    </Topic>
    <Topic>
      <TopicType FormalName="Child3" />
    </Topic>
</TopicSet>
<TopicSet FormalName="PARENT2">
    <Topic>
      <TopicType FormalName="Child1" />
    </Topic>
    <Topic>
      <TopicType FormalName="Child2" />
    </Topic>
    <Topic>
      <TopicType FormalName="Child3" />
    </Topic>
</TopicSet>`;

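// DOMParser is a browser API; in Node.js, take it from a jsdom window.
const { JSDOM } = require("jsdom");
const { DOMParser } = new JSDOM("").window;

// XML allows a single root element, so wrap the sibling TopicSet elements before parsing.
let parser = new DOMParser();
let doc = parser.parseFromString(`<root>${xml}</root>`, "text/xml");
let res = doc.querySelectorAll("TopicSet[FormalName='PARENT1'] Topic TopicType");
res.forEach(e => console.log(e.getAttribute("FormalName")));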
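If adding a DOM parser is not an option, a two-pass regular expression can work as a stopgap for flat, machine-generated input like the sample. This is only a sketch under strong assumptions: TopicSet elements are never nested, attributes are always double-quoted and written exactly as in the sample, and there are no comments or CDATA sections. The helper name `childFormalNames` and the `sample` string are hypothetical and not part of the original question or answer.

```javascript
// Hypothetical helper: FormalName values of the TopicType elements inside one TopicSet.
function childFormalNames(xml, parentName) {
  // Escape regex metacharacters so the parent name is matched literally.
  const escaped = parentName.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Pass 1: isolate the parent block (assumes TopicSet elements are never nested).
  const block = xml.match(
    new RegExp(`<TopicSet FormalName="${escaped}">([\\s\\S]*?)</TopicSet>`)
  );
  if (!block) return [];
  // Pass 2: a global pattern yields one match, and so one capture, per TopicType.
  return [...block[1].matchAll(/<TopicType FormalName="([^"]+)"/g)].map(m => m[1]);
}

// Hypothetical sample shaped like the input in the question.
const sample = `<TopicSet FormalName="PARENT1">
    <Topic><TopicType FormalName="Child1" /></Topic>
    <Topic><TopicType FormalName="Child2" /></Topic>
    <Topic><TopicType FormalName="Child3" /></Topic>
</TopicSet>
<TopicSet FormalName="PARENT2">
    <Topic><TopicType FormalName="Other1" /></Topic>
</TopicSet>`;

console.log(childFormalNames(sample, "PARENT1")); // [ 'Child1', 'Child2', 'Child3' ]
```

Any deviation from those assumptions (single quotes, reordered attributes, nesting) silently breaks this, which is why the parser-based approach remains the safer choice.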

10-04 17:15