javascript - How do I parse a document from a string and set the correct `documentURL`?

I am trying to write a very simple crawler in JavaScript (tested in Firefox).

I use the ES6 fetch function to retrieve documents like this:

fetch(url)
  .then(response => response.text())
  .then(text => (new DOMParser()).parseFromString (text, 'text/html'))
  .then(doc => {
     doc.querySelectorAll('a').forEach(node => {
       fetch(node.href)
         .then(response => response.text())
         .then(text => (new DOMParser()).parseFromString(text, 'text/html'))
         .then(doc => {
           doc.querySelectorAll('a').forEach(node => {
             console.log (node.href);
           });
         });
     });
  });

The problem is the following behavior, quoted from MDN:

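When a DOMParser is instantiated by calling new DOMParser(), it inherits the calling code's principal (except that for chrome callers the principal is set to the null principal) and the documentURI and baseURI of the window the constructor came from.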

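The first fetch works fine as long as the URL is the same as the window's URL. However, I use querySelectorAll to collect the anchors from the fetched page so that I can fetch those pages too and build a DOM tree for each URL. The problem is that the DOM tree created by parseFromString has the wrong documentURL. parseFromString does not take a URL parameter; instead the document inherits its documentURL from the window. That is obviously the wrong URL, which means every relative link in the fetched documents is broken.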
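
To illustrate the effect, here is a minimal sketch that uses only the standard URL constructor. The three URLs are made up for the example; the point is that the same relative href resolves to a different absolute URL depending on which base it is resolved against.

```javascript
// Hypothetical URLs, for illustration only.
const windowUrl = 'https://crawler.example/app/index.html';      // page running the crawler
const fetchedUrl = 'https://site.example/docs/guide/page.html';  // page that was fetched
const relativeHref = '../images/logo.png';                       // link found in the fetched page

// Resolved against the window's URL (what the parsed document inherits).
const wrong = new URL(relativeHref, windowUrl).href;

// Resolved against the URL the markup actually came from.
const right = new URL(relativeHref, fetchedUrl).href;

console.log(wrong); // https://crawler.example/images/logo.png
console.log(right); // https://site.example/docs/images/logo.png
```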

How can I parse a document from a string and set the correct documentURL?

(new DOMParser()).parseFromString('<html></html>', 'text/html')

The `URL` and `documentURI` properties are both read-only.

Best answer

You can try something like this. Just keep track of the correct origin manually.

// Save the origin of the original request.
var origin1 = new URL(url).origin

fetch(url)
  .then(response => response.text())
  .then(text => (new DOMParser()).parseFromString (text, 'text/html'))
  .then(doc => {
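     doc.querySelectorAll('a[href]').forEach(node => {
       // Read the raw attribute instead of node.href, which is always absolute
       // (and has already been resolved against the wrong base).
       var href = node.getAttribute('href');
       // Anything that does not start with http:// or https:// is treated as relative.
       if (!/^https?:\/\//.test(href)) {
         // Prepend the saved origin. This is only correct for root-relative
         // paths such as "/about", not for "page.html" or "../x".
         href = origin1 + href;
       }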

       fetch(href)
         .then(response => response.text())
         .then(text => (new DOMParser()).parseFromString(text, 'text/html'))
         .then(doc => {
           doc.querySelectorAll('a').forEach(node => {
             // See above, check if relative and append to correct
             // origin if so.
             // console.log (node.href);
           });
         });
     });
  });
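
A more general variant of the same idea, sketched below, resolves each raw href against the full URL of the page it came from rather than against the origin alone, so that paths like "page.html" or "../x" are handled too. The helper names `resolveHref` and `setDocumentBase` are hypothetical, not part of any API. The second helper rests on an assumption worth checking in your browser: that inserting a `<base href>` element into the parsed document makes `node.href` resolve against that URL.

```javascript
// Resolve a raw href attribute against the URL of the page it came from.
// Returns null for missing or unparsable hrefs and for non-HTTP(S) schemes.
function resolveHref(rawHref, pageUrl) {
  if (rawHref === null || rawHref === undefined) return null;
  let resolved;
  try {
    resolved = new URL(rawHref, pageUrl); // standard relative-URL resolution
  } catch (e) {
    return null; // e.g. a malformed href
  }
  if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') return null;
  return resolved.href;
}

// Alternative (assumption, see above): give the parsed document an explicit
// base URL so that node.href can be used directly afterwards.
// Returns false and changes nothing if the page already declares <base href>.
function setDocumentBase(doc, pageUrl) {
  if (doc.querySelector('base[href]')) return false;
  const base = doc.createElement('base');
  base.setAttribute('href', pageUrl);
  doc.head.prepend(base);
  return true;
}

console.log(resolveHref('z.html', 'https://example.com/x/y.html')); // https://example.com/x/z.html
console.log(resolveHref('/a', 'https://example.com/x/y.html'));     // https://example.com/a
console.log(resolveHref('mailto:someone@example.com', 'https://example.com/')); // null
```

Inside the crawler this would be used as `resolveHref(node.getAttribute('href'), url)`, where `url` is the address passed to the fetch call that produced the document. Note that a page shipping its own relative `<base href>` would still be resolved against the window's URL, so that case needs separate handling.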