Problem Description
I'm trying to mirror a website recursively, i.e. to get all the pages of one site. All the pages are in subfolders of just one folder, so I could easily mirror everything using wget:
wget --mirror --recursive --page-requisites --adjust-extension --no-parent --convert-links https://www.example.com/
However, the page is mirrored before some JS scripts are executed, and those JS scripts don't get mirrored. I need to mirror them too, somehow, because they change the webpage's DOM. Another option would be to wait for the site to finish loading and then mirror the loaded page (the task isn't time-critical).
I've already tried mirroring the webpage with PhantomJS, but I couldn't get PhantomJS to recurse, or at least I couldn't figure out how. I also took a closer look at the wget man page, but couldn't find any corresponding option.
Is it even possible to do this?
Recommended Answer
wget doesn't execute any JavaScript. You might need to go through a proxy like Splash. I've used Splash before with Scrapy spiders, but never with wget. Worth trying, though.
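For what it's worth, a minimal sketch of that Scrapy + Splash combination (not anything from the original answer) might look like the following. It assumes a local Splash instance, e.g. started with "docker run -p 8050:8050 scrapinghub/splash", plus the scrapy-splash plugin; the domain, wait time and output folder "mirror/" are placeholders.

import os
from urllib.parse import urlparse

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy_splash import SplashRequest


class MirrorSpider(scrapy.Spider):
    name = "mirror"
    allowed_domains = ["www.example.com"]      # placeholder domain
    start_urls = ["https://www.example.com/"]

    def start_requests(self):
        for url in self.start_urls:
            # Splash executes the page's JavaScript and returns the
            # resulting DOM to Scrapy.
            yield SplashRequest(url, self.parse, args={"wait": 2.0})

    def parse(self, response):
        self.save_page(response)
        # Follow in-domain links recursively, again through Splash.
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            if urlparse(url).netloc in self.allowed_domains:
                yield SplashRequest(url, self.parse, args={"wait": 2.0})

    def save_page(self, response):
        # Map the URL path onto a local file layout, roughly like wget --mirror.
        path = urlparse(response.url).path.lstrip("/") or "index.html"
        if path.endswith("/"):
            path += "index.html"
        if not os.path.splitext(path)[1]:
            path += ".html"
        local = os.path.join("mirror", path)
        os.makedirs(os.path.dirname(local), exist_ok=True)
        with open(local, "wb") as f:
            f.write(response.body)  # rendered HTML returned by Splash


if __name__ == "__main__":
    process = CrawlerProcess(settings={
        "SPLASH_URL": "http://localhost:8050",
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_splash.SplashCookiesMiddleware": 723,
            "scrapy_splash.SplashMiddleware": 725,
            "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
        },
        "SPIDER_MIDDLEWARES": {
            "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
        },
        "DUPEFILTER_CLASS": "scrapy_splash.SplashAwareDupeFilter",
    })
    process.crawl(MirrorSpider)
    process.start()

Because Splash hands back the DOM after the page's scripts have run, the saved files reflect the rendered page rather than the raw source; rewriting links to point at the local copies, as wget --convert-links does, would still have to be handled separately.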