This article describes how to handle offline (local) data with Python Scrapy, which may be a useful reference for anyone facing the same problem.

Problem Description

I have a 270MB dataset (10000 html files) on my computer. Can I use Scrapy to crawl this dataset locally? How?

Recommended Answer

SimpleHTTP Server Hosting

If you truly want to host it locally and use Scrapy, you can serve it by navigating to the directory it's stored in and running the SimpleHTTPServer (port 8000 shown below):

python -m SimpleHTTPServer 8000

(On Python 3 the module was renamed, so the equivalent command is python -m http.server 8000.)

Then just point Scrapy at http://127.0.0.1:8000. Note that scrapy crawl takes a spider name rather than a URL, so the address goes in your spider's start_urls, and you run the spider by name (myspider below stands for whatever name your spider declares):

$ scrapy crawl myspider

file://

An alternative is to have Scrapy point at the set of files directly, using file:// URLs:

$ scrapy crawl myspider  # with start_urls = ["file:///home/sagi/html_files"], assuming you're on a *nix system
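One way to build that list of file:// URLs is to walk the directory with the standard library. This is a sketch under the assumption that the 10000 files sit flat in one directory (the path /home/sagi/html_files is the example path from the answer):

```python
from pathlib import Path


def file_urls(directory):
    """Return file:// URLs for every .html file directly under directory."""
    return sorted(p.as_uri() for p in Path(directory).glob("*.html"))


# The result can then be assigned to a spider's start_urls, e.g.:
# start_urls = file_urls("/home/sagi/html_files")
```

Use Path.rglob("*.html") instead of glob if the files are nested in subdirectories.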

Summary

Once you have set up your spider for Scrapy (see the dirbot example), run it by name:

$ scrapy crawl myspider

If links in the html files are absolute rather than relative, though, this may not work well; you would need to adjust the files yourself.
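If the saved pages all link back to one original site, a crude way to do that adjustment is to strip the absolute prefix before serving the files. The prefix http://example.com below is an assumption; substitute whatever site the pages were saved from:

```python
def localize_links(html, site_prefix="http://example.com"):
    """Crudely rewrite absolute links under site_prefix to local relative links.

    site_prefix is a placeholder for the site the pages were saved from.
    """
    # e.g. href="http://example.com/page.html" becomes href="/page.html"
    return html.replace(site_prefix + "/", "/")
```

Run this over each saved file once, before starting the local server, so the rewritten links resolve against 127.0.0.1:8000.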

This concludes the article on Python Scrapy with offline (local) data; hopefully the recommended answer is helpful.
