This article covers how to crawl local files with Scrapy without an active project; if you have run into the same problem, the question and recommended answer below may be a useful reference.

Problem description

Is it possible to crawl local files with Scrapy 0.18.4 without having an active project? I've seen this answer and it looks promising, but to use the crawl command you need a project.

Alternatively, is there an easy/minimalist way to set up a project for an existing spider? I have my spider, pipelines, middleware, and items defined in one Python file. I've created a scrapy.cfg file with only the project name. This lets me use crawl, but since I don't have a spiders folder Scrapy can't find my spider. Can I point Scrapy to the right directory, or do I need to split my items, spider, etc. up into separate files?
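As a side note on that second question, one minimal way to let the crawl command find a spider that lives in a single file is a tiny settings module whose SPIDER_MODULES setting points at that file. This is only a sketch; the file and module names below are illustrative, not something from the original post:

```python
# mysettings.py -- a minimal settings module sitting next to scrapy.cfg (illustrative names)
#
# scrapy.cfg would then just contain:
#   [settings]
#   default = mysettings
#
BOT_NAME = 'standalone'
SPIDER_MODULES = ['my_spider_file']  # the module holding the spider class, so `scrapy crawl` can find it
```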

[edit] I forgot to say that I'm running the spider using Crawler.crawl(my_spider) - ideally I'd still like to be able to run the spider like that, but can run it in a subprocess from my script if that's not possible.

Turns out the suggestion in the answer I linked does work - http://localhost:8000 can be used as a start_url, so there's no need for a project.
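A minimal sketch of that idea, assuming the local files are served with Python's built-in HTTP server on port 8000; the spider name and parse logic are illustrative, and the selector API shown is the one from the Scrapy 0.18 era:

```python
# Serve the directory of local files first (Python 2 / Scrapy 0.18 era):
#   cd /path/to/local/files && python -m SimpleHTTPServer 8000
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class LocalFilesSpider(BaseSpider):  # illustrative name, not from the original post
    name = "localfiles"
    start_urls = ["http://localhost:8000/"]  # the locally served directory

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # SimpleHTTPServer renders the directory as a page of <a href> links to the files
        for href in hxs.select("//a/@href").extract():
            self.log("found local file link: %s" % href)
```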

Recommended answer

As an option, you can run Scrapy from a script; here is a self-contained example script and an overview of the approach used.
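For reference, a sketch of what such a script looked like with the 0.18-era API, following the "run Scrapy from a script" pattern from the docs of that time; the spider class and start URL are placeholders:

```python
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.spider import BaseSpider
from scrapy.utils.project import get_project_settings

class MySpider(BaseSpider):              # placeholder spider
    name = "myspider"
    start_urls = ["http://localhost:8000/"]

    def parse(self, response):
        self.log("visited %s" % response.url)

# With no project on the path this is intended to fall back to Scrapy's default settings
settings = get_project_settings()

crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(MySpider())
crawler.start()
log.start()
reactor.run()                            # blocks until the spider closes
```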

This doesn't mean you have to put everything in one file. You can still have spider.py, items.py, pipelines.py - just import them correctly in the script you start crawling from.
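For instance, the runner script from the sketch above could simply import those modules; the layout and class names here are hypothetical, the list form of ITEM_PIPELINES matches the pre-1.0 setting format, and settings.overrides is the pre-1.0 way of overriding settings from a script:

```python
# run.py, sitting next to spider.py, items.py and pipelines.py (hypothetical layout)
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings

from spider import MySpider  # the spider class, split out into spider.py

settings = get_project_settings()
# enable the pipeline defined in pipelines.py (list form, as used by Scrapy 0.18)
settings.overrides['ITEM_PIPELINES'] = ['pipelines.MyPipeline']

crawler = Crawler(settings)
crawler.configure()
crawler.crawl(MySpider())
crawler.start()
# ...then wire up log.start() and the Twisted reactor as in the previous sketch
```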
