Question
Is it possible to crawl local files with Scrapy 0.18.4 without having an active project? I've seen this answer and it looks promising, but to use the `crawl` command you need a project.
Alternatively, is there an easy/minimalist way to set up a project for an existing spider? I have my spider, pipelines, middleware, and items defined in one Python file. I've created a scrapy.cfg file with only the project name. This lets me use `crawl`, but since I don't have a spiders folder Scrapy can't find my spider. Can I point Scrapy to the right directory, or do I need to split my items, spider, etc. into separate files?
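For reference, a minimal sketch of what that wiring might look like; the file and module names here are hypothetical. Scrapy locates spiders through the `SPIDER_MODULES` setting, so a settings module that points at the single file containing your spider should be enough - no spiders folder required.

```
# scrapy.cfg -- minimal; "settings" names the settings.py module next to it
[settings]
default = settings
```

```python
# settings.py -- point Scrapy at the module that defines the spider class,
# so `scrapy crawl <name>` can find it without a spiders/ package
BOT_NAME = 'myproject'            # hypothetical project name
SPIDER_MODULES = ['my_spider']    # hypothetical: the one .py file holding the spider
```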
[edit] I forgot to say that I'm running the spider using `Crawler.crawl(my_spider)` - ideally I'd still like to be able to run the spider like that, but I can run it in a subprocess from my script if that's not possible.
Turns out the suggestion in the answer I linked does work: http://localhost:8000 can be used as a `start_url`, so there's no need for a project.
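As an illustration, a minimal sketch of that approach using the Scrapy 0.18-era API (the spider name, port, and XPath are arbitrary): serve the directory of local files with Python's built-in HTTP server, then point `start_urls` at it.

```python
# Serve the local files first (Python 2, matching the Scrapy 0.18 era):
#   python -m SimpleHTTPServer 8000
from scrapy.spider import BaseSpider          # renamed scrapy.Spider in later versions
from scrapy.selector import HtmlXPathSelector

class LocalSpider(BaseSpider):
    name = 'local'
    start_urls = ['http://localhost:8000']    # the locally served directory

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # the directory index lists each local file as a link
        for href in hxs.select('//a/@href').extract():
            self.log('found local file: %s' % href)
```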
Answer
As an option, you can run Scrapy from a script; here is a self-contained example script and an overview of the approach used.
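For reference, a sketch of what such a script looked like with the Scrapy 0.18-era API; the spider module and class names are hypothetical. The idea is to create a `Crawler`, hand it a spider instance via `crawl()`, and drive the Twisted reactor yourself.

```python
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import CrawlerSettings

from my_spider import MySpider    # hypothetical: the file that defines your spider

spider = MySpider()
settings = CrawlerSettings()      # default settings; no project required
crawler = Crawler(settings)
# stop the reactor once the spider finishes, otherwise the script never exits
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)             # the Crawler.crawl(my_spider) call from the question
crawler.start()
log.start()
reactor.run()                     # blocks until reactor.stop() fires
```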
This doesn't mean you have to put everything in one file. You can still have `spider.py`, `items.py`, and `pipelines.py` - just import them correctly in the script you start crawling from.
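For example, a hypothetical `run.py` sitting next to those files might wire them together like this (names assumed; in Scrapy 0.18, pipelines are enabled as a list of dotted paths via `settings.overrides`, while newer versions use a dict of path to priority):

```python
# run.py -- hypothetical names; lives alongside spider.py and pipelines.py
from scrapy.settings import CrawlerSettings

from spider import MySpider       # spider class defined in spider.py

settings = CrawlerSettings()
# enable the pipeline from pipelines.py by its dotted path
settings.overrides['ITEM_PIPELINES'] = ['pipelines.MyPipeline']
```

From there, pass `settings` and `MySpider()` to the same `Crawler`/reactor code shown in the script above.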