python - How do I integrate a Scrapy web crawler with a Luigi data pipeline?

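(Long-time user, first question, and nervous about asking.)
I'm currently building a Python backend that will be deployed to an AWS EC2 instance, with the following architecture: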
|----Data sources-----Temp storage---Data processing-----DB----|
Web crawler data-----*Save to S3*=\
API data-------------*Save to S3*==>Luigi data pipeline-->MongoDB
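As shown above, we have several ways of fetching data (API requests, a Scrapy web crawler, and so on), but the tricky part is coming up with a simple, fault-tolerant way to connect the received data to the Luigi data pipeline.
Is there a way to integrate the web crawler's output into the Luigi data pipeline? If not, how can I bridge the gap between the HTTP data fetchers and the Luigi tasks?
Any suggestions, docs, or articles would be much appreciated! Also, if you need more details, I'll post them here as soon as I can.
Thank you!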

Best answer

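I've never used Luigi, but I do use Scrapy. My guess is that the real question is: how do you notify Luigi, in a sensible way, that there is new data to process?
There is a similar question you can learn from: When a new file arrives in S3, trigger luigi task
Maybe the two of you work at the same place :).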
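I strongly recommend hosting your spider in scrapyd and using scrapyd-client to drive it. If you try to run Scrapy inside another tool that also uses the twisted library, all sorts of hairy problems pop up (I'm not sure whether Luigi does). I would drive the spider with scrapyd-client and have the spider post to a trigger URL that tells Luigi, one way or another, to start the task.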
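For what it's worth, here is a minimal sketch of the "drive it through scrapyd" step. It talks to scrapyd's `schedule.json` HTTP endpoint directly with the standard library rather than going through the scrapyd-client package. The host and port, the project and spider names, and the `trigger_url` spider argument are all assumptions for illustration. Scrapyd hands extra form fields to the spider as arguments, so the spider itself would still have to read `trigger_url` and post to it when it finishes.

```python
import json
import urllib.parse
import urllib.request

# Assumption: scrapyd is running on this host on its default port.
SCRAPYD_URL = "http://localhost:6800"


def build_schedule_request(project, spider, **spider_args):
    """Build the POST request for scrapyd's schedule.json endpoint.

    Extra keyword arguments are sent as additional form fields, which
    scrapyd passes on to the spider as spider arguments.
    """
    fields = {"project": project, "spider": spider, **spider_args}
    body = urllib.parse.urlencode(fields).encode()
    return urllib.request.Request(
        f"{SCRAPYD_URL}/schedule.json", data=body, method="POST"
    )


def schedule_spider(project, spider, **spider_args):
    """Schedule one crawl and return the job id that scrapyd assigns."""
    request = build_schedule_request(project, spider, **spider_args)
    with urllib.request.urlopen(request, timeout=10) as response:
        reply = json.load(response)
    if reply.get("status") != "ok":
        raise RuntimeError(f"scrapyd refused the job: {reply}")
    return reply["jobid"]
```

Calling something like `schedule_spider("myproject", "candidates", trigger_url="http://localhost:8081/")` (hypothetical names) would give you the job id to store, so that the later notification can be matched to the crawl that produced it.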
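Again, since I haven't used Luigi I don't know the details on that side... but you don't want to be busy checking/polling to see whether the job has finished.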
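To make the "push, don't poll" idea concrete, here is a rough sketch of a tiny trigger endpoint built on the standard library's `http.server`. It waits for a POST carrying a JSON body with a `jobid` field and then launches Luigi through its command-line entry point. The module name `pipeline`, the task name `IngestCrawl`, its `jobid` parameter, and port 8081 are all hypothetical; treat this as one possible shape under those assumptions, not as the way Luigi is meant to be triggered.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer


def luigi_command(jobid):
    """Command line for a hypothetical IngestCrawl task in pipeline.py."""
    return ["luigi", "--module", "pipeline", "IngestCrawl", "--jobid", jobid]


def start_luigi(jobid):
    """Launch the Luigi task in the background without waiting for it."""
    subprocess.Popen(luigi_command(jobid))


def make_trigger_handler(on_trigger):
    """Return a handler class that calls on_trigger(jobid) for each valid POST."""

    class TriggerHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            try:
                jobid = str(json.loads(self.rfile.read(length))["jobid"])
            except (ValueError, KeyError, TypeError):
                self.send_error(400, "expected a JSON body with a 'jobid' field")
                return
            on_trigger(jobid)
            self.send_response(202)  # accepted: the task runs asynchronously
            self.end_headers()

        def log_message(self, format, *args):
            pass  # keep the sketch quiet; use real logging in practice

    return TriggerHandler


def serve(port=8081):
    """Block forever, starting one Luigi run per notification received."""
    HTTPServer(("localhost", port), make_trigger_handler(start_luigi)).serve_forever()
```

The spider (or whatever fetched the data) posts `{"jobid": "..."}` to this endpoint once its output is safely stored, so nothing has to keep asking whether the crawl is done.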
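I have a Django web app that launches the spider, stores the jobid returned by scrapyd-client, gets a JSON tap on the shoulder when the crawl is done, and then uses Celery and Solr to ingest the data.
Edited to include the pipeline code from the comments below: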

        # Fragment from inside the item pipeline: item and spider are in scope;
        # rootdir, fname and lname are defined elsewhere (not shown).
        # Needs: import base64, json, os, urllib.request
        #        from http.cookiejar import CookieJar
        for fentry in item['files']:

            # open and read the file
            with open(os.path.join(rootdir, fentry['path']), 'rb') as fh:
                pdf = fh.read()

            # just in case we need cookies
            cj = CookieJar()
            opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

            # set the content type
            headers = {
                'Content-Type': 'application/json',
            }

            # fill out the object; the PDF is base64-encoded so it fits in JSON
            json_body = json.dumps({
                'uid': 'indeed-' + item['indeed_uid'],
                'url': item['candidate_url'],
                'first_name': fname,
                'last_name': lname,
                'pdf': base64.b64encode(pdf).decode(),
                'jobid': spider.jobid,
            }).encode()

            # send the POST; response holds the server's reply
            request = urllib.request.Request(
                'http://localhost:8080/api/someapi/', json_body, headers, method='POST')
            response = opener.open(request)
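
For completeness, the receiving end of that POST has to undo the base64 step before it can store or process the PDF. A minimal sketch, assuming the body has exactly the shape built in the pipeline code above (the function name `decode_candidate` is made up for illustration):

```python
import base64
import json


def decode_candidate(body: bytes) -> dict:
    """Parse one POSTed record and turn the base64 'pdf' field back into bytes."""
    record = json.loads(body)
    record["pdf"] = base64.b64decode(record["pdf"])
    return record
```

The `jobid` field comes through unchanged, which is what lets the receiver tie each record back to the crawl that produced it.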

Regarding "python - How do I integrate a Scrapy web crawler with a Luigi data pipeline?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44531459/