I am using Scrapy to scrape data from a website, and I am trying to export that data to XML in a specific format.
This is what I want my XML to look like:
<?xml version="1.0" encoding="UTF-8"?>
<data>
<row>
<field1><![CDATA[Data Here]]></field1>
<field2><![CDATA[Data Here]]></field2>
</row>
</data>
I run the scrape with this command:
$ scrapy crawl my_scrap -o items.xml -t xml
The output I currently get is:
<?xml version="1.0" encoding="utf-8"?>
<items><item><field1><value>Data Here</value></field1><field2><value>Data Here</value></field2></item>
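As you can see, it adds the <value> elements, and I cannot rename the root node or the item nodes. I know I need to use XmlItemExporter, but I am not sure how to implement it in my project. I tried adding it to pipelines.py as shown here, but I always end up with the error: AttributeError: 'CrawlerProcess' object has no attribute 'signals'
Does anyone know of an example of how to reformat the data when exporting it to XML with XmlItemExporter?
Edit:
Here is my XmlItemExporter in my pipelines.py module:
from scrapy import signals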
from scrapy.contrib.exporter import XmlItemExporter

class XmlExportPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_products.xml' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Edit (showing the modification and the traceback):
I modified the spider_opened function:
    def spider_opened(self, spider):
        file = open('%s_products.xml' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file, 'data', 'row')
        self.exporter.start_exporting()
The traceback I get is:
Traceback (most recent call last):
File "/root/self_opportunity/venv/lib/python2.6/site-packages/twisted/internet/defer.py", line 551, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/core/engine.py", line 265, in <lambda>
spider=spider, reason=reason, spider_stats=self.crawler.stats.get_stats()))
File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/signalmanager.py", line 23, in send_catch_log_deferred
return signal.send_catch_log_deferred(*a, **kw)
File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/utils/signal.py", line 53, in send_catch_log_deferred
*arguments, **named)
--- <exception caught here> ---
File "/root/self_opportunity/venv/lib/python2.6/site-packages/twisted/internet/defer.py", line 134, in maybeDeferred
result = f(*args, **kw)
File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/xlib/pydispatch/robustapply.py", line 47, in robustApply
return receiver(*arguments, **named)
File "/root/self_opportunity/self_opportunity/pipelines.py", line 28, in spider_closed
self.exporter.finish_exporting()
exceptions.AttributeError: 'XmlExportPipeline' object has no attribute 'exporter'
Best answer
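You can get XmlItemExporter to do most of what you want just by giving it the names of the nodes you want. Pass them as keyword arguments so the call does not depend on positional order:
XmlItemExporter(file, root_element='data', item_element='row')
See the documentation.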
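The <value> elements appear inside your fields because those fields are not scalar values. If XmlItemExporter encounters a scalar value, it simply outputs <fieldname>data</fieldname>, but if it encounters an iterable value, it serializes it like this: <fieldname><value>data1</value><value>data2</value></fieldname>.
The solution is to stop emitting non-scalar field values for your items. If you are unwilling to do that, subclass XmlItemExporter and override its _export_xml_field method so that it does what you want when the item value is an iterable. This is the code for XmlItemExporter, so you can see the implementation.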
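To make the scalar versus iterable rule concrete, here is a minimal standard-library sketch. It is not Scrapy's code: export_items, export_field, is_listlike and flatten_item are hypothetical names made up for this illustration, and the serializer only imitates the behaviour described above. It shows how a list-valued field picks up <value> wrappers, and how flattening the item first (for example in a pipeline's process_item) avoids them.

```python
import io
from xml.sax.saxutils import XMLGenerator


def is_listlike(value):
    """True for iterables that are not strings, bytes or dicts."""
    return hasattr(value, "__iter__") and not isinstance(value, (str, bytes, dict))


def export_field(xg, name, value):
    """Imitates the rule above: iterables become nested <value> elements."""
    xg.startElement(name, {})
    if is_listlike(value):
        for v in value:
            export_field(xg, "value", v)
    else:
        xg.characters(str(value))
    xg.endElement(name)


def export_items(items, root="data", item_tag="row"):
    """Serialize a list of dict items to an XML string (no XML declaration)."""
    buf = io.StringIO()
    xg = XMLGenerator(buf)
    xg.startElement(root, {})
    for item in items:
        xg.startElement(item_tag, {})
        for name, value in item.items():
            export_field(xg, name, value)
        xg.endElement(item_tag)
    xg.endElement(root)
    return buf.getvalue()


def flatten_item(item, sep=", "):
    """Hypothetical helper: turn list-valued fields into plain strings."""
    return {
        name: sep.join(str(v) for v in value) if is_listlike(value) else value
        for name, value in item.items()
    }


if __name__ == "__main__":
    scraped = {"field1": ["Data Here"], "field2": ["Data Here"]}
    print(export_items([scraped]))                # fields contain <value> wrappers
    print(export_items([flatten_item(scraped)]))  # fields contain plain text
```

In a real spider, the list values usually come from selector calls that return a list of matches, so taking a single string when the field is populated has the same effect as flatten_item here.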