Problem description
I am working on a Scrapy spider that extracts the text from multiple PDF files in a directory, using slate. I have no interest in saving the actual PDFs to disk, so I've been advised to look into the io.BytesIO class at https://docs.python.org/2/library/io.html#buffered-streams.
However, I'm not sure how to pass the PDF body to the BytesIO class and then pass the in-memory PDF to slate to get the text. So far I have:
class Ove_Spider(BaseSpider):

    name = "ove"
    allowed_domains = ['myurl.com']
    start_urls = ['myurl/hgh/']

    def parse(self, response):
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            if link.endswith('.pdf'):
                link = urlparse.urljoin(base_url, link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        in_memory_pdf = BytesIO()
        in_memory_pdf.read(response.body)  # Trying to read in PDF which is in response body
I get:
in_memory_pdf.read(response.body)
TypeError: integer argument expected, got 'str'
How can I get this to work?
Recommended answer
When you call in_memory_pdf.read(response.body), read() expects the number of bytes to read, not the data itself. You want to initialize the buffer, not read from it.
In Python 2, just initialize BytesIO as:
in_memory_pdf = BytesIO(response.body)
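The difference between initializing a buffer and reading from it can be sketched with the standard library alone. This is a minimal illustration, not the spider's actual code: pdf_bytes is a made-up stand-in for response.body, and the variable names are chosen for the example.

```python
from io import BytesIO

# Hypothetical stand-in for response.body; a real PDF body is much longer.
pdf_bytes = b"%PDF-1.4 example payload"

# Passing the data to the constructor fills the buffer and
# leaves the read position at the start.
buffer = BytesIO(pdf_bytes)

header = buffer.read(8)  # read() takes a byte count, not the data
rest = buffer.read()     # no argument: read everything that remains

# An empty buffer can also be filled with write(), but it must be
# rewound with seek(0) before it is read back.
filled = BytesIO()
filled.write(pdf_bytes)
filled.seek(0)
```

Either buffer behaves like a file opened in binary mode, so it can be handed to any code that expects a file object.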
In Python 3, you cannot use BytesIO with a string because it expects bytes. If response.body is of type str, as the error message suggests, we have to encode it first:
in_memory_pdf = BytesIO(bytes(response.body,'ascii'))
But since a PDF can contain binary data, I suppose that response.body would be bytes, not str. In that case, the simple in_memory_pdf = BytesIO(response.body) works.
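The two cases above can be folded into one small helper. This is only a sketch for Python 3: body_to_buffer is a hypothetical name, and the latin-1 default is an assumption made here (it maps code points 0 to 255 onto single bytes, whereas 'ascii' would raise an error on any non-ASCII character).

```python
from io import BytesIO


def body_to_buffer(body, encoding="latin-1"):
    """Wrap a response body in an in-memory binary buffer (hypothetical helper).

    bytes are used as they are; str is encoded first, because BytesIO
    rejects text in Python 3. The default encoding is an assumption.
    """
    if isinstance(body, str):
        body = body.encode(encoding)
    return BytesIO(body)
```

In the spider, this would be called as in_memory_pdf = body_to_buffer(response.body), and the resulting buffer passed on to the PDF library in place of an open file.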