This post covers the question "Why is the Content-Length header not the same when calculated manually?" and its solution.

Problem Description

An answer here (Size of raw response in bytes) says to just take the len() of the response content:

>>> response = requests.get('https://github.com/')
>>> len(response.content)
51671

However, doing that does not get the accurate content length. For example, check out this Python code:

import sys
import requests

def proccessUrl(url):
    try:
        r = requests.get(url)
        print("Correct Content Length: "+r.headers['Content-Length'])
        print("bytes of r.text       : "+str(sys.getsizeof(r.text)))
        print("bytes of r.content    : "+str(sys.getsizeof(r.content)))
        print("len r.text            : "+str(len(r.text)))
        print("len r.content         : "+str(len(r.content)))
    except Exception as e:
        print(str(e))

#this url contains a content-length header, we will use that to see if the content length we calculate is the same.
proccessUrl("https://stackoverflow.com")

If we try to manually calculate the content length and compare it to what is in the header, why do we get an answer that is much larger?

Correct Content Length: 51504
bytes of r.text       : 515142
bytes of r.content    : 257623
len r.text            : 257552
len r.content         : 257606

Why does len(r.content) not return the correct content length? And how can we manually calculate it accurately if the header is missing?

Solution

The Content-Length header reflects the length of the body of the response as it was sent over the network. That's not the same thing as the length of the text or content attributes, because the response could be compressed; requests decompresses the response for you.
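
As a quick sanity check, you can ask the server not to compress at all; this is a sketch, assuming the server honors the request and still sends a Content-Length:

import requests

# Ask the server to skip compression. If it complies, the
# Content-Length header matches len(r.content) directly.
r = requests.get('https://stackoverflow.com',
                 headers={'Accept-Encoding': 'identity'})
print(r.headers.get('Content-Length'), len(r.content))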

You'd have to bypass a lot of internal plumbing to get the original, compressed, raw content, and then access some more internals if you want the response object to still work correctly. The 'easiest' method is to enable streaming, then read from the raw socket:

import requests
from io import BytesIO

r = requests.get(url, stream=True)
# read the still-compressed body directly from the raw urllib3 connection
raw_content = r.raw.read()
content_length = len(raw_content)
# replace the internal file object so the response can serve the data again
r.raw._fp = BytesIO(raw_content)

Demo:

>>> import requests
>>> from io import BytesIO
>>> url = "https://stackoverflow.com"
>>> r = requests.get(url, stream=True)
>>> r.headers['Content-Encoding'] # a compressed response
'gzip'
>>> r.headers['Content-Length']   # the raw response contains 52055 bytes of compressed data
'52055'
>>> r.headers['Content-Type']     # we are served UTF-8 HTML data
'text/html; charset=utf-8'
>>> raw_content = r.raw.read()
>>> len(raw_content)              # the raw content body length
52055
>>> r.raw._fp = BytesIO(raw_content)
>>> len(r.content)    # the decompressed binary content, byte count
258719
>>> len(r.text)       # the Unicode content decoded from UTF-8, character count
258658

This reads the full response into memory, so don't use this if you expect large responses! In that case, you could instead use shutil.copyfileobj() to copy the data from the r.raw file to a spooled temporary file (which switches to an on-disk file once a certain size is reached), get the size of that file, then stuff that file onto r.raw._fp.

A function that adds a Content-Length header to any response that is missing it would look like this:

import requests
import shutil
import tempfile

def ensure_content_length(
    url, *args, method='GET', session=None, max_size=2**20,  # 1 MiB
    **kwargs
):
    kwargs['stream'] = True
    session = session or requests.Session()
    r = session.request(method, url, *args, **kwargs)
    if 'Content-Length' not in r.headers:
        # stream content into a temporary file so we can get the real size
        spool = tempfile.SpooledTemporaryFile(max_size)
        shutil.copyfileobj(r.raw, spool)
        r.headers['Content-Length'] = str(spool.tell())
        spool.seek(0)
        # replace the original socket with our temporary file
        r.raw._fp.close()
        r.raw._fp = spool
    return r

This accepts an existing session, and lets you specify the request method too. Adjust max_size as needed for your memory constraints. Demo on https://github.com, which lacks a Content-Length header:

>>> r = ensure_content_length('https://github.com/')
>>> r
<Response [200]>
>>> r.headers['Content-Length']
'14490'
>>> len(r.content)
54814

Note that the Content-Length recorded in the demo above (14490) is the size of the still-compressed stream read from r.raw, which is why it is smaller than len(r.content). More generally, if there is no Content-Encoding header present, or its value is set to identity, and a Content-Length is available, then you can rely on Content-Length being the full size of the response. That's because then there is obviously no compression applied.
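
If you want that check in code, a minimal sketch could look like this (the helper name is my own, not part of requests):

def content_length_is_reliable(response):
    # Content-Length only matches the decompressed body when no
    # compression was applied to the transfer.
    encoding = response.headers.get('Content-Encoding', 'identity')
    return encoding == 'identity' and 'Content-Length' in response.headers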

As a side note: you should not use sys.getsizeof() if what you are after is the length of a bytes or str object (the number of bytes or characters in that object). sys.getsizeof() gives you the internal memory footprint of a Python object, which covers more than just the number of bytes or characters in that object. See What is the difference between len() and sys.getsizeof() methods in python?
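
A quick REPL illustration of the difference; the sys.getsizeof() figure includes CPython's per-object overhead, so the exact number varies by Python version and platform:

>>> import sys
>>> data = b'abc'
>>> len(data)            # number of bytes in the object
3
>>> sys.getsizeof(data)  # object memory footprint (varies by build)
36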
