Problem description
Version: Python 2.7.3
Other libraries: Python-Requests 1.2.3, jinja2 (2.6)
I have a script that submits data to a forum, and the problem is that non-ASCII characters appear as garbage. For instance, a name like André Téchiné comes out as AndrÃ© TÃ©chinÃ©.
Here's how the data is submitted:
1) Data is initially loaded from a UTF-8 encoded CSV file like so:
entries = []
with codecs.open(filename, 'r', 'utf-8') as f:
    for row in unicode_csv_reader(f.readlines()[1:]):
        entries.append(dict(zip(csv_header, row)))
unicode_csv_reader is from the bottom of the Python csv documentation page: http://docs.python.org/2/library/csv.html
When I type the entry's name in the interpreter, I see it as u'Andr\xe9 T\xe9chin\xe9'.
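For comparison, here is a minimal Python 3 sketch of this loading step. It assumes the header row lives in the file itself, so csv.DictReader can stand in for both the unicode_csv_reader recipe and the separate csv_header list. The load_entries helper and the sample columns are made up for illustration.

```python
import csv
import io


def load_entries(f):
    """Return one plain dict per data row, keyed by the header row."""
    return [dict(row) for row in csv.DictReader(f)]


# Stand-in for open(filename, encoding='utf-8', newline=''); the columns are hypothetical.
sample = io.StringIO(u"id,name\n1,Andr\xe9 T\xe9chin\xe9\n", newline='')
entries = load_entries(sample)

# ascii() keeps the output printable on any terminal.
print(ascii(entries[0]['name']))
```

In Python 3 the csv module works on text directly, so no extra decoding wrapper is needed as long as the file is opened with the right encoding.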
2) Next I render the data through jinja2:
tpl = tpl_env.get_template(u'forumpost.html')
rendered = tpl.render(entries=entries)
When I inspect the name in rendered in the interpreter, I again see the same: u'Andr\xe9 T\xe9chin\xe9'
Now, if I write the rendered variable to a file like this, it displays correctly:
with codecs.open('out.txt', 'a', 'utf-8') as f:
    f.write(rendered)
But I must send it to the forum:
3) In the POST request code I have:
params = {u'post': rendered}
headers = {u'content-type': u'application/x-www-form-urlencoded'}
session.post(posturl, data=params, headers=headers, cookies=session.cookies)
session is a Requests session.
The name is displayed broken in the forum post. I have tried the following:
- Leaving out the headers
- Encoding rendered as rendered.encode('utf-8') (same result)
- rendered = urllib.quote_plus(rendered) (comes out as all %XY)
If I type rendered.encode('utf-8') I see the following:
'Andr\xc3\xa9 T\xc3\xa9chin\xc3\xa9'
How could I fix the issue? Thanks.
Your client behaves as it should. For example, running nc -l 8888 as a server and making a request:
import requests
requests.post('http://localhost:8888', data={u'post': u'Andr\xe9 T\xe9chin\xe9'})
shows:
POST / HTTP/1.1
Host: localhost:8888
Content-Length: 33
Content-Type: application/x-www-form-urlencoded
Accept-Encoding: gzip, deflate, compress
Accept: */*
User-Agent: python-requests/1.2.3 CPython/2.7.3

post=Andr%C3%A9+T%C3%A9chin%C3%A9
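The same body can be reproduced without a network round trip. This is only a sketch using the standard library's urlencode, not a line-for-line copy of what Requests does internally. It assumes the value is encoded as UTF-8 before quoting, which matches the request shown above.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

name = u'Andr\xe9 T\xe9chin\xe9'

# Encode to UTF-8 explicitly, then percent-encode as form data.
body = urlencode({'post': name.encode('utf-8')})
print(body)
print(len(body))
```

The resulting string is 33 characters long, the same as the Content-Length header in the request.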
You can check that it is correct:
>>> import urllib
>>> urllib.unquote_plus(b"Andr%C3%A9+T%C3%A9chin%C3%A9").decode('utf-8')
u'Andr\xe9 T\xe9chin\xe9'
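This round trip also suggests why quoting by hand before posting, as tried in the question, came out as all %XY: the form encoder quotes the already-quoted string a second time. Here is a small sketch using the standard library (the variable names are illustrative), assuming the server unquotes only once:

```python
try:
    from urllib.parse import quote_plus, unquote_plus, urlencode  # Python 3
except ImportError:
    from urllib import quote_plus, unquote_plus, urlencode  # Python 2

name = u'Andr\xe9 T\xe9chin\xe9'

# Quoting by hand first ...
pre_quoted = quote_plus(name.encode('utf-8'))

# ... and then letting the form encoder quote again escapes every '%' as '%25'.
body = urlencode({'post': pre_quoted})
print(body)

# A server that unquotes once is left with the literal %XY text.
received = unquote_plus(body.split('=', 1)[1])
print(received)
```

So the unicode string should be passed to the library as is, leaving the quoting to it.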
- Check that the server decodes the request correctly. You could try specifying the charset:
  headers = {"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"}
  The body contains only ASCII characters, so it shouldn't hurt, and a correct server would ignore any parameters for the x-www-form-urlencoded type anyway. Look up "URL-encoded form data" for the gory details.
- Check that the issue is not a display artefact, i.e., the value is correct but it displays incorrectly.
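To illustrate what a decoding mismatch looks like: if the server, or the page that displays the post, treats the UTF-8 bytes as Latin-1, each é turns into the two characters Ã©. That a Latin-1 mix-up is what happens on this particular forum is an assumption. The sketch only shows the effect of such a mismatch and that it can be reversed.

```python
name = u'Andr\xe9 T\xe9chin\xe9'
wire_bytes = name.encode('utf-8')      # what the client sends, once unquoted

right = wire_bytes.decode('utf-8')     # matching decoding
wrong = wire_bytes.decode('latin-1')   # mismatched decoding (assumed for illustration)

print(right == name)
print(len(name), len(wrong))           # each accented letter became two characters

# The mistake is reversible: re-encoding as Latin-1 gives back the original bytes.
repaired = wrong.encode('latin-1').decode('utf-8')
print(repaired == name)
```

If the text stored by the forum can be repaired this way, the bytes arrived intact and were only decoded with the wrong charset somewhere downstream.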