Question
From reading various posts, it seems that JavaScript's unescape() is equivalent to Python's urllib.unquote(). However, when I test both I get different results:
unescape('%u003c%u0062%u0072%u003e');
Output: <br>
import urllib
urllib.unquote('%u003c%u0062%u0072%u003e')
Output: %u003c%u0062%u0072%u003e
I would expect Python to also return <br>. Any ideas as to what I'm missing here?
Thanks!
Recommended answer
%uxxxx is a non-standard URL encoding scheme that is not supported by urllib.unquote().
It was only ever part of ECMAScript (ECMA-262, 3rd edition); the format was rejected by the W3C and was never part of an RFC.
You could use a regular expression to convert such codepoints:
re.sub(r'%u([a-fA-F0-9]{4}|[a-fA-F0-9]{2})', lambda m: unichr(int(m.group(1), 16)), quoted)
This decodes both the %uxxxx and the %uxx forms that ECMAScript 3rd ed. can decode.
Demo:
>>> import re
>>> quoted = '%u003c%u0062%u0072%u003e'
>>> re.sub(r'%u([a-fA-F0-9]{4}|[a-fA-F0-9]{2})', lambda m: unichr(int(m.group(1), 16)), quoted)
u'<br>'
>>> altquoted = '%u3c%u0062%u0072%u3e'
>>> re.sub(r'%u([a-fA-F0-9]{4}|[a-fA-F0-9]{2})', lambda m: unichr(int(m.group(1), 16)), altquoted)
u'<br>'
However, you should avoid using this encoding altogether if possible.
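If the encoding cannot be avoided and you are on Python 3, where unichr() no longer exists, the same regular expression should work with chr() instead. The following is only a minimal sketch under that assumption; the names decode_u_escapes and _U_ESCAPE are hypothetical and chosen purely for illustration:

```python
import re

# Same pattern as the regular expression in the answer:
# %u followed by four hex digits, or by two hex digits.
_U_ESCAPE = re.compile(r'%u([a-fA-F0-9]{4}|[a-fA-F0-9]{2})')


def decode_u_escapes(quoted):
    """Replace each %uxxxx / %uxx escape with the character it encodes.

    Hypothetical helper: plain %xx escapes are left untouched.
    """
    # chr() is the Python 3 replacement for Python 2's unichr().
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), quoted)


if __name__ == '__main__':
    print(decode_u_escapes('%u003c%u0062%u0072%u003e'))
```

Note that this helper only touches the %u forms; ordinary %xx escapes such as %20 pass through unchanged and would still need to be handled separately (in Python 3, unquote() lives in urllib.parse).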