NOTE: For more answers related to this, please see Special Characters in Google Calculator
I noticed when grabbing the return value for a Google Calculator calculation, the thousands place is separated by a rather odd character. It is not simply a space.
Let's take the example of converting $4,000 USD to GBP.
If you visit the following Google link:
http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp
You'll note that the response is:
{lhs: "4000 U.S. dollars",rhs: "2 497.81441 British pounds",error: "",icc: true}
This looks reasonable, and the thousands place appears to be separated by a whitespace character.
However, if you enter the following into your command line:
curl -s "http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp"
You'll note that the response is:
{lhs: "4000 U.S. dollars",rhs: "2?498.28243 British pounds",error: "",icc: true}
That question mark (?) is a replacement character. What is going on?
AppleScript returns a different replacement character:
{lhs: "4000 U.S. dollars",rhs: "2†498.28243 British pounds",error: "",icc: true}
I am also getting from other sources:
{lhs: "4000 U.S. dollars",rhs: "2�498.28243 British pounds",error: "",icc: true}
It turns out that � is the proper Unicode replacement character, U+FFFD (decimal 65533).
Can anyone give me insight into what Google is passing me?
It's a non-breaking space, U+00A0. It's to ensure that the number won't get broken at the end of a line.
Google does declare the correct encoding (UTF-8), however:
Content-Type: text/html; charset=UTF-8
so ...
- if it comes out as a normal space (U+0020) instead (Firefox does that when copying, stupidly enough), then the application performs conversion of certain characters to lookalikes, maybe to fit in some sort of restricted code page (ASCII perhaps).
- if there is a question mark, then it was correctly read as Unicode but some part in processing uses a legacy character set that doesn't contain that character so it gets converted.
- if there is a replacement character � (U+FFFD) then it was likely read as UTF-8, converted into a legacy character set that contains the character (e.g. Latin 1) and then re-interpreted as UTF-8.
- if there is a totally different character, such as your dagger (†), then I'd guess the response is read correctly as Unicode, gets converted to a character set that contains the character and re-interpreted in another character set. A quick look at the Mac Roman codepage reveals that A0 indeed maps to †.
Needless to say, some part of whatever you use to process that response seems to be horribly broken with regard to Unicode. That is something I'd hope wouldn't happen very often in this millennium, but apparently it still does.
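The four cases in the list can be reproduced in a few lines. This is only a rough Python sketch, not what any of those tools actually do internally: the `sample` string and the helper names are made up for illustration, and the NFKC normalization in the first helper is just one plausible way a "lookalike" conversion could happen.

```python
import unicodedata

NBSP = "\u00a0"  # NO-BREAK SPACE
# Hypothetical excerpt of the rhs value, with the real separator in place.
sample = f"2{NBSP}498.28243"


def as_lookalike(text):
    # Assumption: compatibility normalization, which folds U+00A0 to U+0020.
    return unicodedata.normalize("NFKC", text)


def as_question_mark(text):
    # Target charset (ASCII here) has no NBSP, so the encoder substitutes "?".
    return text.encode("ascii", errors="replace").decode("ascii")


def as_replacement_char(text):
    # Latin-1 encodes NBSP as the single byte 0xA0, which is not valid UTF-8
    # on its own, so a lenient UTF-8 decoder emits U+FFFD.
    return text.encode("latin-1").decode("utf-8", errors="replace")


def as_dagger(text):
    # The same 0xA0 byte read as Mac Roman is the dagger.
    return text.encode("latin-1").decode("mac_roman")


if __name__ == "__main__":
    for fn in (as_lookalike, as_question_mark, as_replacement_char, as_dagger):
        # ascii() makes the otherwise invisible differences visible.
        print(f"{fn.__name__:20} {ascii(fn(sample))}")
```

Printing with `ascii()` is deliberate: a terminal with its own encoding problems would otherwise hide exactly the character under investigation.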
I figured out what it was by fiddling around in PowerShell a bit:
PS Home:\> $wc = new-object net.webclient
PS Home:\> $x = $wc.downloadstring('http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp')
PS Home:\> [char[]]$x|%{"$_ - " + +$_}
...
" - 34
2 - 50
- 160
4 - 52
9 - 57
8 - 56
. - 46
2 - 50
8 - 56
2 - 50
4 - 52
...
Also a quick look at the response headers revealed that the encoding is set correctly.
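For anyone not on PowerShell, roughly the same inspection can be done in Python. This is a sketch under assumptions: `sample` is a hard-coded stand-in for part of the response (so no network call is made), and `dump_code_points` and `parse_amount` are hypothetical helpers, the latter showing one way to get a usable number once you know the separator is U+00A0.

```python
import re
import unicodedata

# Hypothetical excerpt of a response; \u00a0 is the separator in question.
sample = '"2\u00a0498.28243 British pounds"'

# A digit, then digits or (non-breaking) spaces, then an optional fraction.
_NUMBER = re.compile(r"\d[\d\u00a0 ]*(?:\.\d+)?")


def dump_code_points(text):
    """Return (character, code point, Unicode name) for every character."""
    return [(ch, ord(ch), unicodedata.name(ch, "UNKNOWN")) for ch in text]


def parse_amount(text):
    """Extract the first number, ignoring space and NBSP group separators."""
    match = _NUMBER.search(text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    digits = match.group(0).replace("\u00a0", "").replace(" ", "")
    return float(digits)


if __name__ == "__main__":
    for ch, code, name in dump_code_points(sample):
        print(f"{ch!r:10} {code:6} {name}")
    print(parse_amount(sample))
```

The dump should show code point 160 (NO-BREAK SPACE) between the 2 and the 4, matching the PowerShell output above.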