Problem description
On the webapp server, when I try encoding "médicaux_Jérôme.txt" using java.net.URLEncoder, it gives the following string:
me%CC%81dicaux_Je%CC%81ro%CC%82me.txt
On my backend server, when I try encoding the same string, it gives the following:
m%C3%A9dicaux_J%C3%A9r%C3%B4me.txt
Can someone help me understand why the same input produces different output? Also, how can I get standardized output each time I encode the same string?
If you don't specify an encoding, the outcome depends on the platform's default encoding.
See the java.net.URLEncoder javadocs:
So, use the suggested method and specify the encoding:
String urlEncodedString = URLEncoder.encode(stringToBeUrlEncoded, "UTF-8");
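As a minimal sketch of that call in context: the class and helper names below are hypothetical, and the checked exception is wrapped only to keep the example short. With the precomposed file name as input, it produces the second string from the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodeWithCharset {
    // Hypothetical helper: always encodes as UTF-8 instead of relying on
    // the platform default encoding.
    static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Precomposed accents (U+00E9 and U+00F4), written as escapes so the
        // result does not depend on the source file encoding.
        String name = "m\u00e9dicaux_J\u00e9r\u00f4me.txt";
        System.out.println(encodeUtf8(name)); // m%C3%A9dicaux_J%C3%A9r%C3%B4me.txt
    }
}
```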
About different representations for the same string, if you specified "UTF-8":
The two URL-encoded strings you gave in the question, although encoded differently, represent the same unencoded text, so there is nothing inherently wrong there. By pasting both into a URL decoding tool, we can verify that they decode to the same text.
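The same check can be sketched in Java instead of an online tool. The class and method names are hypothetical. Note one assumption made explicit here: the decoded strings look identical but are different code point sequences, so the comparison normalizes both to NFC with java.text.Normalizer before calling equals.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.text.Normalizer;

public class DecodeAndCompare {
    // Hypothetical helper: decodes a percent-encoded string as UTF-8.
    static String decodeUtf8(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // True when both encoded strings decode to canonically equivalent text.
    static boolean sameText(String encodedA, String encodedB) {
        String a = Normalizer.normalize(decodeUtf8(encodedA), Normalizer.Form.NFC);
        String b = Normalizer.normalize(decodeUtf8(encodedB), Normalizer.Form.NFC);
        return a.equals(b);
    }

    public static void main(String[] args) {
        String first = "me%CC%81dicaux_Je%CC%81ro%CC%82me.txt";
        String second = "m%C3%A9dicaux_J%C3%A9r%C3%B4me.txt";
        // Raw comparison: the code point sequences differ.
        System.out.println(decodeUtf8(first).equals(decodeUtf8(second))); // false
        // Comparison after normalization: the text is the same.
        System.out.println(sameText(first, second)); // true
    }
}
```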
This happens because the same text can be URL-encoded in more than one way, especially when it contains accented characters: an accent can be stored either as part of a single precomposed character or as a separate combining character, which is precisely what happens in your case.
In your case, specifically, the first string encoded é as e + ´ (latin small letter e + combining acute accent), resulting in e%CC%81. The second encoded é directly as %C3%A9 (latin small letter e with acute; there are two % escapes because the character takes two bytes in UTF-8).
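To see the two forms side by side, here is a small sketch (class and helper names are hypothetical) that takes the precomposed é, decomposes it with java.text.Normalizer, and encodes both variants as UTF-8.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.text.Normalizer;

public class AccentForms {
    // Hypothetical helper: percent-encodes a string as UTF-8.
    static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // One code point: latin small letter e with acute (U+00E9).
        String composed = "\u00e9";
        // Two code points after NFD: e followed by combining acute (U+0301).
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);

        System.out.println(composed.length());      // 1
        System.out.println(decomposed.length());    // 2
        System.out.println(encodeUtf8(composed));   // %C3%A9
        System.out.println(encodeUtf8(decomposed)); // e%CC%81
    }
}
```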
Again, there is no problem with either representation. Both are valid Unicode normalization forms: the first is decomposed (NFD) and the second is composed (NFC). Mac OS X is known to favor the decomposed form with combining accents; in the end, it is a matter of preference of the encoder. In your case, the two servers are probably running different JREs or, if the file name was user generated, the user may have used a different OS (or tool) that produced that form.
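For standardized output, one option is to normalize the string to a single form before encoding it. The sketch below picks NFC, which is an assumption rather than a requirement (NFD would work equally well as long as every server uses the same form); the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.text.Normalizer;

public class NormalizedUrlEncoder {
    // Hypothetical helper: normalizes to NFC first, then encodes as UTF-8,
    // so composed and decomposed inputs yield the same encoded string.
    static String encodeNormalized(String value) {
        String nfc = Normalizer.normalize(value, Normalizer.Form.NFC);
        try {
            return URLEncoder.encode(nfc, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same file name, once with combining accents and once precomposed.
        String decomposed = "me\u0301dicaux_Je\u0301ro\u0302me.txt";
        String composed = "m\u00e9dicaux_J\u00e9r\u00f4me.txt";
        System.out.println(encodeNormalized(decomposed)); // m%C3%A9dicaux_J%C3%A9r%C3%B4me.txt
        System.out.println(encodeNormalized(composed));   // m%C3%A9dicaux_J%C3%A9r%C3%B4me.txt
    }
}
```

Applying the same normalization on both the webapp server and the backend server means the encoded value no longer depends on which OS or tool produced the file name.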