Problem description
We've recently moved a website to a new server, and are running into an odd issue where some uploaded images with Unicode characters in the filename are giving us a 404 error.
Via SSH/FTP, we can see that the files are definitely there.
For example:
http://sjofasting.no/project/adnoy
None of the images work:
Code:
<img class='image-display' title='' src='http://sjofasting.no/wp/wp-content/uploads/2012/03/ådnøy_1_2.jpg' width='685' height='484'/>
SSH:
-rw-r--r-- 1 xxxxxxxx xxxxxxxx 836813 Aug 3 16:12 ådnøy_1_2.jpg
What is also strange is that if you navigate to the directory, you can click on the image there and it works:
http://sjofasting.no/wp/wp-content/uploads/2012/03/
Click on 'ådnøy_1_2.jpg' and it works.
Somehow WordPress is generating
http://sjofasting.no/wp/wp-content/uploads/2012/03/ådnøy_1_2.jpg
while copying the link from the directory listing gives
http://sjofasting.no/wp/wp-content/uploads/2012/03/a%CC%8Adn%C3%B8y_1_2.jpg
What is going on here?
If I copy the image URL from the WordPress page source, I get:
http://sjofasting.no/wp/wp-content/uploads/2011/11/Bore-Strand-Hotellg%C3%A5rd-12.jpg
When I copy it from the Apache directory listing, I get:
http://sjofasting.no/wp/wp-content/uploads/2011/11/Bore-Strand-Hotellga%cc%8ard-12.jpg
What could account for this discrepancy between %C3%A5 and a%cc%8a?
Recommended answer
Unicode normalisation.
0xC3 0xA5 is the UTF-8 encoding for U+00E5 a-with-ring.
0xCC 0x8A is the UTF-8 encoding for U+030A combining ring.
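To see where the two URL spellings come from, here is a minimal Python sketch that encodes both forms of the character as UTF-8 and percent-encodes them. The helper name `percent_encode` is made up for this illustration; `urllib.parse.quote` is the standard-library call doing the work.

```python
from urllib.parse import quote

composed = "\u00e5"     # U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
decomposed = "a\u030a"  # plain 'a' followed by U+030A COMBINING RING ABOVE


def percent_encode(text: str) -> str:
    """Percent-encode text as UTF-8, the way it would appear in a URL path."""
    return quote(text, safe="")


print(composed.encode("utf-8"))    # b'\xc3\xa5'
print(decomposed.encode("utf-8"))  # b'a\xcc\x8a'
print(percent_encode(composed))    # %C3%A5
print(percent_encode(decomposed))  # a%CC%8A
```

The hex digits in percent-encoding are case-insensitive, so a%CC%8A and a%cc%8a name the same bytes.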
U+00E5 is the composed (Normal Form C) way of writing an a-ring; the letter a followed by U+030A is the decomposed (Normal Form D) way of writing it. å vs å: they should look the same, though they may differ slightly depending on font rendering.
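As a quick check of the two forms, the following sketch uses Python's standard `unicodedata.normalize` to convert between them. The wrapper names `to_nfc` and `to_nfd` are only illustrative.

```python
import unicodedata

composed = "\u00e5"     # single code point (Normal Form C)
decomposed = "a\u030a"  # base letter + combining ring (Normal Form D)


def to_nfc(text: str) -> str:
    """Return text in Normal Form C (composed)."""
    return unicodedata.normalize("NFC", text)


def to_nfd(text: str) -> str:
    """Return text in Normal Form D (decomposed)."""
    return unicodedata.normalize("NFD", text)


print(composed == decomposed)          # False: different code point sequences
print(to_nfc(decomposed) == composed)  # True
print(to_nfd(composed) == decomposed)  # True
print(len(composed), len(decomposed))  # 1 2
```

A plain string comparison treats the two as different, which is what a byte-oriented filesystem lookup does as well.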
Now normally it doesn't really matter which one you've got, because sensible filesystems leave them untouched. If you save a file called [char U+00E5].txt (å.txt), it stays called that under Windows and Linux.
Macs, on the other hand, are insane. The filesystem prefers Normal Form D, to the extent that any composed characters you pass into it get converted into decomposed ones. If you put a file in called [char U+00E5].txt and immediately list the directory, you'll find you've actually got a file called a[char U+030A].txt. You can still access the file as [char U+00E5].txt on a Mac, because it converts that input into Normal Form D too before looking it up, but you cannot get back the same filename, as a character sequence, that you put in: it's a lossy conversion.
So if you save your files on a Mac and then transfer them to a filesystem where [char U+00E5].txt and a[char U+030A].txt refer to different files, you will get broken links.
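To find out which uploaded files are affected, one option is to list the names that are not already in Normal Form C. This is only a sketch: the function name `find_decomposed_names` is hypothetical, and it assumes the server's filesystem returns names exactly as stored (as typical Linux filesystems do).

```python
import os
import unicodedata
from typing import List


def find_decomposed_names(directory: str) -> List[str]:
    """Return the names in `directory` that change under NFC normalisation."""
    return sorted(
        name
        for name in os.listdir(directory)
        if unicodedata.normalize("NFC", name) != name
    )
```

Calling it on the uploads directory (for example the `wp-content/uploads/2012/03` folder from the question) would list names such as the decomposed `ådnøy_1_2.jpg`, which a composed URL will not match.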
Update the pages to point to the Normal Form D versions of the URLs, or re-upload the files from a filesystem that doesn't egregiously mangle Unicode characters.
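For the first option, a rough sketch of rewriting a URL into its Normal Form D version might look like this. The function name `to_nfd_url` is made up here, and the sketch only touches the path component; how the rewritten URLs get back into the WordPress content is not covered.

```python
import unicodedata
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def to_nfd_url(url: str) -> str:
    """Return `url` with its path converted to NFD and percent-encoded as UTF-8."""
    parts = urlsplit(url)
    # Decode any existing percent-escapes, then decompose the characters.
    path = unicodedata.normalize("NFD", unquote(parts.path))
    # quote() keeps '/' unescaped by default, so the path structure survives.
    return urlunsplit(parts._replace(path=quote(path)))


print(to_nfd_url(
    "http://sjofasting.no/wp/wp-content/uploads/2011/11/Bore-Strand-Hotellg%C3%A5rd-12.jpg"
))
# http://sjofasting.no/wp/wp-content/uploads/2011/11/Bore-Strand-Hotellga%CC%8Ard-12.jpg
```

The result matches the URL copied from the directory listing in the question, apart from the case of the hex digits. Note that ø (U+00F8) has no decomposition, so it stays as %C3%B8 in either form.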
Think Different, Cause Bizarre Interoperability Problems.