Problem description
I've found that the PHP functions basename() and pathinfo() behave strangely with multibyte UTF-8 names. They remove all non-Latin characters up to the first Latin character or punctuation sign. After that, however, subsequent non-Latin characters are preserved.
basename("àxà"); // returns "xà", I would expect "àxà" or just "x" instead
pathinfo("àyà/àxà", PATHINFO_BASENAME); // returns "xà", same as above
Curiously, though, the dirname part of pathinfo() works fine:
pathinfo("àyà/àxà", PATHINFO_DIRNAME); // returns "àyà"
The PHP documentation warns that basename() and pathinfo() are locale-aware, but this does not justify the inconsistency between pathinfo(..., PATHINFO_BASENAME) and pathinfo(..., PATHINFO_DIRNAME), not to mention the fact that identical non-Latin characters are either discarded or kept depending on their position relative to Latin characters.
It sounds like a PHP bug.
Since "basename" checks are really important for security, to avoid directory traversal, is there any reliable basename filter that works decently with Unicode input?
Recommended answer
I've found that changing the locale fixes everything.
While Apache runs with the "C" locale by default, CLI scripts run with a UTF-8 locale by default, such as "en_US.UTF-8" (or in my case "it_IT.UTF-8"). Under these conditions, the problem does not occur.
Therefore, the workaround on Apache consists of changing the locale from "C" to "C.UTF-8" before calling these functions.
setlocale(LC_ALL,'C.UTF-8');
basename("àxà"); // now returns "àxà", which is correct
pathinfo("àyà/àxà", PATHINFO_BASENAME); // now returns "àxà", which is correct
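"C.UTF-8" may not be installed on every system, and setlocale() returns false when none of the requested locales can be applied. The following is only a sketch of a more defensive variant: withUtf8Ctype() and its candidate list are hypothetical names, not part of PHP, and it assumes that changing LC_CTYPE alone is enough for basename(), which I have not verified on every platform.

```php
<?php
/**
 * Hypothetical helper: run $fn with LC_CTYPE switched to the first available
 * locale from $candidates, then restore the previous setting.
 * Assumption: LC_CTYPE is the only category that affects basename().
 */
function withUtf8Ctype(callable $fn, array $candidates = ['C.UTF-8', 'en_US.UTF-8'])
{
    // Passing "0" only queries the current setting.
    $previous = setlocale(LC_CTYPE, '0');

    // With an array, setlocale() tries each entry in order and returns
    // false if none of them could be applied.
    $applied = setlocale(LC_CTYPE, $candidates);
    if ($applied === false) {
        throw new RuntimeException('None of the candidate UTF-8 locales is available');
    }

    try {
        return $fn();
    } finally {
        // Restore the original character-type locale even if $fn throws.
        setlocale(LC_CTYPE, $previous);
    }
}

$name = withUtf8Ctype(function () {
    return basename("àyà/àxà");
});
```

Failing loudly when no UTF-8 locale is available seems safer than silently falling back to the "C" locale, where the truncation described in the question would come back.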
Or even better, if you want to back up the current locale and restore it once done:
$lc = new LocaleManager();
$lc->doBackup();
$lc->fixLocale();
basename("àxà/àyà");
$lc->doRestore();

class LocaleManager
{
    /** @var array Saved locale settings, keyed by category name (e.g. "LC_CTYPE") */
    private $backup = array();

    public function doBackup()
    {
        $this->backup = array();
        // Passing "0" only queries the current setting without changing it
        $localeSettings = setlocale(LC_ALL, "0");
        if (strpos($localeSettings, ";") === false)
        {
            $this->backup["LC_ALL"] = $localeSettings;
        }
        // If any of the locales differs, then setlocale() returns all the locales separated by semicolon
        // Eg: LC_CTYPE=it_IT.UTF-8;LC_NUMERIC=C;LC_TIME=C;...
        else
        {
            $locales = explode(";", $localeSettings);
            foreach ($locales as $locale)
            {
                // Limit to 2 parts so that a value containing "=" is kept whole
                list ($key, $value) = explode("=", $locale, 2);
                $this->backup[$key] = $value;
            }
        }
    }

    public function doRestore()
    {
        foreach ($this->backup as $key => $value)
        {
            // Skip categories that have no matching LC_* constant on this platform
            if (defined($key))
            {
                setlocale(constant($key), $value);
            }
        }
    }

    public function fixLocale()
    {
        setlocale(LC_ALL, "C.UTF-8");
    }
}
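As for the "reliable basename filter" asked about in the question, one option that does not depend on the locale at all is to split the path with byte-wise string functions. This is just a minimal sketch, not a drop-in replacement: utf8Basename() is a hypothetical name, it does not implement the suffix argument of basename(), and it deliberately treats both "/" and "\" as separators on every platform. It relies on the fact that in valid UTF-8 the bytes for "/" and "\" never occur inside a multibyte sequence.

```php
<?php
/**
 * Hypothetical helper, not part of PHP: return the last path component
 * using byte-wise string functions only, so the current locale is irrelevant.
 */
function utf8Basename(string $path): string
{
    // Treat both separators alike, then drop any trailing separators.
    $path = rtrim(str_replace('\\', '/', $path), '/');

    // Everything after the last separator is the base name.
    $pos = strrpos($path, '/');
    return $pos === false ? $path : substr($path, $pos + 1);
}

$a = utf8Basename("àxà");      // "àxà"
$b = utf8Basename("àyà/àxà");  // "àxà"
$c = utf8Basename("../àxà/");  // "àxà"
```

For directory traversal checks the caller would still need to reject results such as "", "." and "..", since this function only extracts the last component and does not validate it.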