Problem description
I am removing control characters (tab, CR, LF, \v and all other invisible characters) on the client side (after input), but since the client cannot be trusted, I have to remove them on the server too.
According to this link, the control characters run from 0x00 to 0x1F and from 0x7F to 0x9F. Thus my client (JavaScript) control character removal function is:
return s.replace(/[\x00-\x1F\x7F-\x9F]/g, "");
And my PHP (server) control character removal function is:
$s = preg_replace('/[\x00-\x1F\x7F-\x9F]/', '', $s);
Now this seems to create problems with international UTF-8 characters such as ς (0xCF 0x82), but only in PHP (because 0x82 falls inside the second range); the JavaScript equivalent does not cause any problems.
Now my question is: should I remove the control characters from 7F to 9F? To my understanding, the byte values from 127 to 159 (0x7F to 0x9F) can obviously be part of a valid UTF-8 string?
Also, maybe I shouldn't even filter the 00 to 31 control characters, because some of those values might also appear inside some unusual (Japanese? Chinese?) but valid UTF-8 characters?
Recommended answer
It seems that I just need to add the u flag to the regex, so it becomes:
$s = preg_replace('/[\x00-\x1F\x7F-\x9F]/u', '', $s);
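To see why the u flag matters, here is a small JavaScript sketch of the two views of the same string. Without the u flag, PHP's PCRE matches the pattern against individual bytes, whereas JavaScript (and PCRE with the u flag) matches against characters. The helper names stripControlChars, utf8Bytes and isControlByte are made up for this illustration, and TextEncoder is assumed to be available (modern browsers and Node.js).

```javascript
// Same character class as in the question.
const CONTROL_CHARS = /[\x00-\x1F\x7F-\x9F]/g;

// Character view: this is what JavaScript matches against,
// and what PHP matches against once the u flag is set.
function stripControlChars(s) {
  return s.replace(CONTROL_CHARS, "");
}

// Byte view: the UTF-8 encoding of the string, which is roughly
// what PHP's regex engine sees when the u flag is missing.
function utf8Bytes(s) {
  return Array.from(new TextEncoder().encode(s));
}

// True if a single byte value falls in one of the two ranges.
function isControlByte(b) {
  return b <= 0x1f || (b >= 0x7f && b <= 0x9f);
}

const sigma = "\u03C2"; // ς, encoded in UTF-8 as 0xCF 0x82

// As a character, ς (U+03C2) is outside both ranges, so it survives.
const cleaned = stripControlChars("a\tb\u0085c" + sigma);

// As bytes, the continuation byte 0x82 falls inside 0x7F-0x9F,
// so a byte-wise replace would cut the character in half.
const bytes = utf8Bytes(sigma);
const hitBytes = bytes.filter(isControlByte);

console.log(cleaned === "abc" + sigma);
console.log(bytes.map((b) => b.toString(16)), hitBytes.map((b) => b.toString(16)));
```

On the other part of the question: in UTF-8 every byte of a multi-byte sequence is 0x80 or higher, so the values 0x00 to 0x1F (and 0x7F) never occur inside another character. Only the 0x80 to 0x9F part of the second range collides with continuation bytes when the pattern is applied byte by byte.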