Problem description
This is related to a previous question, here: Converting a \u escaped Unicode string to ASCII
I proposed a solution involving eval(parse(text=x)), which, for non-R users, means what it says: parsing the text string, then evaluating it. The aim was not to allow arbitrary code to be executed, but only to un-escape escaped Unicode text. Hence the solution:
eval(parse(text=paste0("'", x, "'")))
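To show what this snippet does, here is a minimal sketch. The sample value of x is an assumption, chosen only so that a single escape gets resolved.

```r
# Hypothetical input: the six characters \u00e9 appear literally in the string.
x <- "caf\\u00e9"
nchar(x)  # 9: "caf" plus the six characters of the escape

# Wrap the text in single quotes so it becomes an R string literal,
# then parse and evaluate that literal.
y <- eval(parse(text = paste0("'", x, "'")))
nchar(y)  # 4: the escape has been resolved to one character
```

The parser does the un-escaping as a side effect of reading a string literal, which is why everything that follows is about making sure x stays inside that literal.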
While this should be fairly safe given the restricted objective, I'd be interested to know: how much sanitisation is required to keep things safe?
At a minimum, I guess any embedded single and double quotes have to be escaped. For example, suppose we have
x <- "this is a '; print(dir()); 'string"
Then eval'ing this per the snippet above would execute the code in the middle. So we have to escape the quotes:
eval(parse(text=paste0("'",
                       gsub("'", "\\\\'", x),
                       "'")))
And similarly for double quotes. I don't think the unescaped Unicode equivalents \u0022 and \u0027 are a problem, since to the parser they'll be identical to plain " and '.
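The injection and the quote-escaping fix can be checked without running anything hostile. The sketch below only calls parse(), never eval(), and counts how many expressions the parser sees; the variable names unsafe and escaped are illustrative, not from the original post.

```r
x <- "this is a '; print(dir()); 'string"

# Without escaping, the wrapped text parses as three separate expressions:
# a string, the call print(dir()), and another string.
unsafe <- parse(text = paste0("'", x, "'"))
length(unsafe)   # 3

# With the single quotes escaped, the whole text is one string literal.
escaped <- parse(text = paste0("'", gsub("'", "\\\\'", x), "'"))
length(escaped)  # 1
```

Counting parsed expressions like this could also serve as a cheap sanity check before calling eval(), although on its own it does not make the approach safe.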
Are there any holes in this approach that I've missed?
Recommended answer
You would need to escape backslashes too. Otherwise, the input
this is a \'; print(dir()); 'string
gets escaped to:
'this is a \\'; print(dir()); 'string'
The double backslash is evaled as a literal backslash, so the quote after it is active and the code is executed.
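A minimal R sketch of this hole follows, again using parse() only so that nothing is executed. The helper wrap() and the hostile input are assumptions made for illustration. With the answer's exact input, the question's gsub() call would also escape the second quote, which would likely leave a stray backslash outside any string and make parse() fail; the trailing # below is one assumed way an attacker could sidestep that.

```r
# Same escaping as in the question, wrapped in a hypothetical helper.
wrap <- function(x) paste0("'", gsub("'", "\\\\'", x), "'")

# Hypothetical hostile input: backslash, quote, code, then a comment marker.
x <- "\\'; print(dir()); #"
cat(wrap(x), "\n")  # shows the text that would be handed to parse()

exprs <- parse(text = wrap(x))
length(exprs)       # 2: a one-backslash string, then the call print(dir())
```

The comment marker swallows the closing quote that the wrapper appends, so the parser is left with a valid string followed by a valid call.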
Also, I don't know R, but you could probably at minimum cause a crash using raw control characters such as a newline, or invalid escapes.
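For the invalid-escape case, a small sketch in R is below. The input is hypothetical, and tryCatch() is used only to turn the resulting error into a visible value rather than stopping the script.

```r
# Hypothetical input containing \q, an escape sequence R does not recognise.
x <- "C:\\qux"

result <- tryCatch(
  eval(parse(text = paste0("'", x, "'"))),
  error = function(e) "parse failed"
)
result  # "parse failed"
```

So even input with no malicious intent, such as a Windows-style path, can make the eval(parse()) approach fail unless errors are handled.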
eval is a mug's game in general. Normal string handling (search string for the sequence you want, replacing it) is the better approach, and using an existing library for a particular properly-specified format is best of all. For example, if you have JSON, use a JSON parser. There are many possible string literal formats that use \u escapes, all with slightly different rules, so you will want to choose the exact format correctly.
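As one possible shape for the "normal string handling" route, here is a sketch in base R that never calls parse() or eval(). The function name unescape_u is hypothetical, and the format is an assumption: only \u followed by exactly four hex digits is recognised, with no handling of surrogate pairs, escaped backslashes, or other escape sequences.

```r
# Replace each \uXXXX (exactly four hex digits) with the character it names.
unescape_u <- function(x) {
  m <- gregexpr("\\\\u[0-9A-Fa-f]{4}", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    # Drop the leading \u, read the rest as hexadecimal, convert to a character.
    vapply(hits, function(h) intToUtf8(strtoi(substring(h, 3), 16L)),
           character(1), USE.NAMES = FALSE)
  })
  x
}

# Quotes and code-like text pass through as plain characters.
y <- unescape_u("caf\\u00e9 '; print(dir()); '")
```

Because the input is only ever treated as data, embedded quotes, backslashes and invalid escapes cannot change what code runs; the cost is that the exact escape format has to be spelled out by hand.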