Problem description
I have a file, a.out, which contains a number of lines. Each line holds a single character: either the Unicode character U+2013 or a lowercase letter a-z.
Running the file command on a.out reports UTF-8 Unicode text.
The locale command reports:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
If I issue the command grep -P -n "[^\x00-\xFF]" a.out, I would expect only the lines containing U+2013 to be returned, and that is what happens when I run the test under cygwin. The problem environment, however, is Oracle Linux Server release 6.5, and there the grep command returns no lines. If I issue grep -P -n "[\x00-\xFF]" a.out, then all lines are returned.
I realise that "[grep -P] ...is highly experimental and grep -P may warn of unimplemented features", but no warnings are issued.
Am I missing something?
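For anyone who wants to reproduce this, a file like the one described can be rebuilt with printf. This is only a sketch: the temporary file and its three lines are made up for illustration, and the octal escapes \342\200\223 are the UTF-8 bytes (E2 80 93) of U+2013.

```shell
# Build a three-line sample: "a", U+2013 (EN DASH), "b".
sample=$(mktemp)
printf 'a\n\342\200\223\nb\n' > "$sample"

# Count lines and bytes. The dash occupies three bytes, so the
# file is 8 bytes long: 2 ("a" + newline) + 4 (dash + newline) + 2.
lines=$(wc -l < "$sample" | tr -d ' ')
bytes=$(wc -c < "$sample" | tr -d ' ')
echo "lines=$lines bytes=$bytes"

rm -f "$sample"
```

The point of the byte count is that the one "character" on line 2 is three bytes on disk, which matters to any tool that matches bytes instead of code points.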
Recommended answer
I recommend avoiding dodgy grep -P implementations and using the real thing. This works:
perl -CSD -nle 'print "$.: $_" if /\P{ASCII}/' utfile1 utfile2 utfile3 ...
Where:
The -CSD option says that both the stdio trio (stdin, stdout, stderr) and disk files should be treated as UTF-8 encoded.
The $. represents the current record (line) number.
The $_ represents the current line.
The \P{ASCII} matches any code point that is not ASCII.