Problem description
I have a document with some special characters such as non-breaking space, non-breaking hyphen, and so on. I want to normalize this document and replace these special characters with a space. In addition, since the content of this document is gathered from different sources, it contains different forms of "Yeh" (ی), and I want to normalize them.
Is it possible to find and replace Unicode characters in a document using the sed command? Can I use Unicode code points instead of the surface form of the character? For example, can I use x00a0 instead of the non-breaking space in a sed command? How?
Sorry for the bad explanation. My documents are encoded in UTF-8 and contain non-English characters. For example, I have a document in Arabic, a document in Urdu, and one in Persian (Farsi). Now I want to replace some of the characters in these files with another character. By normalizing, I mean that I want to replace all forms of "Yeh" with one form. (As you might know, there are many forms of this character in use in Arabic, but for simplification and some processing issues I want to unify all these forms.)
Recommended answer
To process UTF-8 files, you have to parse each character from beginning to end. If you need to do it efficiently, you have to write a real program rather than trying to script a solution.
If you just want to script it, it is easier to convert it to UTF-16 and then process the characters.
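To see why UTF-16 makes this kind of scripting easier, the following sketch prints one code unit per line. It assumes `iconv` and `od` are available; `utf16_units` is a hypothetical helper name, and UTF-16BE with byte-wise `od` output is an illustrative choice so that the hex digits do not depend on the host byte order.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original answer): read UTF-8 on stdin
# and print one UTF-16 code unit per line as four hex digits.
utf16_units() {
    iconv -f UTF-8 -t UTF-16BE |   # big-endian, no byte-order mark
        od -An -v -tx1 |           # every byte as two hex digits, no offsets
        tr -s ' ' '\n' |           # one byte per line
        sed '/^$/d' |              # drop the empty lines left by od's layout
        paste -d '\0' - -          # join each pair of bytes into one code unit
}
```

For example, `printf 'a\302\240' | utf16_units` prints `0061` and then `00a0`, so the non-breaking space shows up as exactly the `00a0` value mentioned in the question. Characters outside the Basic Multilingual Plane appear as two code units (a surrogate pair).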
A fairly inefficient way would be:
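#!/bin/bash
# Print the bytes encoded by a string of hex digits, e.g. "000a" -> 0x00 0x0a.
function px {
    local a="$*"
    local i=0
    while [ "$i" -lt "${#a}" ]
    do
        printf "\\x${a:$i:2}"
        i=$((i+2))
    done
}
# Turn the input into one 16-bit code unit (4 hex digits) per line.
# -v stops od from collapsing repeated lines into "*".
(iconv -f UTF8 -t UTF16 | od -v -x | cut -b 9- | xargs -n 1) |
if read -r utf16header
then
    # The first unit is the byte-order mark; pass it through unchanged.
    px "$utf16header"
    out=''
    while read -r line
    do
        if [ "$line" == "000a" ]
        then
            # End of line: flush the buffered code units.
            out=$out$line
            px "$out"
            out=''
        else
            # put your conversion logic here.
            # e.g
            # if [ "$line" == "0031" ] ; then
            #     line="0041"
            # fi
            out=$out$line
        fi
    done
    # Flush a final line that has no trailing newline.
    px "$out"
fi | iconv -f UTF16 -t UTF8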
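For the specific replacements named in the question, a shorter route may be to let sed match the UTF-8 byte sequences directly. The sketch below is not part of the original answer and rests on several assumptions: `normalize_text` is a hypothetical name, U+06CC is assumed to be the target form of Yeh (it is the character shown in the question), and U+064A and U+0649 are assumed to be the variants to fold into it. The list would need extending for any other forms in the data.

```shell
#!/bin/sh
# Hypothetical helper: read UTF-8 on stdin, write normalized UTF-8 on stdout.
# The function body runs in a subshell so the variables stay local.
normalize_text() (
    # UTF-8 byte sequences, written as octal escapes for portability.
    nbsp=$(printf '\302\240')          # U+00A0 NO-BREAK SPACE
    nbhyphen=$(printf '\342\200\221')  # U+2011 NON-BREAKING HYPHEN
    arabic_yeh=$(printf '\331\212')    # U+064A ARABIC LETTER YEH
    alef_maksura=$(printf '\331\211')  # U+0649 ARABIC LETTER ALEF MAKSURA
    farsi_yeh=$(printf '\333\214')     # U+06CC ARABIC LETTER FARSI YEH (assumed target)

    # LC_ALL=C makes sed compare raw bytes. UTF-8 lead bytes never occur
    # inside another character, so a match cannot start mid-character.
    LC_ALL=C sed \
        -e "s/$nbsp/ /g" \
        -e "s/$nbhyphen/ /g" \
        -e "s/$arabic_yeh/$farsi_yeh/g" \
        -e "s/$alef_maksura/$farsi_yeh/g"
)
```

A call such as `normalize_text < input.txt > output.txt` (file names are placeholders) would then rewrite a whole document in one pass. Matching bytes rather than `\uXXXX`-style escapes is a deliberate choice here, since support for code point escapes differs between sed implementations.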