Two related questions. Perl 6 is so smart that it understands a grapheme as one character, whether it is one Unicode symbol (like ä, U+00E4) or two or more combined symbols (like p̄ and ḏ̣). This short snippet
my @symb;
@symb.push("ä");
@symb.push("p" ~ 0x304.chr); # "p̄"
@symb.push("ḏ" ~ 0x323.chr); # "ḏ̣"
say "$_ has {$_.chars} character" for @symb;
gives the following output:
ä has 1 character
p̄ has 1 character
ḏ̣ has 1 character
But sometimes I would like to be able to do the following.
1) Remove diacritics from ä. So I need some method like
"ä".mymethod → "a"
2) Split "combined" symbols into parts, i.e. split p̄ into p and Combining Macron U+0304. E.g. something like the following in bash:
$ echo p̄ | grep . -o | wc -l
2
Perl 6 has great Unicode processing support in the Str class. To do what you are asking in (1), you can use the samemark method/routine.
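Per the documentation:
multi sub samemark(Str:D $string, Str:D $pattern --> Str:D)
method samemark(Str:D: Str:D $pattern --> Str:D)
Returns a copy of $string with the mark/accent information for each character changed such that it matches the mark/accent of the corresponding character in $pattern. If $string is longer than $pattern, the remaining characters in $string receive the same mark/accent as the last character in $pattern. If $pattern is empty, no changes will be made.
Examples:
say 'åäö'.samemark('aäo'); # OUTPUT: «aäo»
say 'åäö'.samemark('a'); # OUTPUT: «aao»
say samemark('Pêrl', 'a'); # OUTPUT: «Perl»
say samemark('aöä', ''); # OUTPUT: «aöä»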
This can be used both to remove marks/diacritics from letters, as well as to add them.
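To get something close to the "ä".mymethod asked for in (1), samemark can be wrapped in a small helper. This is only a sketch: the name strip-marks is made up for illustration and is not part of the Str API. It relies on the documented rule that characters beyond the end of the pattern take the marks of the pattern's last character, so a single unaccented letter as the pattern clears the marks across the whole string.

```raku
# Hypothetical helper; the name strip-marks is an assumption, not a built-in.
sub strip-marks(Str:D $text --> Str:D) {
    # 'a' carries no marks, and samemark reuses the last pattern
    # character for every remaining character of $text.
    $text.samemark('a')
}

say strip-marks('ä');    # OUTPUT: «a»
say strip-marks('Pêrl'); # OUTPUT: «Perl»
```

Passing a plain letter as the pattern keeps the call short; any unaccented character should serve the same purpose.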
For (2), there are a few ways to do this (TIMTOWTDI). If you want all the codepoints in a string, the ords method returns them as a List (technically a Positional).
say "p̄".ords; # OUTPUT: «(112 772)»
You can use the uniname method/routine to get the Unicode name for a codepoint:
.uniname.say for "p̄".ords; # OUTPUT: «LATIN SMALL LETTER P␤COMBINING MACRON»
or just use the uninames method/routine:
.say for "p̄".uninames; # OUTPUT: «LATIN SMALL LETTER P␤COMBINING MACRON»
If you just want the number of codepoints in the string, you can use codes:
say "p̄".codes; # OUTPUT: «2»
This is different from chars, which counts the number of characters (graphemes) in the string:
say "p̄".chars; # OUTPUT: «1»
Also see @hobbs' answer using NFD.
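That answer is not reproduced here, so the following is only a sketch of how NFD can help with (2), not a copy of it. The point it illustrates is that ords reports a precomposed character such as ä as a single codepoint, while the NFD method decomposes the string first, so the base letter and the combining mark appear separately. The variable name @parts is chosen for illustration.

```raku
# A precomposed character stays one codepoint under .ords.
say "ä".ords;     # OUTPUT: «(228)»

# .NFD decomposes it into base letter (97) and COMBINING DIAERESIS (776).
say "ä".NFD.list; # OUTPUT: «(97 776)»

# Split a grapheme into its parts: one single-codepoint string each.
my @parts = ("p" ~ 0x304.chr).NFD.list.map(*.chr);
say @parts.elems; # OUTPUT: «2»
say @parts[0];    # OUTPUT: «p»
```

The second element of @parts is the bare Combining Macron (U+0304), which may render oddly on its own in a terminal.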