perl - Need to split a Unicode string

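I am using the Moses toolkit for my translation system. I trained it on an Assamese-English parallel corpus, but some proper nouns are left untranslated because my corpus (the parallel data set) is very small. So I want to add a transliteration step to my translation system.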

I translate with this command:

echo 'কানাদাএখনদেশ' | ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini

This gives me the output "কানাদা is a vast country".

That happens because the word "কানাদা" is not in my parallel corpus.

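So I took a parallel list of Assamese and English words and split every word into characters. Each line of the two files therefore holds a single word, with a space between every character (or every syllable). I used these two files to train a system as an ordinary translation task.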
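The "character or syllable" choice above can be sketched in Perl. This is only an illustration under assumptions: the sub names `split_code_points` and `split_clusters` and the sample words are hypothetical, and a real script would read the word list from a file. The first sub puts a space between every Unicode code point; the second uses `\X` (extended grapheme clusters), which is one possible approximation of "syllables" because it keeps a consonant together with its dependent vowel sign.

```perl
use strict;
use warnings;
use utf8;                      # this file contains UTF-8 string literals
use open qw(:std :utf8);       # encode STDOUT as UTF-8

# Split a word into Unicode code points: one token per character.
sub split_code_points {
    my ($word) = @_;
    return join ' ', split //, $word;
}

# Split a word into extended grapheme clusters (\X). Exact cluster
# boundaries depend on the Unicode version bundled with your perl.
sub split_clusters {
    my ($word) = @_;
    return join ' ', $word =~ /\X/g;
}

# Hypothetical sample words; a real list would come from a file.
for my $word ('কানাদা', 'ভাৰত') {
    print split_code_points($word), "\n";
    print split_clusters($word), "\n";
}
```

Whichever granularity is chosen, the same splitting has to be applied both to the training files and to the input at translation time, otherwise the tokens will not match.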

Then I use the following command:

echo 'কানাদাএখন' | ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./space.pl

This gives me the output "ক is a vast country".

I had to break the word apart because I trained the system character by character.

Then I run the trained transliteration system with the following command:

echo "কানাদা" | ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./space.pl | ~/mymoses/bin/moses -f ~/work1/train/model/moses.ini

This gives me the output "Canada is a vast country".

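The characters are transliterated, but the one remaining problem is the spaces between them. So I want to use a Perl file that joins the word back together. My final command is:

echo "কানাদা" | ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./space.pl | ~/mymoses/bin/moses -f ~/work1/train/model/moses.ini | ./join.pl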

Please help me with this "join.pl" file.

Best answer

How about:

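use utf8;                  # the script itself contains UTF-8 literals
use feature 'say';         # enables say
use open qw(:std :utf8);   # encode output as UTF-8
my $str = "ভাৰত is a famous country. দিল্লী is the capital of ভাৰত";
# Insert a space after every character in U+0980..U+09FF that is followed by another one
$str =~ s/([\x{0980}-\x{09FF}])(?=[\x{0980}-\x{09FF}])/$1 /g;
say $str;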

Output:

ভ া ৰ ত is a famous country. দ ি ল ্ ল ী is the capital of ভ া ৰ ত

You can use it in your program by simply changing the while loop to:

use open qw(:std :utf8);   # decode STDIN and encode STDOUT as UTF-8 so the \x{...} ranges can match
while (<>) {
    s/([\x{0980}-\x{09FF}])(?=[\x{0980}-\x{09FF}])/$1 /g;
    print;
}

But I guess what you actually want is:

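use utf8;                  # the script itself contains UTF-8 literals
use feature 'say';         # enables say
use open qw(:std :utf8);   # encode output as UTF-8
# Map each character to its Latin transliteration
my %corresp = (
    'ভ' => 'Bh',
    'া' => 'a',
    'ৰ' => 'ra',
    'ত' => 't',
);
my $str = "ভাৰত is a famous country. দিল্লী is the capital of ভাৰত";
# Replace mapped characters and leave unmapped ones untouched
$str =~ s/([\x{0980}-\x{09FF}])/exists($corresp{$1}) ? $corresp{$1} : $1/eg;
say $str;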

Output:

Bharat is a famous country. দিল্লী is the capital of Bharat

Note: it is up to you to build the real correspondence hash. I know nothing about Assamese characters.
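As for the joining step the question asks about, here is a rough sketch rather than a tested solution. It assumes the transliteration system emits each character as a separate one-character token (for example `c a n a d a`), which I cannot verify from the question. The sub name `join_single_chars` and the sample line are made up for illustration. The idea is to delete a space only when the tokens on both sides of it are exactly one character long.

```perl
use strict;
use warnings;
use feature 'say';

# Merge runs of single-character tokens: delete a space only when it sits
# between two tokens that are each exactly one non-space character long.
sub join_single_chars {
    my ($line) = @_;
    $line =~ s/(?<!\S)(\S) (?=\S(?!\S))/$1/g;
    return $line;
}

# Hypothetical sample line, assuming one character per token in the
# transliterated part of the output.
say join_single_chars('c a n a d a is a vast country');
```

To use this as a filter at the end of the pipeline, the same sub could be applied to every line read from STDIN. Be aware that it is a heuristic: two genuine one-letter words that happen to be adjacent would also be glued together, so marking word boundaries explicitly before splitting would be more robust.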