Problem description
This question follows on from my previous question, but it is a different one. Synopse's Delphi hyphenation is very fast and is built on the OpenOffice libhnj library, which uses TeX hyphenation.
A simple test: if I input 'pronunciation', the Synopse hyphenation outputs 'pro=nun=ci=ation' (4 segments, counted here as syllables), not 'pro=nun=ci=a=tion' (5 segments).
I read two papers (here and here) about using the TeX hyphenation algorithm for syllabification. The authors report about 95% accuracy in syllabification. I tested the Synopse hyphenation on the CMU Pronouncing Dictionary, only for counting syllables, and got only about 53% accuracy.
Why are the results so different?
Here is my method in a little more detail.
I parse the CMU Pronouncing Dictionary to compute the number of syllables of every word. The CMU dictionary looks like this:
PRONOUNS P R OW1 N AW0 N Z
PRONOVOST P R OW0 N OW1 V OW0 S T
PRONTO P R AA1 N T OW0
PRONUNCIATION P R OW0 N AH2 N S IY0 EY1 SH AH0 N
PRONUNCIATION(1) P R AH0 N AH2 N S IY0 EY1 SH AH0 N
I get this result:
PRONOUNS=2
PRONOVOST=3
PRONTO=2
PRONUNCIATION(1)=5 // will be ignored
PRONUNCIATION=5 // use this one
Words with parentheses are ignored when comparing with the Synopse hyphenation library. They are alternative or secondary pronunciations (variants).
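The counting step above can be sketched in Python. This is only a minimal sketch under stated assumptions: the function names `count_syllables` and `parse_cmu` are hypothetical, and it relies on the CMU convention that every vowel phoneme ends in a stress digit (0, 1 or 2), so the number of such phonemes is taken as the syllable count.

```python
def count_syllables(phonemes):
    """Count phonemes ending in a stress digit (assumed: one per syllable)."""
    return sum(1 for p in phonemes if p[-1].isdigit())


def parse_cmu(lines):
    """Map each primary-pronunciation word to its syllable count."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # blank or malformed line
        word, phonemes = parts[0], parts[1:]
        # Skip variants such as PRONUNCIATION(1) and entries that do not
        # start with a letter (punctuation entries, comment lines).
        if "(" in word or not word[0].isalpha():
            continue
        counts[word] = count_syllables(phonemes)
    return counts


sample = [
    "PRONOUNS P R OW1 N AW0 N Z",
    "PRONOVOST P R OW0 N OW1 V OW0 S T",
    "PRONTO P R AA1 N T OW0",
    "PRONUNCIATION P R OW0 N AH2 N S IY0 EY1 SH AH0 N",
    "PRONUNCIATION(1) P R AH0 N AH2 N S IY0 EY1 SH AH0 N",
]
counts = parse_cmu(sample)
for word, n in counts.items():
    print(f"{word}={n}")
```

On the five sample lines this keeps four words and drops the `(1)` variant, matching the listing above.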
Similarly, I use the hyphenation library to compute the number of syllables of each word in the CMU dictionary. Then I compare the two counts to see how many match. Words whose syllable counts differ are recorded like this:
...
94814 cmu PROMULGATED=4 | PROMULGATED=3 Synopse Hyphenation
94821 cmu PRONGER=2 | PRONGER=1 Synopse Hyphenation
94829 cmu PRONOUNCES=3 | PRONOUNCES=2 Synopse Hyphenation
94833 cmu PRONTO=2 | PRONTO=1 Synopse Hyphenation
94835 cmu PRONUNCIATION=5 | PRONUNCIATION=4 Synopse Hyphenation
...
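The comparison that produces a mismatch list like the one above might look like the following Python sketch. It assumes the hyphenator returns a string with `=` at each break point and that the syllable count is taken as the number of segments. The helper names and the `PRO=NOUNS` sample output are hypothetical; the other two outputs are consistent with the counts reported above.

```python
def segments(hyphenated, mark="="):
    """Number of segments in a hyphenated word: break points plus one."""
    return hyphenated.count(mark) + 1


def mismatches(cmu_counts, hyphenated_words):
    """Return (word, cmu_count, hyphenation_count) where the two differ."""
    diff = []
    for word, hyph in hyphenated_words.items():
        expected = cmu_counts.get(word)
        got = segments(hyph)
        if expected is not None and expected != got:
            diff.append((word, expected, got))
    return diff


cmu_counts = {"PRONOUNS": 2, "PRONTO": 2, "PRONUNCIATION": 5}
hyphenated = {
    "PRONOUNS": "PRO=NOUNS",              # hypothetical output
    "PRONTO": "PRONTO",                   # no break point found
    "PRONUNCIATION": "PRO=NUN=CI=ATION",
}
for word, cmu_n, hyph_n in mismatches(cmu_counts, hyphenated):
    print(f"cmu {word}={cmu_n} | {word}={hyph_n} Synopse Hyphenation")
```

Note that under this counting rule a word with no break point at all is scored as one syllable, which is how an entry such as PRONTO=1 arises.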
The total number of CMU lines used is 123611 (excluding lines with parentheses and lines without meaningful words, such as punctuation lines like '('). The number of words for which the two syllable counts differ is 57870.
CMU may not be the standard for syllable counts. In this test the agreement is (123611-57870)/123611 ≈ 53.18%. This is significantly different from the accuracy stated by the authors in the papers above. Of course, they used another database (CELEX) for their tests. Why is the result so different?
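For reference, the agreement figure can be recomputed from the two totals with a few lines of Python. The function name `accuracy` is hypothetical; the inputs are the totals quoted above.

```python
def accuracy(total, mismatched):
    """Fraction of words whose two syllable counts agree."""
    return (total - mismatched) / total


acc = accuracy(123611, 57870)
print(f"{acc:.2%}")
```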
The Synopse hyphenation library is very fast. I would like to know whether the difference is due to the pattern file (the dic file used for hyphenation, originally from libhnj as used in OpenOffice), or whether the authors of the papers used a different dictionary file.
Recommended answer
In short, I believe the difference in accuracy between what was reported in our SPIRE 2009 paper and the results reported here is so great because we trained the method ourselves, instead of using patterns generated through prior training (which, from what I can gather, is what you are doing here).
How we performed the training to obtain our patterns is described briefly on the third page of our paper (p. 176) and in more detail in Section 4.3 of my thesis, which you can find here: http://web.cs.dal.ca/~adsett/Adsett_SyllAlgs_2008.pdf