Question
This question follows on from a previous question, but is different. Synopse's Delphi hyphenation is very fast and builds on the OpenOffice libhnj library, which uses TeX hyphenation.
A simple test:
If I input 'pronunciation', the Synopse hyphenation outputs 'pro=nun=ci=ation' (4 segments, which I count as 4 syllables), not 'pro=nun=ci=a=tion' (5 segments, or 5 syllables).
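The counting convention in this test can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and it assumes the hyphenator returns the word with '=' at each break point, as in the example above.

```python
def syllables_from_hyphenation(hyphenated: str, sep: str = "=") -> int:
    """Treat each piece between break markers as one syllable.

    Assumption: the hyphenator marks every allowed break with `sep`,
    so the number of pieces is the number of markers plus one.
    """
    return hyphenated.count(sep) + 1


print(syllables_from_hyphenation("pro=nun=ci=ation"))   # 3 markers, 4 pieces
print(syllables_from_hyphenation("pro=nun=ci=a=tion"))  # 4 markers, 5 pieces
```

Note that a hyphenator only reports places where a line break is acceptable, so this count is a lower bound on the real syllable count rather than a direct measurement of it.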
I read 2 papers (here and here) about using the TeX hyphenation algorithm for syllabification. The authors reported about 95% accuracy in syllabification. I tested Synopse hyphenation only for counting syllables against the CMU Pronouncing Dictionary, and got only about 53% accuracy.
Why is the result significantly different?
Here is my method in a little more detail.
I parse the CMU Pronouncing Dictionary to compute the number of syllables of every word. The CMU dictionary looks like this:
PRONOUNS P R OW1 N AW0 N Z
PRONOVOST P R OW0 N OW1 V OW0 S T
PRONTO P R AA1 N T OW0
PRONUNCIATION P R OW0 N AH2 N S IY0 EY1 SH AH0 N
PRONUNCIATION(1) P R AH0 N AH2 N S IY0 EY1 SH AH0 N
I get the following results:
PRONOUNS=2
PRONOVOST=3
PRONTO=2
PRONUNCIATION(1)=5 // will be ignored
PRONUNCIATION=5 // use this one
Words with parentheses are ignored when comparing with the Synopse hyphenation lib. They are alternative or secondary pronunciations (variants).
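A minimal Python sketch of this counting step follows. It relies on one assumption that the question does not spell out: that syllables are counted as vowel phonemes, which in the CMU dictionary are the phonemes ending in a stress digit (0, 1 or 2). The function name is hypothetical.

```python
def cmu_syllable_counts(lines):
    """Map each primary CMU entry to its syllable count.

    Assumption: one syllable per vowel phoneme, and vowel phonemes are
    the ones that end in a stress digit (e.g. OW1, AH0).
    """
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # blank or malformed line
        word, phones = parts[0], parts[1:]
        # Skip variants such as PRONUNCIATION(1) and punctuation entries.
        if "(" in word or not word[0].isalpha():
            continue
        counts[word] = sum(1 for p in phones if p[-1].isdigit())
    return counts


sample = [
    "PRONOUNS P R OW1 N AW0 N Z",
    "PRONOVOST P R OW0 N OW1 V OW0 S T",
    "PRONTO P R AA1 N T OW0",
    "PRONUNCIATION P R OW0 N AH2 N S IY0 EY1 SH AH0 N",
    "PRONUNCIATION(1) P R AH0 N AH2 N S IY0 EY1 SH AH0 N",
]
for word, n in cmu_syllable_counts(sample).items():
    print(f"{word}={n}")
```

On the five sample lines this reproduces the counts listed above and drops the (1) variant.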
Similarly, I use the hyphenation lib to compute the number of syllables of each word in the CMU dictionary. Then I compare the two to see how many match. The words with different syllable counts are recorded like this:
...
94814 cmu PROMULGATED=4 | PROMULGATED=3 Synopse Hyphenation
94821 cmu PRONGER=2 | PRONGER=1 Synopse Hyphenation
94829 cmu PRONOUNCES=3 | PRONOUNCES=2 Synopse Hyphenation
94833 cmu PRONTO=2 | PRONTO=1 Synopse Hyphenation
94835 cmu PRONUNCIATION=5 | PRONUNCIATION=4 Synopse Hyphenation
...
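The comparison step could look roughly like the Python sketch below. The function name and the two input dictionaries are illustrative; the sample values are the five mismatches from the listing above, so the accuracy on this tiny sample says nothing about the full dictionary.

```python
def compare_counts(cmu, hyph):
    """Return (mismatches, accuracy) over the words present in both maps."""
    shared = [w for w in sorted(cmu) if w in hyph]
    mismatches = [(w, cmu[w], hyph[w]) for w in shared if cmu[w] != hyph[w]]
    accuracy = (len(shared) - len(mismatches)) / len(shared) if shared else 0.0
    return mismatches, accuracy


# Counts copied from the mismatch listing above.
cmu_counts = {"PROMULGATED": 4, "PRONGER": 2, "PRONOUNCES": 3,
              "PRONTO": 2, "PRONUNCIATION": 5}
hyph_counts = {"PROMULGATED": 3, "PRONGER": 1, "PRONOUNCES": 2,
               "PRONTO": 1, "PRONUNCIATION": 4}

mismatches, accuracy = compare_counts(cmu_counts, hyph_counts)
for word, c, h in mismatches:
    print(f"cmu {word}={c} | {word}={h} Synopse Hyphenation")
```

In every listed mismatch the hyphenation count is lower than the CMU count, which is what one would expect if the patterns suppress some syllable boundaries as undesirable line breaks.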
The CMU dictionary has 123611 lines in total (excluding lines with parentheses and lines without meaningful words, such as punctuation entries like '('). The number of words whose syllable counts differ between the two is 57870.
CMU may not be the standard for syllable counts. In this test, the accuracy is (123611 - 57870) / 123611 ≈ 53.18%. This is significantly different from the accuracy rate stated by the authors in the papers above. Of course, they used another database (CELEX) for their tests. Why is the result so different?
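For reference, the arithmetic behind that figure, written out in Python (the variable names are only for illustration):

```python
total_words = 123611      # CMU entries kept for the comparison
different = 57870         # words whose two syllable counts disagree

matching = total_words - different
accuracy = matching / total_words

print(f"{matching} of {total_words} words match ({accuracy:.2%})")
```

So 65741 words agree, a little over half of the dictionary.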
The Synopse hyphenation library is very fast. I would like to know whether the difference is due to the pattern file (the dic file used for hyphenation, originally from libhnj as used in OpenOffice). Or did the authors of the paper use a different dictionary file?
Recommended answer
In short, I believe the difference in accuracy between what was reported in our SPIRE 2009 paper and the results reported here is so large because we trained the method ourselves instead of using patterns generated through prior training (which, from what I can gather, is what you are doing here).
How we performed training to obtain our patterns is described briefly on the third page of our paper (pg. 176) and in more detail in Section 4.3 of my thesis, which you can find here: http://web.cs.dal.ca/~adsett/Adsett_SyllAlgs_2008.pdf
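To illustrate why the pattern set matters so much, here is a minimal Python sketch of Liang-style pattern matching, the general scheme behind TeX hyphenation. It is not the Synopse or libhnj code, it omits details such as minimum prefix and suffix lengths, and both pattern sets are made up for this example; they are not taken from any real dic file or from the trained patterns in the paper.

```python
import re


def parse_pattern(pattern):
    """Split a pattern such as 'n1c' into its letters and gap weights."""
    letters = re.sub(r"\d", "", pattern)
    weights = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            weights[pos] = int(ch)   # weight of the gap before letter `pos`
        else:
            pos += 1
    return letters, weights


def hyphenate(word, patterns):
    """Insert '=' at every gap whose highest matching weight is odd."""
    table = dict(parse_pattern(p) for p in patterns)
    dotted = "." + word.lower() + "."        # '.' marks the word boundaries
    scores = [0] * (len(dotted) + 1)         # scores[i]: gap before dotted[i]
    for start in range(len(dotted)):
        for end in range(start + 1, len(dotted) + 1):
            weights = table.get(dotted[start:end])
            if weights:
                for offset, w in enumerate(weights):
                    scores[start + offset] = max(scores[start + offset], w)
    pieces, last = [], 0
    for i in range(1, len(word)):
        if scores[i + 1] % 2 == 1:           # gap between word[i-1] and word[i]
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "=".join(pieces)


# Hypothetical pattern sets: B has one extra pattern that A lacks.
PATTERNS_A = [".pro1", "n1c", "i1a"]
PATTERNS_B = PATTERNS_A + ["1tion"]

print(hyphenate("pronunciation", PATTERNS_A))
print(hyphenate("pronunciation", PATTERNS_B))
```

With the first set the sketch yields 'pro=nun=ci=ation' and with the second 'pro=nun=ci=a=tion': the same algorithm gives a different segment count as soon as the patterns change. Patterns built for typesetting and patterns trained for syllabification can therefore score very differently on a syllable-counting test.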