How to read values from numbers written as words?

Question

As we all know, numbers can be written either as numerals or by their names. While there are a lot of examples to be found that convert 123 into "one hundred twenty three", I could not find good examples of how to convert it the other way around.

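Some caveats:

  1. cardinal/nominal or ordinal: "one" and "first"
  2. common spelling mistakes: "forty"/"fourty"
  3. hundreds/thousands: 2100 -> "twenty one hundred" and also "two thousand and one hundred"
  4. separators: "eleven hundred fifty two", but also "elevenhundred fiftytwo" or "eleven-hundred fifty-two" and whatnot
  5. colloquialisms: "thirty-something"
  6. fragments: 'one third', 'two fifths'
  7. common names: 'a dozen', 'half'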

And there are probably more caveats that are not yet listed. Suppose the algorithm needs to be very robust, and even understand spelling mistakes.

What fields/papers/studies/algorithms should I read to learn how to write all this? Where is the information?

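PS: My final parser should actually understand 3 different languages: English, Russian, and Hebrew. More languages may be added at a later stage. Hebrew also has male/female numbers: "one man" and "one woman" use a different "one", "ehad" and "ahat". Russian also has some complexities of its own.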

Google does a great job at this, for example:

http://www.google.com/search?q=two+thousand+and+one+hundred+plus+five+dozen+and+four+fifths+in+decimal

(The reverse is also possible: http://www.google.com/search?q=999999999999+in+english)

Recommended answer

I was playing around with a PEG parser to do what you wanted (and may post that as a separate answer later) when I noticed that there's a very simple algorithm that does a remarkably good job with common forms of numbers in English, Spanish, and German, at the very least.

Working with English, for example, you need a dictionary that maps words to values in the obvious way:

"one" -> 1, "two" -> 2, ... "twenty" -> 20,
"dozen" -> 12, "score" -> 20, ...
"hundred" -> 100, "thousand" -> 1000, "million" -> 1000000

...and so forth.

The algorithm is just:

total = 0
prior = null
for each word w
    v <- value(w) or next if no value defined
    prior <- case
        when prior is null: v
        when prior > v:     prior + v
        else                prior * v
    if w in {thousand, million, billion, trillion, ...}
        total <- total + prior
        prior <- null
total <- total + prior unless prior is null
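
The pseudocode above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the names `VALUES`, `GROUP_WORDS`, and `words_to_number` are made up here, the word table is deliberately small, and treating "half" as 0.5 and accepting the misspelling "fourty" are illustrative dictionary choices rather than part of the algorithm itself.

```python
# Hypothetical word table (an assumption for this sketch); extend it with
# ordinals, misspellings, or other languages as needed.
VALUES = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
    "twenty": 20, "thirty": 30, "forty": 40, "fourty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
    "half": 0.5, "dozen": 12, "score": 20,
    "hundred": 100, "thousand": 1000,
    "million": 10**6, "billion": 10**9, "trillion": 10**12,
}

# Words that close a group: the running value is flushed into the total.
GROUP_WORDS = {"thousand", "million", "billion", "trillion"}


def words_to_number(text):
    """Convert a phrase such as 'four score and seven' to a number."""
    total = 0
    prior = None
    # Treat hyphens as separators so "fifty-two" becomes two words.
    for word in text.lower().replace("-", " ").split():
        value = VALUES.get(word)
        if value is None:
            continue  # skip words with no value, such as "and" or "a"
        if prior is None:
            prior = value
        elif prior > value:
            prior += value   # e.g. "eighty" then "seven" gives 87
        else:
            prior *= value   # e.g. "four" then "score" gives 80
        if word in GROUP_WORDS:
            total += prior
            prior = None
    if prior is not None:
        total += prior
    return total


print(words_to_number("four score and seven"))
print(words_to_number("two thousand and one hundred"))
```

Unknown words are skipped silently, which is what lets "and" and "a" pass through; a stricter parser would probably want to report them instead.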

For example, this progresses as follows:

total    prior      v     unconsumed string
    0      _              four score and seven
                    4     score and seven
    0      4
                   20     and seven
    0     80
                    _     seven
    0     80
                    7
    0     87
   87

total    prior      v     unconsumed string
    0        _            two million four hundred twelve thousand eight hundred seven
                    2     million four hundred twelve thousand eight hundred seven
    0        2
                  1000000 four hundred twelve thousand eight hundred seven
2000000      _
                    4     hundred twelve thousand eight hundred seven
2000000      4
                    100   twelve thousand eight hundred seven
2000000    400
                    12    thousand eight hundred seven
2000000    412
                    1000  eight hundred seven
2000000  412000
                    1000  eight hundred seven
2412000     _
                      8   hundred seven
2412000     8
                     100  seven
2412000   800
                     7
2412000   807
2412807

And so on. I'm not saying it's perfect, but for a quick and dirty approach it does quite well.

Addressing your specific list on edit:

  1. cardinal/nominal or ordinal: "one" and "first" -- just put them in the dictionary
  2. english/british: "fourty"/"forty" -- ditto
  3. hundreds/thousands: 2100 -> "twenty one hundred" and also "two thousand and one hundred" -- works as is
  4. separators: "eleven hundred fifty two", but also "elevenhundred fiftytwo" or "eleven-hundred fifty-two" and whatnot -- just define "next word" to be the longest prefix that matches a defined word, or up to the next non-word if none do, for a start
  5. colloquialisms: "thirty-something" -- works
  6. fragments: 'one third', 'two fifths' -- uh, not yet...
  7. common names: 'a dozen', 'half' -- works; you can even do things like "a half dozen"
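
The "longest prefix" rule from item 4 could be sketched like this. It is a minimal illustration, not a tested tokenizer: the function name `split_number_words` and the tiny vocabulary in the usage lines are assumptions made for the example.

```python
def split_number_words(text, vocabulary):
    """Split text into words, preferring the longest known prefix.

    At each position the longest vocabulary word that matches is taken;
    if none matches, everything up to the next non-letter is taken as
    one unknown word.
    """
    # Try longer words first so "seventy" wins over "seven".
    candidates = sorted(vocabulary, key=len, reverse=True)
    text = text.lower()
    tokens = []
    i = 0
    while i < len(text):
        if not text[i].isalpha():
            i += 1  # separators: spaces, hyphens, punctuation
            continue
        for word in candidates:
            if text.startswith(word, i):
                tokens.append(word)
                i += len(word)
                break
        else:
            # No known word starts here: consume the run of letters.
            j = i
            while j < len(text) and text[j].isalpha():
                j += 1
            tokens.append(text[i:j])
            i = j
    return tokens


vocab = {"eleven", "hundred", "fifty", "two", "seven", "seventy"}
print(split_number_words("elevenhundred fiftytwo", vocab))
print(split_number_words("eleven-hundred fifty-two", vocab))
```

One consequence of this rule is that a known number word at the start of an unrelated word is split off (for example "oneself" would yield "one" and "self" if "one" is in the vocabulary), so it is only a starting point, as the answer says.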

Number 6 is the only one I don't have a ready answer for, and that's because of the ambiguity between ordinals and fractions (in English at least), added to the fact that my last cup of coffee was many hours ago.
