Erlang was reported to have been used in production systems for over 20 years with an uptime percentage of 99.9999999%.
I did the math as follows:
20*365.25*24*60*60*(1 - 0.999999999) == 0.631 s
That means the system had less than one second of downtime over a period of 20 years. I am not trying to challenge the validity of this; I am just curious how a system can be shut down (on purpose or by accident) for only 0.631 seconds. Could anyone who is familiar with large software systems explain this to us? Thank you.
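For reference, the arithmetic above can be reproduced with a short Python sketch. The function name `allowed_downtime_seconds` is just an illustrative helper, and the 365.25-day year is the same assumption used in the calculation above.

```python
# Average year length in seconds, using 365.25 days as in the calculation above.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def allowed_downtime_seconds(years: float, availability: float) -> float:
    """Return the total downtime (in seconds) implied by an availability
    fraction (e.g. 0.999999999 for nine nines) over the given number of years."""
    return years * SECONDS_PER_YEAR * (1.0 - availability)


if __name__ == "__main__":
    downtime = allowed_downtime_seconds(20, 0.999999999)
    print(f"{downtime:.3f} s")  # roughly 0.631 s, subject to floating-point rounding
```

The same helper shows how sensitive the figure is to the observation window: over a single year, nine nines leaves only about 0.03 seconds.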
Does anyone know how to calculate the downtime of a service over a cluster of processing units (or machines)?
Solution

The reliability figure wasn't supposed to measure the total time that any part of the AXD301 (the project in question) was shut down over those 20 years. It represents the total time over those 20 years that the service provided by the AXD301 system was offline. Subtle difference. As Joe Armstrong says here:
If you dig a bit deeper, in the PhD thesis written by Joe, the original author of Erlang (which includes a case study of the AXD301), you read:
So, as long as the network that the switch was a part of was running without downtime, the author can state "nine nines reliability" for the AXD301 (which was all he ever said, avoiding specifics). It doesn't necessarily mean Erlang is the only cause of such high reliability.
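To illustrate the difference between component downtime and service downtime, here is a minimal Python sketch. It rests on a strong assumption that is not from the AXD301 study: the service counts as offline only while every node is down at the same time (full redundancy). The function name `service_downtime` and the outage intervals are hypothetical.

```python
def service_downtime(node_outages):
    """Total time during which ALL nodes were down simultaneously.

    node_outages: one list per node of (start, end) outage intervals,
    in seconds. Intervals of a single node are assumed not to overlap.
    Assumption: the service is offline only when every node is down.
    """
    n = len(node_outages)
    events = []
    for outages in node_outages:
        for start, end in outages:
            events.append((start, 1))   # node goes down
            events.append((end, -1))    # node comes back up
    # Sort by time; at equal timestamps, recoveries (-1) come before failures (+1).
    events.sort()

    down = 0               # number of nodes currently down
    all_down_since = None  # start of the current "everything is down" window
    total = 0.0
    for t, delta in events:
        if all_down_since is not None:
            total += t - all_down_since  # close the current full-outage window
            all_down_since = None
        down += delta
        if n > 0 and down == n:
            all_down_since = t
    return total


if __name__ == "__main__":
    # Hypothetical outage logs for two nodes (seconds since start of observation).
    outages = [
        [(0, 10), (50, 60)],   # node A: 20 s of downtime in total
        [(5, 8), (100, 120)],  # node B: 23 s of downtime in total
    ]
    print(service_downtime(outages))  # 3.0: only 5..8 has both nodes down
```

In this made-up example the nodes accumulate 43 seconds of individual downtime, yet the service is unavailable for only 3 seconds. Service availability over an observation period would then be `1 - service_downtime(...) / period`.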
EDIT: In fact, "20 years" itself seems like a misinterpretation. Joe mentions a figure of 20 years in the same article, but it's not actually connected to the nine-nines reliability figure, which potentially came out of a much shorter study (as others have mentioned).