Problem description

I've been working with H2O for the last year, and I am getting very tired of server crashes. I have given up on "nightly releases", as they are easily crashed by my data sets. Please tell me where I can download a release that is stable.

Charles

My environment is:

  • Windows 10 Enterprise, build 1607, with 64 GB of RAM.
  • Java SE Development Kit 8 Update 77 (64-bit).
  • Anaconda Python 3.6.2-0.

I started the server with:

localH2O = h2o.init(ip = "localhost",
                    port = 54321,
                    max_mem_size="12G",
                    nthreads = 4)

The h2o init information is:

H2O cluster uptime:         12 hours 12 mins
H2O cluster version:        3.10.5.2
H2O cluster version age:    1 month and 6 days
H2O cluster name:           H2O_from_python_Charles_ji1ndk
H2O cluster total nodes:    1
H2O cluster free memory:    6.994 Gb
H2O cluster total cores:    8
H2O cluster allowed cores:  4
H2O cluster status:         locked, healthy
H2O connection url:         http://localhost:54321
H2O connection proxy:
H2O internal security:      False
Python version:             3.6.2 final

The crash information is:

OSError: Job with key $03017f00000132d4ffffffff$_a0ce9b2c855ea5cff1aa58d65c2a4e7c failed with an exception: java.lang.AssertionError: I am really confused about the heap usage; MEM_MAX=11453595648 heapUsedGC=11482667352
stacktrace:
java.lang.AssertionError: I am really confused about the heap usage; MEM_MAX=11453595648 heapUsedGC=11482667352
    at water.MemoryManager.set_goals(MemoryManager.java:97)
    at water.MemoryManager.malloc(MemoryManager.java:265)
    at water.MemoryManager.malloc(MemoryManager.java:222)
    at water.MemoryManager.arrayCopyOfRange(MemoryManager.java:291)
    at water.AutoBuffer.expandByteBuffer(AutoBuffer.java:719)
    at water.AutoBuffer.putA4f(AutoBuffer.java:1355)
    at hex.deeplearning.Storage$DenseRowMatrix$Icer.write129(Storage$DenseRowMatrix$Icer.java)
    at hex.deeplearning.Storage$DenseRowMatrix$Icer.write(Storage$DenseRowMatrix$Icer.java)
    at water.Iced.write(Iced.java:61)
    at water.AutoBuffer.put(AutoBuffer.java:771)
    at water.AutoBuffer.putA(AutoBuffer.java:883)
    at hex.deeplearning.DeepLearningModelInfo$Icer.write128(DeepLearningModelInfo$Icer.java)
    at hex.deeplearning.DeepLearningModelInfo$Icer.write(DeepLearningModelInfo$Icer.java)
    at water.Iced.write(Iced.java:61)
    at water.AutoBuffer.put(AutoBuffer.java:771)
    at hex.deeplearning.DeepLearningModel$Icer.write105(DeepLearningModel$Icer.java)
    at hex.deeplearning.DeepLearningModel$Icer.write(DeepLearningModel$Icer.java)
    at water.Iced.write(Iced.java:61)
    at water.Iced.asBytes(Iced.java:42)
    at water.Value.<init>(Value.java:348)
    at water.TAtomic.atomic(TAtomic.java:22)
    at water.Atomic.compute2(Atomic.java:56)
    at water.Atomic.fork(Atomic.java:39)
    at water.Atomic.invoke(Atomic.java:31)
    at water.Lockable.unlock(Lockable.java:181)
    at water.Lockable.unlock(Lockable.java:176)
    at hex.deeplearning.DeepLearning$DeepLearningDriver.trainModel(DeepLearning.java:491)
    at hex.deeplearning.DeepLearning$DeepLearningDriver.buildModel(DeepLearning.java:311)
    at hex.deeplearning.DeepLearning$DeepLearningDriver.computeImpl(DeepLearning.java:216)
    at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:173)
    at hex.deeplearning.DeepLearning$DeepLearningDriver.compute2(DeepLearning.java:209)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1349)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

Recommended answer

You need a bigger boat.

The error message is saying "heapUsedGC=11482667352", which is higher than MEM_MAX. Instead of giving max_mem_size="12G" why not give it more of the 64GB you have? Or build a less ambitious model (fewer hidden nodes, less training data, something like that).
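As a rough sketch of that advice: with 64 GB of physical RAM, the heap can be sized well above 12G while still leaving headroom for the OS. The `suggest_max_mem` helper below is hypothetical (not part of the h2o API), and the fraction chosen is an assumption, not a recommendation from H2O:

```python
# Sketch: pick an H2O heap size that leaves headroom for the OS.
# suggest_max_mem is a hypothetical helper, not part of the h2o API;
# the 0.75 fraction is an illustrative assumption.

def suggest_max_mem(total_ram_gb, fraction=0.75):
    """Return an h2o-style max_mem_size string using a fraction of total RAM."""
    return f"{int(total_ram_gb * fraction)}G"

print(suggest_max_mem(64))  # -> "48G" on a 64 GB machine

# The init call from the question would then become (needs a working
# h2o install, so shown commented):
# import h2o
# localH2O = h2o.init(ip="localhost", port=54321,
#                     max_mem_size=suggest_max_mem(64),  # ~48G instead of 12G
#                     nthreads=4)
```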

(Obviously, ideally, h2o shouldn't be crashing, and should instead be gracefully aborting when it gets close to using all the available memory. If you are able to share your data/code with H2O, it might be worth opening a bug report on their JIRA.)

BTW, I've been running h2o 3.10.x.x as the back-end for a web server process for 9 months or so, automatically restarting it at weekends, and haven't had a single crash. Well, I did - after I left it running 3 weeks and it filled up all the memory with more and more data and models. That is why I switched it to restart weekly, and only keep in memory the models I needed. (This is on an AWS instance, 4GB of memory, by the way; restarts done by cron job and bash commands.)
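The weekly-restart pattern described above can be sketched as follows. `restart_due` is a hypothetical helper; the h2o calls are shown commented because they require a live cluster, and the model-saving steps are assumptions about what such a script would keep:

```python
# Sketch of the weekly-restart idea: decide from cluster uptime whether
# it is time to shut down and re-initialize. restart_due is a hypothetical
# helper, not part of the h2o API.

WEEK_HOURS = 7 * 24

def restart_due(uptime_hours, limit_hours=WEEK_HOURS):
    """True once the cluster has been up for at least `limit_hours`."""
    return uptime_hours >= limit_hours

print(restart_due(12))      # False: the 12-hour uptime shown above
print(restart_due(24 * 8))  # True: past the one-week limit

# In a cron-driven script, the restart itself might look roughly like:
# import h2o
# h2o.connect(ip="localhost", port=54321)
# ... h2o.save_model(m, path=...) for each model worth keeping ...
# h2o.cluster().shutdown()
# h2o.init(max_mem_size="3G")   # on a 4GB instance, leave OS headroom
# ... h2o.load_model(...) for the saved models ...
```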
