This post covers a Meteor app deployed to Digital Ocean that gets stuck at 100% CPU and crashes with out-of-memory (OOM) errors, along with the recommended fix.

Problem description

I have a Meteor (0.8.0) app deployed using Meteor Up to Digital Ocean that's been stuck at 100% CPU, only to crash with an out-of-memory error and start up again at 100% CPU. It's been stuck like this for the past 24 hours. The weird part is that nobody is using the server, and meteor.log isn't showing many clues. I've got MongoHQ with oplog for the database.

Digital Ocean specs:

1GB RAM, 30GB SSD disk, New York 2, Ubuntu 12.04.3 x64

Screenshot showing the problem:

Note that the screenshot was captured yesterday, and the CPU has stayed pegged at 100% until it crashes with out of memory. The log shows:

top shows:

26308 meteor    20   0 1573m 644m 4200 R 98.1 64.7  32:45.36 node

How it started:

I have an app that takes in a list of emails via CSV or MailChimp OAuth, sends them off to FullContact via their batch process call (http://www.fullcontact.com/developer/docs/batch/), and then updates the Meteor collections accordingly depending on the response status. A snippet from a 200 response:

if (result.statusCode === 200) {
    var data = JSON.parse(result.content);
    var rate_limit = result.headers['x-rate-limit-limit'];
    var rate_limit_remaining = result.headers['x-rate-limit-remaining'];
    var rate_limit_reset = result.headers['x-rate-limit-reset'];
    console.log(rate_limit);
    console.log(rate_limit_remaining);
    console.log(rate_limit_reset);
    _.each(data.responses, function(resp, key) {
        var email = key.split('=')[1];
        if (resp.status === 200) {
            var sel = {
                email: email,
                listId: listId
            };
            Profiles.upsert({
                email: email,
                listId: listId
            }, {
                $set: sel
            }, function(err, result) {
                if (!err) {
                    console.log("Upsert ", result);
                    fullContactSave(resp, email, listId, Meteor.userId());
                }
            });
            RawCsv.update({
                email: email,
                listId: listId
            }, {
                $set: {
                    processed: true,
                    status: 200,
                    updated_at: new Date().getTime()
                }
            }, {
                multi: true
            });
        }
    });
}
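Since the snippet above reads the x-rate-limit-* headers but only logs them, one option is to use them to throttle successive batch calls. Below is a minimal sketch; throttleDelayMs is a hypothetical helper, and the header semantics assumed here (remaining calls in the window, and seconds until it resets) should be verified against FullContact's API documentation:

```javascript
// Decide how long to wait before the next batch call based on the
// x-rate-limit-* headers. Header semantics are an assumption; check
// them against the FullContact API docs.
function throttleDelayMs(headers) {
    var remaining = parseInt(headers['x-rate-limit-remaining'], 10);
    var resetSec = parseInt(headers['x-rate-limit-reset'], 10);
    if (isNaN(remaining) || remaining > 0) {
        return 0;               // budget left in this window: call immediately
    }
    return resetSec * 1000;     // exhausted: wait until the window resets
}

console.log(throttleDelayMs({ 'x-rate-limit-remaining': '50', 'x-rate-limit-reset': '42' })); // 0
console.log(throttleDelayMs({ 'x-rate-limit-remaining': '0', 'x-rate-limit-reset': '30' }));  // 30000
```

The returned delay could then be fed to Meteor.setTimeout before firing the next batch request.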

Locally, on my wimpy Windows laptop running Vagrant, I have no performance issues whatsoever processing hundreds of thousands of emails at a time. But on Digital Ocean, it can't even handle 15,000, it seems (I've seen the CPU spike to 100% and then crash with OOM, but after it comes up it usually stabilizes... not this time). What worries me is that the server hasn't recovered at all despite little to no activity on the app. I've verified this by looking at analytics - GA shows 9 sessions total over the 24 hours, doing little more than hitting / and bouncing, and Mixpanel shows only 1 logged-in user (me) in the same timeframe. And the only thing I've done since the initial failure is check the facts package, which shows:

oplog-watchers 16
observe-handles 15
time-spent-in-QUERYING-phase 87828
time-spent-in-FETCHING-phase 82

livedata
invalidation-crossbar-listeners 16
subscriptions 11
sessions 1

Meteor APM also doesn't show anything out of the ordinary, and meteor.log doesn't show any Meteor activity aside from the OOM and restart messages. MongoHQ isn't reporting any slow-running queries or much activity - 0 queries, updates, inserts, and deletes on average from staring at their monitoring dashboard. So as far as I can tell, there hasn't been much activity for 24 hours, and certainly not anything intensive. I've since tried to install newrelic and nodetime, but neither is quite working - newrelic shows no data, and meteor.log has a nodetime debug message:

Failed to load nodetime-native extension

So when I try to use nodetime's CPU profiler it turns up blank, and the heap snapshot returns with Error: V8 tools are not loaded.

I'm basically out of ideas at this point, and since Node is pretty new to me, it feels like I'm taking wild stabs in the dark here. Please help.

Update: The server is still pegged at 100% four days later. Even an init 6 doesn't do anything - the server restarts, the node process starts and jumps right back up to 100% CPU. I tried other tools like memwatch and webkit-devtools-agent but could not get them to work with Meteor.
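A per-syscall summary like the one below can be collected by attaching strace to the running node process with the -c flag; this is presumably how the output shown here was gathered (the PID comes from the top line above):

```shell
# Attach to the node process and count syscalls instead of printing
# each one (-c); interrupt with Ctrl-C to print the summary table.
pid=6840   # the node process PID from top; substitute your own
strace -c -p "$pid"
```

Note that attaching requires root or ptrace permission on the target process.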

The following is the strace output:

Process 6840 attached - interrupt to quit
^CProcess 6840 detached

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 77.17    0.073108           1    113701           epoll_wait
 11.15    0.010559           0     80106     39908 mmap
  6.66    0.006309           0    116907           read
  2.09    0.001982           0     84445           futex
  1.49    0.001416           0     45176           write
  0.68    0.000646           0    119975           munmap
  0.58    0.000549           0    227402           clock_gettime
  0.10    0.000095           0    117617           rt_sigprocmask
  0.04    0.000040           0     30471           epoll_ctl
  0.03    0.000031           0     71428           gettimeofday
  0.00    0.000000           0        36           mprotect
  0.00    0.000000           0         4           brk
------ ----------- ----------- --------- --------- ----------------
100.00    0.094735               1007268     39908 total

So it looks like the node process spends most of its time in epoll_wait.

Recommended answer

I had a similar issue. I didn't need oplog, and it was suggested that I add the Meteor package disable-oplog. So I did, and CPU usage dropped a lot. If you are not really taking advantage of oplog, it might be better to disable it, so do meteor add disable-oplog and see what happens.
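Concretely, switching the app back to the poll-and-diff driver is a one-line package change (and it is reversible; note that Meteor only tails the oplog when MONGO_OPLOG_URL is set, so with Meteor Up leaving that variable unset has a similar effect):

```shell
# Fall back from oplog tailing to the poll-and-diff driver
meteor add disable-oplog

# Re-enable oplog tailing later if needed
meteor remove disable-oplog
```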

I hope this helps.

