I'm running DataStax with 6 nodes: 1 Solr node and 5 Spark nodes. My cluster sits on servers similar to Amazon EC2, with EBS volumes. Each node has 3 EBS volumes combined into one logical data disk using LVM. In OpsCenter, the same node frequently becomes unresponsive, which causes connection timeouts in my data system. My data volume is about 400GB with 3 replicas. I have 20 streaming jobs with a one-minute batch interval. Here are my error messages:

/var/log/cassandra/output.log:WARN 13:44:31,868 Not marking nodes down due to local pause of 53690474502 > 5000000000
/var/log/cassandra/system.log:WARN [GossipTasks:1] 2016-09-25 16:40:34,944 FailureDetector.java:258 - Not marking nodes down due to local pause of 64532052919 > 5000000000
/var/log/cassandra/system.log:WARN [GossipTasks:1] 2016-09-25 16:59:12,023 FailureDetector.java:258 - Not marking nodes down due to local pause of 66027485893 > 5000000000
/var/log/cassandra/system.log:WARN [GossipTasks:1] 2016-09-26 13:44:31,868 FailureDetector.java:258 - Not marking nodes down due to local pause of 53690474502 > 5000000000


Edit:

Here is my configuration in more detail. I'd like to know whether I'm doing something wrong, and if so, how to find out exactly what it is and how to fix it.

The heap is set to:

MAX_HEAP_SIZE="16G"
HEAP_NEWSIZE="4G"


Current heap:

[root@iZ11xsiompxZ ~]# jstat -gc 11399
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
 0.0   196608.0  0.0   196608.0 6717440.0 2015232.0 43417600.0 23029174.0 69604.0 68678.2  0.0    0.0     1041  131.437   0      0.000  131.437
[root@iZ11xsiompxZ ~]# jmap -heap 11399
Attaching to process ID 11399, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.102-b14

using thread-local object allocation.
Garbage-First (G1) GC with 23 thread(s)


Heap Configuration:

   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 51539607552 (49152.0MB)
   NewSize                  = 1363144 (1.2999954223632812MB)
   MaxNewSize               = 30920409088 (29488.0MB)
   OldSize                  = 5452592 (5.1999969482421875MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 16777216 (16.0MB)


Heap Usage:

G1 Heap:
   regions  = 3072
   capacity = 51539607552 (49152.0MB)
   used     = 29923661848 (28537.427757263184MB)
   free     = 21615945704 (20614.572242736816MB)
   58.059545404588185% used
G1 Young Generation:
Eden Space:
   regions  = 366
   capacity = 6878658560 (6560.0MB)
   used     = 6140461056 (5856.0MB)
   free     = 738197504 (704.0MB)
   89.26829268292683% used
Survivor Space:
   regions  = 12
   capacity = 201326592 (192.0MB)
   used     = 201326592 (192.0MB)
   free     = 0 (0.0MB)
   100.0% used
G1 Old Generation:
   regions  = 1443
   capacity = 44459622400 (42400.0MB)
   used     = 23581874200 (22489.427757263184MB)
   free     = 20877748200 (19910.572242736816MB)
   53.04110320109241% used

40076 interned Strings occupying 7467880 bytes.


I don't know why this is happening. Many thanks.

Best Answer

The Not marking nodes down due to local pause messages you're seeing are caused by JVM pauses. Although you've made a start by posting JVM information here, a good first step is usually to look through /var/log/cassandra/system.log, for example checking for entries like ERROR and WARN. Also check the length and frequency of GC events by grepping for GCInspector.
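
For example, a quick pass over the logs might look like this (the paths assume the default log locations shown above):

# Recent errors and warnings
grep -E 'ERROR|WARN' /var/log/cassandra/system.log | tail -50
# GCInspector lines report pause lengths; long (multi-second) and
# frequent pauses line up with the "local pause" warnings above
grep GCInspector /var/log/cassandra/system.log | tail -20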

Tools like nodetool tpstats are your friend here, to see whether you have backed-up or dropped mutations, blocked flush writers, and so on.
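
A minimal check along those lines (run on, or pointed at, each node; the grep patterns are just one way to pull out the interesting rows):

# Full thread pool and dropped-message statistics for this node
nodetool tpstats
# Non-zero Blocked counts on MemtableFlushWriter, or dropped MUTATION
# messages, usually mean the node cannot keep up with the write load
nodetool tpstats | grep -E -i 'blocked|dropped|mutation|flush'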

The docs here have some good things to check: https://docs.datastax.com/en/landing_page/doc/landing_page/troubleshooting/cassandra/cassandraTrblTOC.html

Also check that your nodes have the recommended production settings, something that is often overlooked:

http://docs.datastax.com/en/landing_page/doc/landing_page/recommendedSettingsLinux.html
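
As a sketch, a few of the Linux settings from that page that are easy to verify; this is not the complete list, and the exact values here are assumptions to be confirmed against the linked docs:

# Cassandra should run with swap disabled
swapoff -a
# Turn off NUMA zone reclaim and raise the mmap limit (Cassandra maps
# many files; 1048575 is the value the DataStax docs have recommended)
sysctl -w vm.zone_reclaim_mode=0
sysctl -w vm.max_map_count=1048575
# Inspect the resource limits (open files, memlock, nproc) of the
# running Cassandra process
cat /proc/$(pgrep -f CassandraDaemon)/limits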

One more thing to note: Cassandra is fairly sensitive to I/O, and "regular" EBS may not be fast enough for what you need here. Throw Solr into the mix as well, and when Cassandra compactions and Lucene merges hit the disk at the same time, you'll see a lot of I/O contention.
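
If you want to confirm that on your nodes, one rough way (the mapping of the LVM volume to your data directory, and the device names, are assumptions about your setup) is to watch the data disk while compactions run:

# Sustained ~100% %util with high await on the data volume during
# compaction points at EBS throughput limits
iostat -x -d 5
# See which compactions are in flight at the same time
nodetool compactionstats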
