Notes on Upgrading Elasticsearch 5.1.1 to 6.7.2 (2)
Continued from: Notes on Upgrading Elasticsearch 5.1.1 to 6.7.2 (1)
2 Handling Problems During the Upgrade
2.1 Updating the Configuration Files
Picking up from the previous post: startup had failed, so I went back over the installation output, where a few warnings caught my attention:
Updating / installing...
1:elasticsearch-0:6.7.2-1
warning: /etc/elasticsearch/elasticsearch.yml created as /etc/elasticsearch/elasticsearch.yml.rpmnew
warning: /etc/sysconfig/elasticsearch created as /etc/sysconfig/elasticsearch.rpmnew
warning: /usr/lib/systemd/system/elasticsearch.service created as /usr/lib/systemd/system/elasticsearch.service.rpmnew
Clearly, since the configuration files already existed, the Elasticsearch installer did not overwrite them; it saved the new versions alongside them with a .rpmnew suffix instead. So we need to merge the Elasticsearch configuration files by hand. This is standard RPM upgrade behavior that anyone doing an upgrade has probably run into before, so I won't go into detail.
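A quick way to review the pending merges is to diff each file against its .rpmnew counterpart. The sketch below simulates this in a temporary directory so it is safe to run anywhere; on the real host you would diff the files under /etc/elasticsearch and /etc/sysconfig listed in the warnings above (the sample config lines here are made up for illustration):

```shell
# Simulated in a temp dir; on the real host, diff e.g.
# /etc/elasticsearch/elasticsearch.yml against elasticsearch.yml.rpmnew.
tmp=$(mktemp -d)
printf 'cluster.name: my-cluster\nbootstrap.memory_lock: true\n' > "$tmp/elasticsearch.yml"
printf 'cluster.name: changeme\n' > "$tmp/elasticsearch.yml.rpmnew"

# Show what differs, then merge by hand (vimdiff works well for this).
changes=$(diff -u "$tmp/elasticsearch.yml" "$tmp/elasticsearch.yml.rpmnew" || true)
echo "$changes"

# On the real host, list every pending .rpmnew file like this:
# find /etc -name '*.rpmnew'
rm -rf "$tmp"
```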
2.2 Fixing elasticsearch.keystore
With the configuration files merged, you would think it could start now, right? Wrong! It promptly failed again:
[root@LPT0268 elasticsearch]# service elasticsearch start
Starting elasticsearch (via systemctl): [ OK ]
[yuliangwang@LPT0268 ~]$ systemctl status elasticsearch.service
● elasticsearch.service - Elasticsearch
Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Fri 2019-06-28 16:51:37 CST; 4s ago
Docs: http://www.elastic.co
Process: 11905 ExecStart=/usr/share/elasticsearch/bin/elasticsearch -p ${PID_DIR}/elasticsearch.pid --quiet -Edefault.path.logs=${LOG_DIR} -Edefault.path.data=${DATA_DIR} -Edefault.path.conf=${CONF_DIR} (code=exited, status=1/FAILURE)
Process: 13624 ExecStartPre=/usr/share/elasticsearch/bin/elasticsearch-systemd-pre-exec (code=exited, status=203/EXEC)
Main PID: 11905 (code=exited, status=1/FAILURE)
As you can see, the error had not gone away, and there were no logs at all under /var/log/elasticsearch/. I struggled with this for quite a while; then, on a whim, I went into $ES_HOME/bin and ran the elasticsearch script directly, and at last got an error message:
Jul 01 10:18:06 LPT0268 elasticsearch[1345]: Exception in thread "main" org.elasticsearch.bootstrap.BootstrapException: java.io.EOFException: read past EOF: SimpleFSIndexInput(path="/etc/ela...rch.keystore")
Jul 01 10:18:06 LPT0268 elasticsearch[1345]: Likely root cause: java.io.EOFException: read past EOF: SimpleFSIndexInput(path="/etc/elasticsearch/elasticsearch.keystore")
Jul 01 10:18:06 LPT0268 elasticsearch[1345]: at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:336)
Jul 01 10:18:06 LPT0268 elasticsearch[1345]: at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:54)
Jul 01 10:18:06 LPT0268 elasticsearch[1345]: at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
Jul 01 10:18:06 LPT0268 elasticsearch[1345]: at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
Jul 01 10:18:06 LPT0268 elasticsearch[1345]: at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:194)
Jul 01 10:18:06 LPT0268 systemd[1]: elasticsearch.service: main process exited, code=exited, status=1/FAILURE
Jul 01 10:18:06 LPT0268 systemd[1]: Unit elasticsearch.service entered failed state.
Jul 01 10:18:06 LPT0268 systemd[1]: elasticsearch.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
So the culprit was the elasticsearch.keystore file we had created by hand earlier. Checking the documentation: elasticsearch.keystore must not be created manually, because it is Elasticsearch's built-in store for secure settings and has to be created with the bundled tool:
sudo bin/elasticsearch-keystore create
Official documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/secure-settings.html
The official docs also explain why editing the configuration file had no effect on this error: Elasticsearch loads elasticsearch.keystore first, before it reads the configuration file.
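Putting the fix together: delete the hand-made keystore and let the bundled tool create a valid one. The paths below assume the RPM layout from this post; the existence check is only there so the sketch is safe to run on a machine without Elasticsearch installed:

```shell
# Sketch of the keystore fix; paths assume the RPM install from this post.
ES_BIN=/usr/share/elasticsearch/bin
if [ -x "$ES_BIN/elasticsearch-keystore" ]; then
    # Remove the hand-created (invalid) keystore...
    sudo rm -f /etc/elasticsearch/elasticsearch.keystore
    # ...and let the official tool create a proper one.
    sudo "$ES_BIN/elasticsearch-keystore" create
    result="keystore recreated"
else
    result="elasticsearch-keystore not found on this machine"
fi
echo "$result"
```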
2.3 Fixing bootstrap.memory_lock
I started ES again, and it still failed (by this point I was close to a breakdown; it took a whole weekend of fixing before I recovered):
[yuliangwang@LPT0268 bin]$ systemctl status elasticsearch
● elasticsearch.service - Elasticsearch
Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/elasticsearch.service.d
└─override.conf
Active: failed (Result: exit-code) since Mon 2019-07-01 11:04:32 CST; 20s ago
Docs: http://www.elastic.co
Process: 4898 ExecStart=/usr/share/elasticsearch/bin/elasticsearch -p ${PID_DIR}/elasticsearch.pid --quiet (code=exited, status=78)
Main PID: 4898 (code=exited, status=78)
The good news is that we finally have logs. Opening /var/log/elasticsearch/, here is the problematic part:
[2019-07-01T10:54:12,406][WARN ][o.e.b.JNANatives ] [unknown] Unable to lock JVM Memory: error=12, reason=Cannot allocate memory
[2019-07-01T10:54:12,409][WARN ][o.e.b.JNANatives ] [unknown] This can result in part of the JVM being swapped out.
[2019-07-01T10:54:12,409][WARN ][o.e.b.JNANatives ] [unknown] Increase RLIMIT_MEMLOCK, soft limit: 65536, hard limit: 65536
[2019-07-01T10:54:12,409][WARN ][o.e.b.JNANatives ] [unknown] These can be adjusted by modifying /etc/security/limits.conf, for example:
# allow user 'elasticsearch' mlockall
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited
[2019-07-01T10:54:12,409][WARN ][o.e.b.JNANatives ] [unknown] If you are logged in interactively, you will have to re-login for the new limits to take effect.
......
[2019-07-01T10:54:21,051][ERROR][o.e.b.Bootstrap ] [G1bC4Hf] node validation exception
[1] bootstrap checks failed
[1]: memory locking requested for elasticsearch process but memory is not locked
As you can see, this is about memory locking. It happens because we had added the setting that disables swapping to the configuration:
bootstrap.memory_lock: true
Locking the memory stops the OS from swapping it out to disk, which, according to the official docs, prevents very slow GC pauses:
https://www.elastic.co/guide/en/elasticsearch/reference/6.7/setup-configuration-memory.html
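You can check the current locked-memory limit from a shell with ulimit, and, once the node is up, confirm that the lock actually took effect via the _nodes API (the localhost:9200 address below is an assumption; adjust it for your node):

```shell
# Current max locked memory for this shell. "unlimited" is what ES needs;
# a value of 65536 would match the soft/hard limit in the warning above.
memlock_limit=$(ulimit -l)
echo "memlock limit: $memlock_limit"

# Once the node is running, this should report "mlockall" : true
# (commented out here because it needs a live node on localhost:9200):
# curl -s 'localhost:9200/_nodes?filter_path=**.mlockall&pretty'
```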
Following the docs, configure /etc/systemd/system/elasticsearch.service.d/override.conf with the following:
[Service]
LimitMEMLOCK=infinity
Then reload systemd:
sudo systemctl daemon-reload
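For reference, the whole drop-in step can be scripted. The sketch below writes the file into a temporary directory so it is safe to run anywhere; on the real host the target is /etc/systemd/system/elasticsearch.service.d/override.conf (written as root), followed by the daemon-reload above and a service restart:

```shell
# Demo writes to a temp dir; on the real host use
# /etc/systemd/system/elasticsearch.service.d/override.conf (as root), then:
#   sudo systemctl daemon-reload && sudo systemctl restart elasticsearch
dropin_dir=$(mktemp -d)   # stands in for elasticsearch.service.d
printf '[Service]\nLimitMEMLOCK=infinity\n' > "$dropin_dir/override.conf"
cat "$dropin_dir/override.conf"
```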
After one more start it finally succeeded:
[root@LPT0268 elasticsearch]# sudo service elasticsearch restart
Restarting elasticsearch (via systemctl): [ OK ]
[root@LPT0268 elasticsearch]# systemctl status elasticsearch
● elasticsearch.service - Elasticsearch
Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/elasticsearch.service.d
└─override.conf
Active: active (running) since Mon 2019-07-01 13:57:51 CST; 14s ago
Docs: http://www.elastic.co
Main PID: 15294 (java)
CGroup: /system.slice/elasticsearch.service
├─15294 /bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negat...
└─15375 /usr/share/elasticsearch/modules/x-pack-ml/platform/linux-x86_64/bin/controller
Port 9200 now reports version 6.7.2:
{
"name": "G1bC4Hf",
"cluster_name": "psylocke-fws-oy",
"cluster_uuid": "PDI23Ik4TAGx10mMocqGLQ",
"version": {
"number": "6.7.2",
"build_flavor": "default",
"build_type": "rpm",
"build_hash": "56c6e48",
"build_date": "2019-04-29T09:05:50.290371Z",
"build_snapshot": false,
"lucene_version": "7.7.0",
"minimum_wire_compatibility_version": "5.6.0",
"minimum_index_compatibility_version": "5.0.0"
},
"tagline": "You Know, for Search"
}
3 Restoring the Cluster
After upgrading each machine in the cluster in turn, start the cluster and check its state with GET _cat/health. At this point the cluster was red; GET _cat/shards showed that the primary shards had all started but the replica shards were still unassigned. Re-enable cluster routing with the following command:
PUT _cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": "all"
}
}
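The same settings call can be issued with curl from any node. In the sketch below, localhost:9200 is an assumed address, and the reachability check only guards against running this where no node is listening:

```shell
# curl equivalent of the PUT _cluster/settings call above.
# localhost:9200 is an assumed address; change it for your cluster.
ES_URL=localhost:9200
if curl -s --max-time 2 "$ES_URL" > /dev/null 2>&1; then
    curl -s -X PUT "$ES_URL/_cluster/settings" \
      -H 'Content-Type: application/json' \
      -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'
    status="settings request sent"
else
    status="no node reachable at $ES_URL"
fi
echo "$status"
```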
The cluster turns yellow and begins to recover:
1561970784 08:46:24 psylocke-fws-oy yellow 1 1 17 17 0 0 15 0 - 53.1%
Once the cluster has fully recovered, the upgrade is done!