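Problem description:
The health status of the production Elasticsearch cluster turned red, and some shards could not be assigned. The relevant part of the allocation explain output is excerpted below: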
"note": "No shard was specified in the explain API request, so this response explains a randomly chosen unassigned shard. There may be other unassigned shards in this cluster which cannot be assigned for different reasons. It may not be possible to assign this shard until one of the other shards is assigned correctly. To explain the allocation of other shards (whether assigned or unassigned) you must specify the target shard in the request to this API.",
"index": "demo-2022.02.06",
"shard": 3,
"primary": true,
"current_state": "unassigned",
"unassigned_info": {
  "reason": "CLUSTER_RECOVERED",
  "at": "2023-05-29T08:08:22.697Z",
  "last_allocation_status": "no_valid_shard_copy"
},
"can_allocate": "no_valid_shard_copy",
"allocate_explanation": "cannot allocate because all found copies of the shard are either stale or corrupt",
...
"store": {
  "in_sync": true,
  "allocation_id": "82iRvG0KTTm9NT_5Fx8BRA",
  "store_exception": {
    "type": "corrupt_index_exception",
    "reason": "failed engine (reason: [corrupt file (source: [start])]) (resource=preexisting_corruption)",
    "caused_by": {
      "type": "i_o_exception",
      "reason": "failed engine (reason: [corrupt file (source: [start])])",
      "caused_by": {
        "type": "corrupt_index_exception",
        "reason": "checksum passed (d87020fd). possibly transient resource issue, or a Lucene or JVM bug (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path=\"/data/es/data/nodes/0/indices/dzcoAoZjSzGus0qj1sKTFg/3/index/segments_6\")))"
      }
    }
  }
}
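When the explain response is long, the nested caused_by chain can be walked programmatically to reach the innermost exception and the affected file. The following is a minimal Python sketch, not part of the original procedure. The helper names (explain_request_body, root_cause, corrupt_path) are hypothetical, and the sketch assumes the "store" object has the shape shown in the excerpt above. The request body targets one specific shard, as the "note" field recommends.

```python
import re

def explain_request_body(index, shard, primary=True):
    """Body for GET /_cluster/allocation/explain that targets one specific shard."""
    return {"index": index, "shard": shard, "primary": primary}

def root_cause(store):
    """Follow store_exception -> caused_by -> ... and return the innermost exception dict."""
    exc = store.get("store_exception")
    while exc and "caused_by" in exc:
        exc = exc["caused_by"]
    return exc

def corrupt_path(exc):
    """Extract the file path from a 'path="..."' fragment in the reason text, if any."""
    match = re.search(r'path="([^"]+)"', exc.get("reason", "")) if exc else None
    return match.group(1) if match else None

# The "store" object from the excerpt above, as a Python dict.
SAMPLE_STORE = {
    "in_sync": True,
    "allocation_id": "82iRvG0KTTm9NT_5Fx8BRA",
    "store_exception": {
        "type": "corrupt_index_exception",
        "reason": "failed engine (reason: [corrupt file (source: [start])]) (resource=preexisting_corruption)",
        "caused_by": {
            "type": "i_o_exception",
            "reason": "failed engine (reason: [corrupt file (source: [start])])",
            "caused_by": {
                "type": "corrupt_index_exception",
                "reason": 'checksum passed (d87020fd). possibly transient resource issue, or a Lucene or JVM bug (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/data/es/data/nodes/0/indices/dzcoAoZjSzGus0qj1sKTFg/3/index/segments_6")))',
            },
        },
    },
}

if __name__ == "__main__":
    print(explain_request_body("demo-2022.02.06", 3))
    print(corrupt_path(root_cause(SAMPLE_STORE)))
```

Here the innermost exception points at the segments_6 file of shard 3, which is the file to inspect on the node's data path.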
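Solution:
- Step 1: Check the shard stores
GET /_shard_stores?pretty returns the details of the corrupted shard copies, which are needed for the repair. [Figure: output of GET /_shard_stores?pretty]
- Step 2: Reroute the index
Substitute the index, shard number, and node name reported in step 1. allocate_empty_primary creates an empty primary, so any data in that shard is lost.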
POST /_cluster/reroute?master_timeout=5m
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "demo-2023.04.04",
        "shard": 2,
        "node": "{nodename}",
        "accept_data_loss": true
      }
    }
  ]
}
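The two steps can be tied together so that the reroute body is generated from what step 1 reports instead of being typed by hand. A minimal Python sketch follows; it is an illustration, not the original author's tooling. The function names are hypothetical. The response layout (indices, then index name, then shards, then shard id, then stores, each store optionally carrying a store_exception) is assumed from the documented _shard_stores format and should be checked against the cluster's actual output. The sample response is a trimmed, made-up stand-in, and the node name is left to the caller, like {nodename} above.

```python
import json

def corrupted_shards(shard_stores):
    """Return sorted (index, shard) pairs where every reported copy has a store_exception."""
    found = []
    for index, info in shard_stores.get("indices", {}).items():
        for shard_id, shard in info.get("shards", {}).items():
            stores = shard.get("stores", [])
            # Only shards with no healthy copy at all are candidates for an empty primary.
            if stores and all("store_exception" in s for s in stores):
                found.append((index, int(shard_id)))
    return sorted(found)

def reroute_body(index, shard, node):
    """Build the POST /_cluster/reroute body for allocate_empty_primary.

    Allocating an empty primary discards whatever data the shard held.
    """
    return {
        "commands": [
            {
                "allocate_empty_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }

# Hypothetical, trimmed _shard_stores response (assumed shape, not real cluster output).
sample = {
    "indices": {
        "demo-2023.04.04": {
            "shards": {
                "2": {
                    "stores": [
                        {
                            "allocation": "unused",
                            "store_exception": {"type": "corrupt_index_exception"},
                        }
                    ]
                }
            }
        }
    }
}

if __name__ == "__main__":
    for index, shard in corrupted_shards(sample):
        # "{nodename}" is the same placeholder used in the request above.
        print(json.dumps(reroute_body(index, shard, "{nodename}"), indent=2))
```

Requiring that every copy carries a store_exception is a deliberate choice in this sketch: if any healthy copy exists, allocating an empty primary would throw away data that could still be recovered.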