Elasticsearch index sharding explanation

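I'm trying to figure out the concept of an Elasticsearch index and I don't quite understand it. I'd like to make a few points up front. I understand how an inverted index works (mapping terms to document IDs), and I also understand how document ranking based on TF-IDF works. What I don't understand is the data structure of the actual index. The Elasticsearch documentation describes the index as a "table with mappings to the documents". So, here comes sharding! When you look at a typical picture of an Elasticsearch index, it is represented as follows: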

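The picture doesn't show how the actual partitioning happens, or how this [table -> document] link is split across multiple shards. For instance, does each shard split the table vertically, meaning that the inverted index table on a shard contains only the terms present on that shard? For example, let's assume we have 3 shards: the first shard contains document1, the second shard contains only document2, and the third shard contains document3. Now, would the first shard's index contain only the terms present in document1, in this case [blue, bright, butterfly, breeze, hangs]? If so, what if someone searches for [forget]? How does Elasticsearch "know" not to search shard 1, or does it search all shards every time?
When you look at the cluster image: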

It's not clear what exactly is in shard1, shard2, and shard3. We go from Term -> DocumentId -> Document to a "rectangle" shard, but what exactly does the shard contain?

I'd appreciate it if someone could explain this using the pictures above.

Best answer

Theory

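Elasticsearch is built on top of Lucene. Every shard is simply a Lucene index, and a Lucene index, simplified, is an inverted index. Every Elasticsearch index is a set of shards, that is, a set of Lucene indices. Each shard's inverted index therefore covers only the documents stored on that shard. When you query for a document, Elasticsearch runs a sub-query against all shards, merges the results, and returns them to you. When you index a document, Elasticsearch calculates which shard the document should be written to using the formula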

shard = hash(routing) % number_of_primary_shards

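By default, Elasticsearch uses the document id as the routing value. If you specify the routing parameter, it is used instead of the id. You can use the routing parameter in search queries and in requests that index, delete, or update documents.
By default, MurmurHash3 is used as the hash function.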
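
To make the formula concrete, here is a minimal Python sketch of the routing step. It is an illustration only: `pick_shard` and `routing_value` are hypothetical helper names, and `zlib.crc32` is a stand-in hash that I picked because it ships with Python. Elasticsearch uses Murmur3, so the shard numbers this sketch produces will not match a real cluster.

```python
import zlib
from typing import Optional


def routing_value(doc_id: str, routing: Optional[str] = None) -> str:
    """Return the explicit routing value if one was given, else the document id."""
    return routing if routing is not None else doc_id


def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    """shard = hash(routing) % number_of_primary_shards

    zlib.crc32 is only a stand-in for the real hash function (assumption
    made for this sketch), so results differ from a real cluster.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


if __name__ == "__main__":
    for doc_id in ["1", "2", "3"]:
        shard = pick_shard(routing_value(doc_id), 3)
        print(f"document {doc_id} -> shard {shard}")
    # An explicit routing value replaces the id in the calculation.
    print(pick_shard(routing_value("1", routing="user42"), 3))
```

Because the result depends on number_of_primary_shards, the same routing value can land on a different shard if that number changes, which is why the calculation only works while the primary shard count stays fixed.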

Example

Create an index with 3 shards
$ curl -XPUT localhost:9200/so -H 'Content-Type: application/json' -d '
{
  "settings" : {
    "index" : {
      "number_of_shards" : 3,
      "number_of_replicas" : 0
    }
  }
}'

Index a document
$ curl -XPUT localhost:9200/so/question/1 -H 'Content-Type: application/json' -d '
{
  "number" : 47011047,
  "title" : "need elasticsearch index sharding explanation"
}'

Query without routing
$ curl "localhost:9200/so/question/_search?pretty"
Response

Look at _shards.total: this is the number of shards that were queried. Also note that the document was found.
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "so",
        "_type" : "question",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "number" : 47011047,
          "title" : "need elasticsearch index sharding explanation"
        }
      }
    ]
  }
}
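
To tie this back to the question, here is a toy Python model of what each shard holds and how an unrouted search works. It is a sketch under assumptions, not how Lucene is implemented: `Shard`, `ToyIndex`, and `pick_shard` are hypothetical names, whitespace splitting stands in for real text analysis, and `zlib.crc32` again stands in for the real hash function.

```python
import zlib
from collections import defaultdict


def pick_shard(routing, num_shards):
    # Stand-in hash (assumption); a real cluster would place documents differently.
    return zlib.crc32(routing.encode("utf-8")) % num_shards


class Shard:
    """One shard: its own documents plus an inverted index over only those documents."""

    def __init__(self):
        self.docs = {}
        self.inverted = defaultdict(set)  # term -> ids of local documents

    def index(self, doc_id, text):
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.inverted[term].add(doc_id)

    def search(self, term):
        return sorted(self.inverted.get(term.lower(), ()))


class ToyIndex:
    """An index is just a list of shards."""

    def __init__(self, num_shards=3):
        self.shards = [Shard() for _ in range(num_shards)]

    def index(self, doc_id, text):
        shard = self.shards[pick_shard(doc_id, len(self.shards))]
        shard.index(doc_id, text)

    def search(self, term):
        # No routing given: ask every shard, then merge the partial results.
        hits = []
        for shard in self.shards:
            hits.extend(shard.search(term))
        return sorted(hits)


if __name__ == "__main__":
    idx = ToyIndex(num_shards=3)
    idx.index("1", "need elasticsearch index sharding explanation")
    print(idx.search("sharding"))
    print(idx.search("forget"))
```

In this model a shard never knows about terms that appear only in other shards' documents, so a search for a term like [forget] has to visit every shard and simply gets an empty partial result from the ones that don't have it.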

Query with correct routing
$ curl "localhost:9200/so/question/_search?explain=true&routing=1&pretty"
Response
_shards.total is now 1, because we specified the routing and Elasticsearch knows which shard to query for the document. With the parameter explain=true, I ask Elasticsearch to return additional information about the query. Note hits._shard: it is set to [so][2]. This means our document is stored in shard 2 of the so index (shard numbers start at 0).
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_shard" : "[so][2]",
        "_node" : "2skA6yiPSVOInMX0ZsD91Q",
        "_index" : "so",
        "_type" : "question",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "number" : 47011047,
          "title" : "need elasticsearch index sharding explanation"
        },
        ...
      }

Query with incorrect routing
$ curl "localhost:9200/so/question/_search?explain=true&routing=2&pretty"
Response

_shards.total is 1 again. However, Elasticsearch returns no hits, because we specified the wrong routing and Elasticsearch queried a shard that does not contain the document.
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}
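
The three queries above can be imitated with a small Python sketch. Everything in it is hypothetical and simplified: `search`, `build_demo_shards`, and `pick_shard` are made-up helpers, each shard is just a dict from term to document ids, and `zlib.crc32` stands in for the real hash, so the concrete shard numbers will not match the responses shown above.

```python
import zlib


def pick_shard(routing, num_shards):
    # Stand-in hash (assumption); not the function Elasticsearch uses.
    return zlib.crc32(routing.encode("utf-8")) % num_shards


def build_demo_shards(num_shards=3):
    """Index one document with id "1"; return the shards and the shard it landed on."""
    shards = [{} for _ in range(num_shards)]  # each shard: term -> set of doc ids
    home = pick_shard("1", num_shards)
    for term in "need elasticsearch index sharding explanation".split():
        shards[home].setdefault(term, set()).add("1")
    return shards, home


def search(shards, term, routing=None):
    """Query every shard, or only the shard the routing value hashes to."""
    if routing is None:
        targets = list(range(len(shards)))
    else:
        targets = [pick_shard(routing, len(shards))]
    hits = sorted(doc_id for n in targets for doc_id in shards[n].get(term, set()))
    return {"_shards": {"total": len(targets)}, "hits": hits}


if __name__ == "__main__":
    shards, home = build_demo_shards()
    # Find some routing value that hashes to a different shard than the document's.
    wrong = next(r for r in map(str, range(2, 100)) if pick_shard(r, 3) != home)
    print(search(shards, "sharding"))                 # all shards queried
    print(search(shards, "sharding", routing="1"))    # one shard, document found
    print(search(shards, "sharding", routing=wrong))  # one shard, nothing found
```

The point of the sketch is that routing only narrows which shards are asked. It never checks whether the document actually lives there, so a wrong routing value silently produces an empty result.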

Additional information
An excellent explanation of the Lucene internals from Adrien Grand
A Dive into the Elasticsearch Storage by Njal Karevoll
Routing a Document to a Shard
A similar question about Elasticsearch index sharding can be found on Stack Overflow: https://stackoverflow.com/questions/47003336/