ElasticSearch: Use a compound Tenant-ID + page-ID field?

Question

I've just started devising an ElasticSearch mapping for a multitenant web app. In this app, there are site IDs and page IDs. Page IDs are unique per site, and randomly generated. Pages can have child pages.

Which is best:

1) Use a compound key with site + page IDs? Like so:

"sitePageIdPath": "(siteID):(grandparent-page-ID).(parent-page-ID).(page-ID)"

Or:

2) Use separate fields for site ID and page IDs? Like so:

"siteId": "(siteID)",
"pageIdPath": "(grandparent-page-ID).(parent-page-ID).(page-ID)"

?
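To make the two options concrete, here is a minimal Python sketch of how each document shape could be built from a site ID and a list of page IDs running from the oldest ancestor down to the page itself. The helper names `compound_doc` and `separate_doc` and the sample IDs are made up for illustration; only the field names come from the question.

```python
from typing import Dict, List


def compound_doc(site_id: str, page_path: List[str]) -> Dict[str, str]:
    """Option 1: a single field holding "(siteID):(ancestor IDs).(page-ID)"."""
    joined_path = ".".join(page_path)
    return {"sitePageIdPath": f"{site_id}:{joined_path}"}


def separate_doc(site_id: str, page_path: List[str]) -> Dict[str, str]:
    """Option 2: the site ID and the page ID path kept in separate fields."""
    return {"siteId": site_id, "pageIdPath": ".".join(page_path)}


if __name__ == "__main__":
    # Hypothetical IDs: grandparent, parent, page.
    path = ["gp1", "p2", "c3"]
    print(compound_doc("site9", path))
    print(separate_doc("site9", path))
```

Both shapes carry the same information; the difference is only in which filters can address the site ID on its own.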

I'm thinking that if I merge the site ID and page IDs into one single field, then ElasticSearch will need to handle only that field, and this should be somewhat more performant than using two fields, both when indexing and when searching? It should also require less storage space.

However, perhaps there's some drawback that I'm not aware of? Hence this question.

Some details: 1) I'm using a single index, and I'm overallocating shards (100 shards), as suggested when one uses the "users" data flow pattern. 2) I'm specifying routing parameters explicitly in the URL (i.e. &routing=site-ID), not via any siteId field in the documents that are indexed.
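As a rough illustration of point 2, the sketch below builds a document URL that carries the site ID in the `routing` query parameter. The host, index name, and type name are placeholders that do not come from the question, and `index_url` is a hypothetical helper.

```python
from urllib.parse import quote, urlencode


def index_url(base: str, index: str, doc_type: str, doc_id: str, site_id: str) -> str:
    """Build a document URL that routes by site ID via the routing parameter."""
    # Percent-encode each path segment so characters such as ":" are safe.
    path = "/".join(quote(part, safe="") for part in (index, doc_type, doc_id))
    query = urlencode({"routing": site_id})
    return f"{base}/{path}?{query}"


if __name__ == "__main__":
    # Placeholder host, index, and type names (assumptions for this sketch).
    print(index_url("http://localhost:9200", "sites", "page", "site9:c3", "site9"))
```

Because the routing value is passed with every request, the documents themselves do not need a siteId field for routing purposes, which is the setup the question describes.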

Update after 7 hours:

1) All queries should be filtered by site ID (that is, tenant ID). If I do combine the site ID with the page ID, I suppose/hope that I can use a prefix filter to filter on site ID. I wonder if this will be as fast as filtering on a single dedicated siteId field (e.g. can the results be cached).

2) Example queries: Full text search. List all users. List all pages. List all child/successor pages of a certain page. Load a single page (via _source).

Update after 22 hours:

3) I am able to search by page ID, because as ElasticSearch's _id, I store: (site-ID):(page-ID). So it's not a problem that the page ID is otherwise "hidden" as the last element of pageIdPath.

4) I use index: not_analyzed for these ID fields.

Recommended answer

There are performance issues when indexing and searching if you use 1 field. I think you're mistaken in thinking 1 field would speed things up.

If using 1 field you have basically 2 mapping choices:

  1. If you use the default mappings, the string (siteID):(grandparent-page-ID).(parent-page-ID).(page-ID) will get broken up by the analyzer into the tokens (siteID) (grandparent-page-ID) (parent-page-ID) (page-ID). Now your IDs are like a bag of words, and either a term or prefix filter might find a match from the pageID when you meant for it to match the siteID.

  2. If you set your own analyzer (and I would like to know if you can think of a good way of doing this), the first one that comes to mind is the keyword (or not_analyzed) analyzer. This will keep the string as one token so you don't lose the context. However, now you take a big performance hit when using a prefix filter. Imagine I index the string "123.456.789" as one token (siteID.parentpageID.pageID). I want to filter by siteID = 123 and so I use a prefix filter. This prefix filter is actually expanded into a bool query of hundreds of terms all ORed together (123 or 1231 or 1232 or 1233 etc...), which is a massive waste of computing power when you could just structure your data better.
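The "bag of words" problem from the first choice can be sketched in a few lines of Python. The tokenizer below is a deliberately simplified stand-in that splits on every non-alphanumeric character; Lucene's real analyzers follow different rules, so treat this only as an illustration of the idea. The IDs and helper names are hypothetical.

```python
import re
from typing import List


def simple_tokens(text: str) -> List[str]:
    """Split on every non-alphanumeric character (a stand-in, not Lucene's analyzer)."""
    return [token for token in re.split(r"[^0-9A-Za-z]+", text.lower()) if token]


def term_matches(field_value: str, term: str, analyzed: bool) -> bool:
    """Mimic a term filter: an exact match against the indexed tokens."""
    tokens = simple_tokens(field_value) if analyzed else [field_value]
    return term in tokens


if __name__ == "__main__":
    # A page that happens to have the ID "abc", stored under site "xyz".
    value = "xyz:gp1.p2.abc"
    # Analyzed: a filter meant for site "abc" wrongly matches the page ID.
    print(term_matches(value, "abc", analyzed=True))
    # Kept as one token: no false match, but site filtering now needs a prefix.
    print(term_matches(value, "abc", analyzed=False))
```

With the field kept as a single token, the term filter no longer confuses page IDs with site IDs, but it also cannot select a site by itself, which is why the second choice ends up depending on a prefix filter.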

I urge you to read more about Lucene's PrefixQuery and how it works.
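As a rough mental model only (this is not Lucene's actual implementation), a prefix lookup can be pictured as a scan over a sorted list of terms that collects every term starting with the prefix. The sketch below uses made-up compound IDs of the form site.parent.page and a hypothetical helper `terms_with_prefix`.

```python
from bisect import bisect_left
from typing import List


def terms_with_prefix(sorted_terms: List[str], prefix: str) -> List[str]:
    """Return every term in a sorted term list that starts with the prefix."""
    start = bisect_left(sorted_terms, prefix)
    matched = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match either
        matched.append(term)
    return matched


if __name__ == "__main__":
    # Three hypothetical sites, each with 3 parent pages and 4 child pages.
    terms = sorted(
        f"{site}.{parent}.{page}"
        for site in ("123", "1234", "124")
        for parent in range(3)
        for page in range(4)
    )
    print(len(terms_with_prefix(terms, "123")))   # also picks up site "1234"
    print(len(terms_with_prefix(terms, "123.")))  # only site "123"
```

In this model the number of terms a prefix touches grows with the number of pages on the site, and the prefix has to include the separator to avoid matching other sites, whereas a dedicated site ID field needs a single term lookup.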

If I were you I would do this.

"properties": {
  "site_id": {
    "type": "string",
    "index": "not_analyzed" //keyword would also work here, they are basically the same
  },
  "parent_page_id": {
    "type": "string",
    "index": "not_analyzed"
  },
  "page_id": {
    "type": "string",
    "index": "not_analyzed"
  },
  "page_content": {
    "type": "string",
    "analyzer": "standard" //you may want to use snowball to enable stemming
  }
}
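A document matching this mapping could be assembled along the following lines. This is a sketch under assumptions: `page_doc` is a hypothetical helper, the page path is assumed to run from the oldest ancestor to the page itself, and omitting parent_page_id for top-level pages is a choice made here, not something the answer specifies.

```python
import json
from typing import Dict, List


def page_doc(site_id: str, page_path: List[str], content: str) -> Dict[str, str]:
    """Build a document with the field names used in the mapping above."""
    if not page_path:
        raise ValueError("page_path must contain at least the page's own ID")
    doc = {
        "site_id": site_id,
        "page_id": page_path[-1],
        "page_content": content,
    }
    # Assumption: top-level pages simply have no parent_page_id field.
    if len(page_path) > 1:
        doc["parent_page_id"] = page_path[-2]
    return doc


if __name__ == "__main__":
    print(json.dumps(page_doc("123", ["456", "789"], "An elasticsearch tutorial"), indent=2))
```

Note that this shape keeps only the immediate parent, so it supports listing direct children but not all successors of a page in a single term filter.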

Queries

Text search for "elasticsearch tutorial" under siteID "123"

"filtered": {
  "query": {
    "match": {
      "page_content": "elasticsearch tutorial"
    }
  },
  "filter": {
    "term": {
      "site_id": "123"
    }
  }
}

All child pages of page "456" under site "123"

"filtered": {
  "query": {
    "match_all": {}
  },
  "filter": {
    "and": [
      {
        "term": {
          "site_id": "123"
        }
      },
      {
        "term": {
          "parent_page_id": "456"
        }
      }
    ]
  }
}
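If the request bodies are built in application code, the two queries above could be generated by small helpers like these. The function names are hypothetical, and wrapping the `filtered` clause in a top-level `query` key is an assumption about how the fragments above are meant to be sent as a search body.

```python
import json
from typing import Any, Dict


def text_search_body(site_id: str, text: str) -> Dict[str, Any]:
    """Full text search on page_content, restricted to one site."""
    return {
        "query": {
            "filtered": {
                "query": {"match": {"page_content": text}},
                "filter": {"term": {"site_id": site_id}},
            }
        }
    }


def child_pages_body(site_id: str, parent_page_id: str) -> Dict[str, Any]:
    """All direct child pages of one page, restricted to one site."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {
                    "and": [
                        {"term": {"site_id": site_id}},
                        {"term": {"parent_page_id": parent_page_id}},
                    ]
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(text_search_body("123", "elasticsearch tutorial"), indent=2))
    print(json.dumps(child_pages_body("123", "456"), indent=2))
```

Building the bodies as dictionaries and serializing them with `json.dumps` avoids hand-written bracket mistakes in nested filters.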


09-05 21:58