Problem description

We are evaluating ArangoDB's performance for facet calculations. A number of other products can do the same, either via a dedicated API or a query language:

  • MarkLogic Facets
  • ElasticSearch Aggregations
  • Solr Faceting etc

We understand that Arango has no dedicated API for calculating facets explicitly. In practice none is needed: thanks to the comprehensive AQL, facets can be computed with a simple query such as:

 FOR a IN Asset
   COLLECT attr = a.attribute1 INTO g
   RETURN { value: attr, count: LENGTH(g) }

This query calculates a facet on attribute1 and yields the frequencies in the form:

[
  {
    "value": "test-attr1-1",
    "count": 2000000
  },
  {
    "value": "test-attr1-2",
    "count": 2000000
  },
  {
    "value": "test-attr1-3",
    "count": 3000000
  }
]

It says that across the entire collection attribute1 takes three values (test-attr1-1, test-attr1-2 and test-attr1-3), each with its count. In effect, we run a DISTINCT query and aggregate the counts.
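
To make the "DISTINCT plus counts" idea concrete, here is a small in-memory Java sketch of the same grouping. It only illustrates the semantics of a facet and says nothing about how ArangoDB computes it; the class name `FacetCountDemo` and the method `facet` are made up for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper, not part of any ArangoDB API.
class FacetCountDemo {

    /** Groups equal values together and counts each group (value -> frequency). */
    static Map<String, Long> facet(List<String> attributeValues) {
        return attributeValues.stream()
            .collect(Collectors.groupingBy(
                Function.identity(),    // group key: the attribute value itself
                TreeMap::new,           // sorted map, so output order is stable
                Collectors.counting()   // per-group count
            ));
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
            "test-attr1-1", "test-attr1-2", "test-attr1-1", "test-attr1-3");
        // prints {test-attr1-1=2, test-attr1-2=1, test-attr1-3=1}
        System.out.println(facet(values));
    }
}
```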

It looks simple and clean, with one really big issue: performance.

The query above runs for 31 seconds on a test collection of only 8M documents. We have experimented with different index types and storage engines (with and without RocksDB) and studied the explain plans, all to no avail. The test documents are very concise, with only three short attributes.

We would appreciate any input at this point. Either we are doing something wrong, or ArangoDB is simply not designed to perform well in this particular area.

By the way, the ultimate goal is to run something like the following in under a second:

LET docs = (
  FOR a IN Asset
    FILTER a.name LIKE 'test-asset-%'
    SORT a.name
    RETURN a
)

LET attribute1 = (
  FOR a IN docs
    COLLECT attr = a.attribute1 INTO g
    RETURN { value: attr, count: LENGTH(g[*]) }
)

LET attribute2 = (
  FOR a IN docs
    COLLECT attr = a.attribute2 INTO g
    RETURN { value: attr, count: LENGTH(g[*]) }
)

LET attribute3 = (
  FOR a IN docs
    COLLECT attr = a.attribute3 INTO g
    RETURN { value: attr, count: LENGTH(g[*]) }
)

LET attribute4 = (
  FOR a IN docs
    COLLECT attr = a.attribute4 INTO g
    RETURN { value: attr, count: LENGTH(g[*]) }
)

RETURN {
  counts: (RETURN {
    total: LENGTH(docs),
    offset: 2,
    to: 4,
    facets: {
      attribute1: {
        from: 0,
        to: 5,
        total: LENGTH(attribute1)
      },
      attribute2: {
        from: 5,
        to: 10,
        total: LENGTH(attribute2)
      },
      attribute3: {
        from: 0,
        to: 1000,
        total: LENGTH(attribute3)
      },
      attribute4: {
        from: 0,
        to: 1000,
        total: LENGTH(attribute4)
      }
    }
  }),
  items: (FOR a IN docs LIMIT 2, 4 RETURN { id: a._id, name: a.name }),
  facets: {
    attribute1: (FOR a IN attribute1 SORT a.count LIMIT 0, 5 RETURN a),
    attribute2: (FOR a IN attribute2 SORT a.value LIMIT 5, 10 RETURN a),
    attribute3: (FOR a IN attribute3 LIMIT 0, 1000 RETURN a),
    attribute4: (FOR a IN attribute4 SORT a.count, a.value LIMIT 0, 1000 RETURN a)
  }
}

Thanks!

Solution

It turns out the main discussion took place on the ArangoDB Google Group, where the full thread can be found.

Here is a summary of the current solution:

  • Run a custom build of Arango from a specific feature branch that contains a number of performance improvements (hopefully they will make it into a main release soon)
  • No indexes are required for facet calculations
  • MMFiles is the preferred storage engine
  • Write the AQL to use "COLLECT attr = a.attributeX WITH COUNT INTO length" instead of "count: length(g)"
  • Split the AQL into smaller pieces and run them in parallel (we use Java 8's Fork/Join to spread the facet AQLs and then join them into a final result)
  • One AQL filters, sorts and retrieves the main entity (if required; when sorting or filtering, add a corresponding skiplist index)
  • The rest are small AQLs, one per facet, each returning value/frequency pairs
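
The two AQL-related points in the list (WITH COUNT INTO, and one small query per facet run in parallel) could look roughly like the Java 8 sketch below. It is a sketch under assumptions, not the code we actually run: `FacetSketch`, `Bucket`, `facetQuery`, `facets` and the `runQuery` function are hypothetical names, and `runQuery` stands in for whatever driver call executes an AQL string and maps the rows. It uses `CompletableFuture` on a `ForkJoinPool`, which is one of several ways to get Fork/Join-style parallelism.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ForkJoinPool;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from the ArangoDB driver.
class FacetSketch {

    /** One facet bucket: an attribute value and the number of documents carrying it. */
    static final class Bucket {
        final String value;
        final long count;

        Bucket(String value, long count) {
            this.value = value;
            this.count = count;
        }
    }

    /**
     * Builds the small per-facet query using COLLECT ... WITH COUNT INTO.
     * Assumes collection and attribute names are trusted; real code would
     * more likely pass them as bind parameters.
     */
    static String facetQuery(String collection, String attribute) {
        return "FOR a IN " + collection
             + " COLLECT attr = a." + attribute + " WITH COUNT INTO length"
             + " RETURN { value: attr, count: length }";
    }

    /**
     * Submits one query per facet attribute to the pool, then joins the results.
     * runQuery is an assumed stand-in for the real database call.
     */
    static Map<String, List<Bucket>> facets(
            String collection,
            List<String> attributes,
            Function<String, List<Bucket>> runQuery,
            ForkJoinPool pool) {
        // Fork: start every facet query before waiting on any of them.
        Map<String, CompletableFuture<List<Bucket>>> pending = new LinkedHashMap<>();
        for (String attribute : attributes) {
            String aql = facetQuery(collection, attribute);
            pending.put(attribute,
                CompletableFuture.supplyAsync(() -> runQuery.apply(aql), pool));
        }
        // Join: collect results in the order the attributes were given.
        Map<String, List<Bucket>> result = new LinkedHashMap<>();
        pending.forEach((attribute, future) -> result.put(attribute, future.join()));
        return result;
    }
}
```

The query that filters, sorts and pages the main entities would run as one more task alongside these, and the caller would assemble the final response from all the pieces.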

In the end we gained a more than 10x performance improvement compared to the original AQL provided above.

06-26 05:53