MarkLogic query optimization based on profiler results

Question

Hi MarkLoggers out there,

I have another question for you! I have a collection of documents containing postal code information: 400,000 docs. There is one postal code per doc, and each doc contains 400 features, organized into categories and variables like so:

<postcode id="9728" xmlns="http://www.nvsp.nl/p4">
<meta-data>
<!--
Generated by DIKW for NetwerkVSP ST!P
-->
<version>0.3</version>
<dateCreated>2014-06-28+02:00</dateCreated>
</meta-data>
<category name="Oplages">
<variable name="Oplage" updated="2014-08-12+02:00">
  <segment name="Bruto">1234</segment>
  <segment name="Stickers">234</segment>
  <segment name="Netto">1000</segment>
  <segment name="Aktief">J</segment>
</variable>
</category>
<category name="Automotive">
<variable name="Leaseauto">
<segment name="Leaseauto">2.68822210725987</segment>
</variable>
<variable name="Autotype">
<segment name="De Oudere Stadsrijder">4.61734781858941</segment>
<segment name="De Dure Tweedehandsrijder">6.02534919813761</segment>
<segment name="De Autoloze">41.187790998448</segment>
<segment name="De Leasende Veelrijder">0.608035868253147</segment>
<segment name="De Modale Middenklasser">13.1996896016555</segment>
<segment name="De Vermogende Autoliefhebber">4.45283669598206</segment>
<segment name="De Vermogende Kilometervreter">2.07690981203656</segment>
<segment name="De Doelmatige Budgetrijder">17.2048629073978</segment>
<segment name="De Doorsnee Nieuw Kopende Automob">10.1595102603897</segment>
</variable>
...
400 more cat/var/segment element
...
</postcode>

I need to find a subset of docs based on the id attribute of the postcode element and return only specific elements.

The elements to return are in category Oplages, variable Oplage, and I need the segments Bruto and Netto.

We now have a REST API extension that does this, but it is not fast enough.

Example query:

xquery version "1.0-ml";
declare namespace html = "http://www.w3.org/1999/xhtml";
declare namespace p4ns       = "http://www.nvsp.nl/p4";
declare namespace wijkns     = "http://www.nvsp.nl/wijk";

let $segment := "Bruto"

let $zoeker0 := cts:search(fn:doc(), cts:element-attribute-range-query(xs:QName("p4ns:postcode"), xs:QName("id"), "=", ("2311","2312","2313")))
let $zoeker1 := cts:search(/p4ns:postcode, cts:element-attribute-range-query(xs:QName("p4ns:postcode"), xs:QName("id"), "=", ("2311","2312","2313")))
let $zoeker2 := cts:search(/p4ns:postcode, cts:element-attribute-value-query(xs:QName("p4ns:postcode"), xs:QName("id"), ("2311","2312","2313")))

let $inhoud1 := $zoeker0//p4ns:segment[@name=$segment]
let $inhoud2 := $zoeker1//p4ns:segment[@name=$segment]/text()

let $r1 := cts:search(/p4ns:postcode, cts:element-attribute-range-query(xs:QName("p4ns:segment"), xs:QName("name"), "=", $segment))

return $inhoud2

Now if I profile this test query, the slow part is looking up the "Bruto" segment in the docs returned by the cts:search. I know I should avoid looking up elements in docs via XPath, but I do not know how to combine the two parts so that they hit only indexes...

Profiler results:

Module:Line:Col    Count    Shallow %    Shallow µs    Deep %    Deep µs    Expression
.main:13:44 1446    27  7127    30  7938    @name = "Bruto"
.main:12:44 1446    27  6956    30  7793    @name = "Bruto"
.main:17:11 1   9.3     2431    9.4     2458    cts:search(fn:collection()/p4ns:postcode, cts:element-attribute-range-query(xs:QName("p4ns:segment"), fn:QName("", "name"), "=", $segment))
.main:10:16 1   7.2     1874    7.2     1885    cts:search(fn:collection()/p4ns:postcode, cts:element-attribute-value-query(xs:QName("p4ns:postcode"), fn:QName("", "id"), ("2311", "2312", "2313")))

Query results:

1234
4567
3456

Now my questions:

1) What does @name = "Bruto" mean, and why is it slow?

2) Ideally I would combine the search for docs and the lookup of the segment element via XPath into a single expression, but if I put $zoeker into a cts:search it is unsearchable... What is the best approach to get my result back in one go?

Thanks in advance!

Hugo

Recommended answer

I see two basic problems: too many trips to the database, and those trips bring back too much data that you don't really want. The goal is to minimize the number of database lookups, and make each lookup as precise as possible.

In this case the main way you are performing database lookups is cts:search. There are several of those: probably too many, and sometimes the results are never used. I think some of those are leftover experiments. When you profile it's important to profile clean code.

Next, most of the profiler time is in that @name=$segment XPath predicate. That's repeated too, and for no good reason. Get rid of the repetition and it will go faster.

However the other reason @name=$segment shows up is because MarkLogic indexes documents, not nodes. It indexes the names and values of nodes, but each index entry points to a document - or more specifically a fragment, but let's not go there. So when you have one document with tens or hundreds of index entries for segment/@name values, all those index entries point to the document root. When you ask for only the segments that match a particular name, the index lookup matches the entire document. So the evaluation has to walk each document tree. That can be expensive in CPU cycles, and that's what the profiler is showing you.
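
To make those two steps concrete, here is a minimal sketch of the lookup split into the part the index can resolve and the part that has to walk the tree. It assumes the sample document structure from the question; the names $ids and $docs are illustrative, and the comments restate the explanation above rather than any measured behavior.

```xquery
xquery version "1.0-ml";
declare namespace p4ns = "http://www.nvsp.nl/p4";

let $ids     := ("2311", "2312", "2313")
let $segment := "Bruto"

(: Index step: resolves to whole documents (fragments)
   whose postcode/@id matches one of the ids. :)
let $docs := cts:search(/p4ns:postcode,
  cts:element-attribute-value-query(
    xs:QName("p4ns:postcode"), xs:QName("id"), $ids))

(: Tree-walk step: index entries point at the document, not at
   an individual segment node, so this predicate is evaluated
   against the segments inside each returned tree. :)
return $docs/p4ns:category/p4ns:variable
            /p4ns:segment[@name = $segment]/string()
```

The second step is where the @name = "Bruto" lines in the profile come from: the predicate runs once per candidate segment in every matched document.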

There's no cure for that without restructuring the document, or perhaps doing something clever with co-occurrences. However we can clean up your query, and convert it to a single XPath expression using full paths. Let's see if this is fast enough for your use-case.

declare namespace p4ns="http://www.nvsp.nl/p4" ;

(: These might be external parameters. :)
let $segment := "Bruto"
let $ids := ("2311","2312","2313")
return collection()/p4ns:postcode[
  @id = $ids]/p4ns:category/p4ns:variable/p4ns:segment[
  @name = $segment]/string()

If I insert your sample XML and change its id to 2313, that returns the single value 1234. Profiling it shows 33 expressions in less than 1 ms, with 66% of the time in the database lookup via XPath. However, it still has to look at all the segment/@name values: in this case 14 of them, taking 10% of the time.

Note that I didn't use cts:search, nor any of your range indexes. MarkLogic automatically indexes node values for XPath value-equality lookups. You only need range indexes for special operations: for example facets, sorting, and inequality lookups.

You could do a little better with this:

(collection()/p4ns:postcode[
  @id = $ids]/p4ns:category/p4ns:variable/p4ns:segment[
  @name = $segment])[1]/string()

Now we're telling the evaluator that there's only one match expected. So it'll stop after it finds Bruto, and that's early in the document. In this case it is the first one, but on average (...)[1] should cut the number of expressions in half. Other tree-pruning techniques should also help: for example maybe you can add the category and variable names to your inputs, and express them as XPath predicates.
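
As a rough sketch of that tree-pruning idea, the category and variable names can be passed in alongside the segment name and written as predicates on each step. The inputs $category and $variable are hypothetical additions; their values come from the question (category Oplages, variable Oplage). Because (...)[1] wrapped around the whole expression yields a single item overall, this sketch applies it once per postcode so that each id still returns its own value.

```xquery
xquery version "1.0-ml";
declare namespace p4ns = "http://www.nvsp.nl/p4";

(: Hypothetical inputs; these might be external parameters. :)
let $ids      := ("2311", "2312", "2313")
let $category := "Oplages"
let $variable := "Oplage"
let $segment  := "Bruto"

(: Apply (...)[1] per postcode so each id yields its own value. :)
for $postcode in collection()/p4ns:postcode[@id = $ids]
return (
  $postcode
    /p4ns:category[@name = $category]
    /p4ns:variable[@name = $variable]
    /p4ns:segment[@name = $segment]
)[1]/string()
```

To fetch both Bruto and Netto, $segment could be the sequence ("Bruto", "Netto") with the [1] dropped. Whether this is fast enough would need to be confirmed by profiling against the real data.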

This might be a good time for you to back up and look at the big picture. What is it you're trying to accomplish with this query? There may be a much more efficient way to reach your goal.

If this is your most common use case, then ideally you would restructure your documents so that every id-segment lookup becomes a computable doc($uri) call. I'm not sure that's a good idea in this particular case, but I don't have complete knowledge of your application.
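
For illustration only, a computable-URI lookup might look like the sketch below. The URI scheme and the idea of storing each segment as its own small document are assumptions made up for this example; nothing in the question says the data is stored this way, and the caveat above about whether restructuring is a good idea still applies.

```xquery
xquery version "1.0-ml";
declare namespace p4ns = "http://www.nvsp.nl/p4";

(: Hypothetical layout: each segment of the Oplage variable is
   stored as its own document at /postcode/{id}/Oplage/{name}.xml,
   with a p4ns:segment element as the document root. :)
let $ids     := ("2311", "2312", "2313")
let $segment := "Bruto"
for $id in $ids
let $uri := fn:concat("/postcode/", $id, "/Oplage/", $segment, ".xml")
return fn:doc($uri)/p4ns:segment/string()
```

With a layout like this the URI is computed from the inputs, so there are no XPath predicates left to evaluate against a large document tree.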

Another approach is to use in-memory value indexes and https://docs.marklogic.com/cts:value-co-occurrences to avoid looking at the XML at all. However that's a complicated approach and I'm not going to explore it here.
