Lucene in Neo4j has some misbehaviour in terms of reliable search queries compared to OrientDB

Problem description

I'm still evaluating Neo4j vs. OrientDB. Most importantly, I need Lucene as the full-text index engine, so I created the same schema with the same data (300 million rows) in both databases. I have experience querying both systems, and I used the Standard Analyzer on both sides. The OrientDB test query results are all fine in terms of reliability and speed. Neo4j's speed is also OK, but the results are poor in most cases. Below are the issues I have with Neo4j's Lucene indexing. For each one I show how the query looks in OrientDB and which result set you should get from it.

In these examples there are Applns that have one or more titles. Titles are indexed with Lucene in both databases. Applns also have an ID, used here just to demonstrate the ordering. After each query I list some questions about it. It would be great to get some feedback or even answers.

This query is very simple. It tests how each database behaves when the search is a single word and nothing else. As you can see, the Neo4j results are much longer than the ones from OrientDB. OrientDB uses TFIDF to keep the results short and more relevant to the actual search. The first OrientDB result is a title that is just SOLAR; that title is missing from the Neo4j results entirely.

In Neo4j: START n=node:titles('title:solar') RETURN n.title,n.ID LIMIT 10


  1. SOLAR RADIATION SHIELDING PARTICULATE AND SOLAR RADIATION SHIELDING RESIN MATERIAL DISPERSED WITH ... 38321319
  2. Solar module for cooling solar cells on the underside of a solar panel has air inlet and outlet openings ... 12944121
  3. Solar construction component for solar thermal assemblies, solar thermal assembly, method for operating a solar... 324146113

...

In OrientDB: SELECT title,ID FROM Appln WHERE title LUCENE "solar" LIMIT 10


  1. SOLAR 24900187
  2. Solar unit and solar apparatus 1876343
  3. Solar module with solar concentrator 13496706

...

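Questions:

  1. Why is Neo4j not using TFIDF, or what is it using instead?
  2. Is Neo4j able to apply some ordering by how well the keyword matches?
  3. Is it possible to change TFIDF to something else in OrientDB?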
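To make the TFIDF observation concrete, here is a minimal sketch of why a relevance score with length normalization ranks the bare title SOLAR first. It assumes the classic Lucene-style shape (square root of the term frequency divided by the square root of the title length); whether either database scores exactly like this is an assumption, and the `tokenize` and `score` helpers are hypothetical names used only for illustration.

```python
import math
import re


def tokenize(title):
    # Lowercase and split on anything that is not a letter or digit,
    # roughly what a standard analyzer does (no stop-word removal here).
    return [t for t in re.split(r"[^a-z0-9]+", title.lower()) if t]


def score(title, term):
    # Simplified TF-IDF shape: sqrt(term frequency) times a length norm.
    # The idf factor is left out because it is the same for every title
    # when the query consists of a single term.
    tokens = tokenize(title)
    freq = tokens.count(term)
    if freq == 0:
        return 0.0
    return math.sqrt(freq) / math.sqrt(len(tokens))


# Titles taken from the result lists above.
titles = [
    "Stackable flat-roof/floor frame for solar panels",
    "Solar unit and solar apparatus",
    "SOLAR",
    "Solar module with solar concentrator",
]

# Highest score first; ties keep their original relative order.
ranked = sorted(titles, key=lambda t: score(t, "solar"), reverse=True)
for t in ranked:
    print(f"{score(t, 'solar'):.3f}  {t}")
```

Short titles that consist mostly of the search term float to the top, which matches the shape of the OrientDB result list; long titles that mention the term once sink to the bottom.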



Query #1: One word query with order by ID

Neo4j orders by ID before applying TFIDF. As seen in Query #0, Neo4j is not using TFIDF, so it is basically just taking the first results of the Lucene query. OrientDB, by contrast, still searches by good TFIDF scores and then orders the results.

In Neo4j: START n=node:titles('title:solar') RETURN n.title,n.ID ORDER BY n.ID ASC LIMIT 10


  1. Stackable flat-roof/floor frame for solar panels 318
  2. Method for producing contact for solar cells 636
  3. Solar cell and fabrication method thereof 1217

...

In OrientDB: SELECT title,ID FROM Appln WHERE title LUCENE "solar" ORDER BY ID ASC LIMIT 10


  1. Solar unit and solar apparatus 1876343
  2. Solar module with solar concentrator 13496706
  3. SOLAR TRACKER FOR SOLAR COLLECTOR 16543688

...

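Questions:

  1. How should a search in OrientDB look so that it is ordered by ID and still matches the best TFIDF scores?
  2. Is there a way in Neo4j to order by Lucene match before ordering by ID?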
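One possible reading of the two result lists is that the engines apply ORDER BY and LIMIT at different stages. The sketch below contrasts the two strategies on a handful of rows. The IDs come from the results above, but the scores are made up for illustration, and which strategy each database actually follows is an assumption, not something the results prove.

```python
# Hypothetical (id, relevance score) pairs for documents matching "solar".
matches = [
    (318, 0.35),
    (636, 0.38),
    (1217, 0.41),
    (1876343, 0.63),
    (13496706, 0.63),
    (24900187, 1.00),
]


def order_then_limit(rows, k):
    # Sort every match by ID and keep the first k: relevance plays no role.
    return sorted(rows, key=lambda r: r[0])[:k]


def top_k_then_order(rows, k):
    # Keep the k best-scoring matches first, then sort only those by ID.
    best = sorted(rows, key=lambda r: r[1], reverse=True)[:k]
    return sorted(best, key=lambda r: r[0])


print([doc_id for doc_id, _ in order_then_limit(matches, 3)])
print([doc_id for doc_id, _ in top_k_then_order(matches, 3)])
```

The first strategy returns the lowest IDs among all matches, as in the Neo4j list; the second returns only high-scoring documents, sorted by ID afterwards, which resembles the OrientDB list.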



Query #2: One word with a star search

The star search had no influence on the Neo4j results. The OrientDB results changed in a good way.

In Neo4j: START n=node:titles('title:solar*') RETURN n.title,n.ID ORDER BY n.ID ASC LIMIT 10


  1. Stackable flat-roof/floor frame for solar panels 318
  2. Method for producing contact for solar cells 636
  3. Solar cell and fabrication method thereof 1217

...

In OrientDB: SELECT title,ID FROM Appln WHERE title LUCENE "solar*" ORDER BY ID ASC LIMIT 10


  1. High performance solar methane generator 8354701
  2. All-plastic honeycomb solar water-heater 8355379
  3. Plate type solar energy heat collector plate core and its manufacturing method 8356173

...

Questions:

  1. Does Neo4j ignore the star search?



Query #3: Searching for 2 words divided by a space

The strange thing here is that you have to change 'title:solar panel' to the query form used here. Otherwise you just get errors. OrientDB seems good so far.

In Neo4j: START n=node:titles(title="solar panel") RETURN n.title,n.ID ORDER BY n.ID ASC LIMIT 10


Returned 0 rows in 817 ms

In OrientDB: SELECT title,ID FROM Appln WHERE title LUCENE "solar panel" ORDER BY ID ASC LIMIT 10


  1. SOLAR PANEL 1584567
  2. SOLAR PANEL 1616547
  3. SOLAR PANEL 2078382
  4. SOLAR PANEL 2078383
  5. Solar panel 2178466

...

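Questions:

  1. Why does Neo4j need a special query form here just to avoid throwing errors?
  2. Why does the query fail and return nothing? I know Neo4j is searching for lower-case letters here, so it is case sensitive. But why is that? I use the default analyzer, and according to the Neo4j Lucene documentation to_lower_letter is true for it.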



Query #4: Now searching for the same query in capital letters

The same issue as in #3. Neo4j only returns the results in which the words are written in capital letters. The OrientDB results look fine again.

In Neo4j: START n=node:titles(title="SOLAR PANEL") RETURN n.title,n.ID ORDER BY n.ID ASC LIMIT 10


  1. SOLAR PANEL 348800
  2. SOLAR PANEL 420683
  3. SOLAR PANEL 1393804
  4. SOLAR PANEL 1584567
  5. SOLAR PANEL 1616547

...

In OrientDB: SELECT title,ID FROM Appln WHERE title LUCENE "SOLAR PANEL" ORDER BY ID ASC LIMIT 10


  1. SOLAR PANEL 1584567
  2. SOLAR PANEL 1616547
  3. SOLAR PANEL 2078382
  4. SOLAR PANEL 2078383
  5. Solar panel 2178466

...

Questions:

  1. The same question as in #3: how do I search with to_lower_letter?



Query #5: Combining two words and using the star search

Here I want to combine a multi-word search with a star search. With the equals search I cannot find any matches, because it treats the star as an ordinary character in the title. But I also cannot write 'title:SOLAR PANEL*'; that is not allowed either. In OrientDB everything is fine.

In Neo4j: START n=node:titles(title="SOLAR PANEL*") RETURN n.title,n.ID ORDER BY n.ID ASC LIMIT 10


Returned 0 rows in 895 ms

In OrientDB: SELECT title,ID FROM Appln WHERE title LUCENE "SOLAR PANEL*" ORDER BY ID ASC LIMIT 10


  1. SOLAR PANELS 1405717
  2. SOLAR PANEL 1584567
  3. SOLAR PANEL 1616547
  4. SOLAR PANEL 2705081
  5. Solar Panel 2766555

...

Questions:

  1. How can I combine several words with a star search in Neo4j?



Query #6: Counting query results

The last thing I really need is a fast lookup of how many results there are overall. Here Neo4j is finding a result way faster but always finds fewer matches than OrientDB. For Solar the two counts are fairly close to each other, but in another test they were not that close.

In Neo4j: START n=node:titles('title:Solar') RETURN count(*)

143211 in 220 sec

In OrientDB: SELECT count(*) title FROM Appln WHERE title LUCENE "Solar" LIMIT -1

148029 in 50 sec

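Questions:

  1. How can the lookup time be improved on both systems?
  2. Why do the two systems find different numbers of matches? This happens with other keywords too. Maybe a different index is being used?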

Well, that is everything for now. If you need any other query, just tell me and I will deliver it. I think it's very important to compare the Lucene implementations, because with millions of nodes Lucene has many advantages. Thanks for any small tip.

Btw: please don't give tips about using Java code instead of the queries. I want to use Cypher because the requests should be made in the browser, as in OrientDB. I know that everything here can easily be done with Java code. Thank you.

Recommended answer

Well, I want to share what I have found out about my issues so far:


  1. It is not possible to change the TFIDF of Neo4j. They use their own implementation, which cannot be changed.
  2. In OrientDB, ordering before searching is currently slow.

SELECT FROM (
  SELECT title,ID FROM Appln WHERE title LUCENE "solar*" ORDER BY ID ASC
)  LIMIT 1

Query executed in 11.531 sec. Returned 1 record(s)


SELECT FROM (
  SELECT title,ID FROM Appln WHERE title LUCENE "solar*" ORDER BY ID ASC
)  LIMIT 10

Query executed in 225.176 sec. Returned 10 record(s)

The reason it is that slow is that this ordering does not correspond with Lucene.



Fixing Query #3, #4 and #5:

The query is not correct. The equals sign is a direct match and not the fuzzy one. So

START n=node:titles(title="solar panel") RETURN n.title,n.ID ORDER BY n.ID ASC LIMIT 10

needs to be replaced with

START n=node:titles('title:solar\\ panel') RETURN n.title,n.ID ORDER BY n.ID ASC LIMIT 10

It is really bad that you need to escape things in Cypher. Here the order of the two words matters. But there is another way to express it:

START n=node:titles('title:SoLar AND title:Panel') RETURN n.title,n.ID ORDER BY n.ID ASC LIMIT 10

That is also really bad, though: if you imagine you just have a string and want to ask Neo4j for results, you need a parser. But here the order of the words does not matter.
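The "parser" mentioned above can be quite small if the input is a plain string of words. The following is a minimal sketch that builds both query forms from such a string. All function names are hypothetical, the list of special characters is the one documented for the classic Lucene query parser, and the Cypher quoting assumes single-quoted string literals with backslash escapes as in the queries above.

```python
# Characters the classic Lucene query parser treats as syntax.
LUCENE_SPECIAL = set('+-!(){}[]^"~*?:\\/&|')


def escape_term(text):
    # Backslash-escape special characters and spaces so that the text is
    # read as one literal term.
    return "".join(
        "\\" + ch if ch in LUCENE_SPECIAL or ch == " " else ch for ch in text
    )


def escaped_query(field, text):
    # Word order matters: "solar panel" becomes title:solar\ panel
    return f"{field}:{escape_term(text.strip())}"


def and_query(field, text):
    # Word order does not matter: "solar panel" becomes
    # title:solar AND title:panel
    return " AND ".join(f"{field}:{escape_term(w)}" for w in text.split())


def cypher_string_literal(lucene_query):
    # Inside a single-quoted Cypher string, backslashes and single quotes
    # have to be escaped once more.
    escaped = lucene_query.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


print(cypher_string_literal(escaped_query("title", "solar panel")))
print(cypher_string_literal(and_query("title", "SoLar Panel")))
```

The two printed literals correspond to the index arguments of the two corrected queries above. The sketch does not handle input words such as AND or OR, which the query parser would still read as operators.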

OrientDB is currently working on making counting faster (milliseconds). It is planned for the 2.0 release, due in a few days.

Neo4j has no plans for this.

