Why does neo4j warn: "This query builds a cartesian product between disconnected patterns"?

Problem description

After importing the data from CSV, I'm defining the relationship between two entities, Gene and Chromosome, in what I think is the simple and normal way:

MATCH (g:Gene),(c:Chromosome)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);

Yet, when I do so, neo4j (browser UI) complains:

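This query builds a cartesian product between disconnected patterns. If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (c)).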

I don't see what the issue is. chromosomeID is a very straightforward foreign key.

Recommended answer

The browser is telling you that:

It is handling your query by comparing every Gene instance with every Chromosome instance. If your DB has G genes and C chromosomes, then the complexity of the query is O(GC). For instance, if we are working with the human genome, there are 46 chromosomes and maybe 25000 genes, so the DB would have to do 1150000 comparisons.

You might be able to improve the complexity (and performance) by altering your query. For example, if we created an index on :Gene(chromosomeID), and altered the query so that we initially matched just on the nodes with the smallest cardinality (the 46 chromosomes), we would only do O(G) (or 25000) "comparisons", and those comparisons would actually be quick index lookups! This approach should be much faster.
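The index on :Gene(chromosomeID) could be created along these lines. This is only a sketch and not part of the original answer: the index name gene_chromosome_id is made up for illustration, and the exact syntax depends on the Neo4j version in use (the named FOR ... ON form is assumed to apply to 4.x and later, while 3.x releases used the older form shown in the comment).

```cypher
// Assumed syntax for Neo4j 4.x and later; the index name is hypothetical.
CREATE INDEX gene_chromosome_id FOR (g:Gene) ON (g.chromosomeID);

// On Neo4j 3.x the equivalent statement would be:
// CREATE INDEX ON :Gene(chromosomeID);
```

The index is populated in the background, so on a large dataset it may take a moment before the planner can use it.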

Once we have created the index, we can use this query:

MATCH (c:Chromosome)
WITH c
MATCH (g:Gene) 
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);

It uses a WITH clause to force the first MATCH clause to execute first, avoiding the cartesian product. The second MATCH (and WHERE) clause uses the results of the first MATCH clause and the index to quickly get the exact genes that belong to each chromosome.

[UPDATE]

The WITH clause was helpful when this answer was originally written. The Cypher planner in newer versions of neo4j (like 4.0.3) now generates the same plan even if the WITH is omitted, without creating a cartesian product. You can always PROFILE both versions of your query to see the effect with/without the WITH.
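As a rough sketch of that comparison, the two variants below replace CREATE with a read-only RETURN count(*). PROFILE actually executes the query, so profiling the original would write the relationships. That substitution is an assumption made here for safety, and the resulting plans may differ somewhat from those of the write query. In the output, a CartesianProduct operator would indicate the cross product, while a NodeIndexSeek on Gene would suggest the index is being used.

```cypher
// Variant 1: with the WITH clause (run each statement separately).
PROFILE
MATCH (c:Chromosome)
WITH c
MATCH (g:Gene)
WHERE g.chromosomeID = c.chromosomeID
RETURN count(*) AS pairs;

// Variant 2: without WITH.
PROFILE
MATCH (c:Chromosome)
MATCH (g:Gene)
WHERE g.chromosomeID = c.chromosomeID
RETURN count(*) AS pairs;
```

Swapping PROFILE for EXPLAIN shows the planned operators without running the query at all, which may be preferable on a large dataset.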
