Question
I'm defining the relationship between two entities, Gene and Chromosome, in what I think is the simple and normal way, after importing the data from CSV:
MATCH (g:Gene),(c:Chromosome)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);
Yet, when I do so, Neo4j (browser UI) complains:
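This query builds a cartesian product between disconnected patterns. If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (c)).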
I don't see what the issue is. chromosomeID is a very straightforward foreign key.
Recommended answer
The browser is telling you that:
- It is handling your query by comparing every Gene instance with every Chromosome instance. If your DB has G genes and C chromosomes, then the complexity of the query is O(GC). For instance, if we are working with the human genome, there are 46 chromosomes and maybe 25000 genes, so the DB would have to do 1150000 comparisons.
- You might be able to improve the complexity (and performance) by altering your query. For example, if we created an index on :Gene(chromosomeID), and altered the query so that we initially matched just on the node with the smallest cardinality (the 46 chromosomes), we would only do O(G) (or 25000) "comparisons", and those comparisons would actually be quick index lookups! This approach should be much faster.
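As a rough sketch of the index step, one of the two statements below should apply, depending on your Neo4j version. The index name gene_chromosome_id is a hypothetical choice for illustration and does not come from the original answer.

```cypher
// Neo4j 3.x: legacy syntax (deprecated in later releases)
CREATE INDEX ON :Gene(chromosomeID);

// Neo4j 4.0 and later: named index; the name here is an assumption, pick your own
CREATE INDEX gene_chromosome_id FOR (g:Gene) ON (g.chromosomeID);
```

Run only the statement that matches your version, and let the index come online before running the relationship query.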
Once we have created the index, we can use this query:
MATCH (c:Chromosome)
WITH c
MATCH (g:Gene)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);
It uses a WITH clause to force the first MATCH clause to execute first, avoiding the cartesian product. The second MATCH (and WHERE) clause uses the results of the first MATCH clause and the index to quickly get the exact genes that belong to each chromosome.
[UPDATE]
The WITH clause was helpful when this answer was originally written. The Cypher planner in newer versions of Neo4j (like 4.0.3) now generates the same plan even if the WITH is omitted, without creating a cartesian product. You can always PROFILE both versions of your query to see the effect with/without the WITH.
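One caveat when comparing plans: PROFILE actually runs the query, so profiling the CREATE statement would add the PART_OF relationships a second time. A minimal sketch, assuming you only want to inspect the plans, is to prefix each version with EXPLAIN, which shows the plan without executing the query:

```cypher
// Version with WITH
EXPLAIN
MATCH (c:Chromosome)
WITH c
MATCH (g:Gene)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);

// Version without WITH
EXPLAIN
MATCH (c:Chromosome)
MATCH (g:Gene)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);
```

Run each statement separately. If neither plan shows a CartesianProduct operator, the planner is handling both versions without the cross product.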