Problem description
My question is similar to the one pointed out here:
I have two tables, Entity (Entities.txt) and Relationships (EntitiesRelationships_Updated.txt), shown below. Both files are in the import folder of the Neo4j database. I am trying to load the tables with the LOAD CSV command and then create relationships.
In the Entity table below, a PARENTID of 0 means that the ENT_ID has no parent; any other value is the ENT_ID of its parent. For example, ENT_ID = 3 is the parent of ENT_ID = 4, and ENT_ID = 1 is the parent of ENT_ID = 2.
**Entity Table**
ENT_ID Name PARENTID
1 ABC 0
2 DEF 1
3 GHI 0
4 JKG 3
**Relationship Table**
RID ENT_IDPARENT ENT_IDCHILD
1 1 2
2 3 5
The Entity table has 2 million records and the Relationship table has about 400K lines.
Each RID has a particular tag associated with it. For example, RID = 1 means the relation is A FATHER_OF B, and RID = 2 means the relation is A MOTHER_OF B. There are 20 such RIDs in total.
Both files are in txt format.
My first step is to load the Entity table. I used the following script:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///Entities.txt" AS Entity FIELDTERMINATOR '|'
CREATE (n:Entity{ENT_ID: toInt(Entity.ENT_ID),NAME: Entity.NAME,PARENTID: toInt(Entity.PARENTID)})
This query works fine. It takes about 10 minutes to load 2.8 million records. The next step is to index the records:
CREATE INDEX ON :Entity(PARENTID)
CREATE INDEX ON :Entity(ENT_ID)
These statements run fine as well. Next, I tried creating the relationships from the Relationship table using a query similar to the one in the link above:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///EntitiesRelationships_Updated.txt" AS Rships FIELDTERMINATOR '|'
MATCH (n:A {ENT_IDPARENT : Rships.ENT_IDPARENT})
with Entity, n
MATCH (m:B {ENT_IDCHILD : Rships.ENT_IDCHILD})
with m,n
MERGE (n)-[r:RELATION_OF]->(m);
When I do this, the query keeps running for about an hour and then stops at a particular size (2.2 GB in my case). I based this query on the link above. It includes the edit from the solution below and still does not work.
I have one more query, shown below (also based on the link above). I run it because I want to create relationships based on the Entity table:
PROFILE
MATCH(Entity)
MATCH (a:Entity {ENT_ID : Entity.ENT_ID})
WITH Entity, a
MATCH (b:Entity {PARENTID : Entity.PARENTID})
WITH a,b
MERGE (a)-[r:PARENT_OF]->(b)
When I run this query, I get a Java heap space error. Unfortunately, I have not been able to find a solution to either problem.
Could you please advise if I am doing something wrong?
Recommended answer
This query allows you to take advantage of your :Entity(ENT_ID) index:
MATCH (child:Entity)
WHERE child.PARENTID > 0
WITH child.PARENTID AS pid, child
MATCH (parent:Entity {ENT_ID : pid})
MERGE (parent)-[:PARENT_OF]->(child);
Cypher does not use indices when the property value comes from another node. To get around that, the above query uses a WITH clause to represent child.PARENTID as a variable (pid). The time complexity of this query should be O(N). Your original query has a complexity of O(N * N).
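One way to check whether the index is actually picked up is to prefix the query with EXPLAIN, which shows the execution plan without running the query or writing anything. This is a minimal sketch that reuses the query above unchanged. In the resulting plan, a NodeIndexSeek operator on the parent lookup would suggest the :Entity(ENT_ID) index is being used, whereas a label scan for that lookup would suggest it is not.

```cypher
// EXPLAIN only plans the query; no relationships are created.
EXPLAIN
MATCH (child:Entity)
WHERE child.PARENTID > 0
WITH child.PARENTID AS pid, child
MATCH (parent:Entity {ENT_ID : pid})
MERGE (parent)-[:PARENT_OF]->(child);
```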
If the above query takes too long or encounters errors that might be related to running out of memory, try this variant, which creates 1000 new relationships at a time. You can change 1000 to any number that is workable for you.
MATCH (child:Entity)
WHERE child.PARENTID > 0 AND NOT ()-[:PARENT_OF]->(child)
WITH child.PARENTID AS pid, child
LIMIT 1000
MATCH (parent:Entity {ENT_ID : pid})
CREATE (parent)-[:PARENT_OF]->(child)
RETURN COUNT(*);
The WHERE clause filters out child nodes that already have a parent relationship, and the MERGE operation has been changed to a simpler CREATE operation, since we have already ascertained that the relationship does not yet exist. The query returns a count of the relationships created. If the result is less than 1000, then all parent relationships have been created.
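Between batches, it may help to see how much work is left. The sketch below counts the child nodes that still lack a PARENT_OF relationship, using the same filter as the batch query; the alias `remaining` is just an illustrative name. Note that a child whose PARENTID does not match any existing ENT_ID can never be linked, so a count that stays above 0 after the batches stop creating relationships may point to such rows in the data rather than to unfinished work.

```cypher
// Read-only check: children with a PARENTID but no incoming PARENT_OF yet.
MATCH (child:Entity)
WHERE child.PARENTID > 0 AND NOT ()-[:PARENT_OF]->(child)
RETURN COUNT(child) AS remaining;
```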
Finally, to automate the repeated queries, you can install the APOC plugin on the Neo4j server and use the apoc.periodic.commit procedure, which repeatedly invokes a query until it returns 0. In this example, I use a limit parameter of 10000:
CALL apoc.periodic.commit(
"MATCH (child:Entity)
WHERE child.PARENTID > 0 AND NOT ()-[:PARENT_OF]->(child)
WITH child.PARENTID AS pid, child
LIMIT {limit}
MATCH (parent:Entity {ENT_ID : pid})
CREATE (parent)-[:PARENT_OF]->(child)
RETURN COUNT(*);",
{limit: 10000});
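The same index-friendly idea could plausibly be applied to the relationship file from the question. The sketch below is not part of the original answer and rests on a few assumptions: that the nodes to connect are the :Entity nodes loaded earlier (the question's query matches labels :A and :B, which were never created), that ENT_IDPARENT and ENT_IDCHILD hold ENT_ID values, and that they need converting to integers because LOAD CSV reads every field as a string while ENT_ID was stored via toInt. It keeps the generic RELATION_OF type from the question and does not map RID to a specific relationship type.

```cypher
// Assumes the :Entity(ENT_ID) index already exists.
// toInt matches the question's load script; newer Neo4j versions call it toInteger.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///EntitiesRelationships_Updated.txt" AS Rships FIELDTERMINATOR '|'
MATCH (parent:Entity {ENT_ID : toInt(Rships.ENT_IDPARENT)})
MATCH (child:Entity {ENT_ID : toInt(Rships.ENT_IDCHILD)})
MERGE (parent)-[:RELATION_OF]->(child);
```

Rows that reference an ENT_ID with no matching node simply produce no relationship, since the MATCH clauses find nothing for them.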