This article walks through a question about a LOAD CSV query that was still not finished even after 12 hours, together with the recommended answer; hopefully it is a useful reference for anyone hitting the same problem.

Problem description

I have been using Neo4j for quite a while now. I ran this query successfully before my computer crashed 7 days ago, and somehow I am unable to run it now. I need to create a graph database out of a CSV of bank transactions. The original dataset has around 5 million rows and around 60 columns.

This is the query I used, starting from the 'Export CSV from real data' demo by Nicole White:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Transactions_with_risk_scores.csv" AS line
WITH DISTINCT line, SPLIT(line.VALUE_DATE, "/") AS date
WHERE line.TRANSACTION_ID IS NOT NULL AND line.VALUE_DATE IS NOT NULL
MERGE (transaction:Transaction {id:line.TRANSACTION_ID})
SET transaction.base_currency_amount = toInteger(line.AMOUNT_IN_BASE_CURRENCY),
transaction.base_currency = line.BASE_CURRENCY,
transaction.cd_code = line.CREDIT_DEBIT_CODE,
transaction.txn_type_code = line.TRANSACTION_TYPE_CODE,
transaction.instrument = line.INSTRUMENT,
transaction.region = line.REGION,
transaction.scope = line.SCOPE,
transaction.COUNTRY_RISK_SCORE = line.COUNTRY_RISK_SCORE,
transaction.year = toInteger(date[2]),
transaction.month = toInteger(date[1]),
transaction.day = toInteger(date[0]);

What I have tried:

  1. Used LIMIT 0 before running the query, as per Michael Hunger's suggestion in a post about loading large datasets (a sketch of this trick follows after this list).

  2. Used a single MERGE per statement (this is the first merge; there are 4 other merges to be used), as suggested by Michael again in another post.

  3. Tried CALL apoc.periodic.iterate and apoc.cypher.parallel, but they don't work with LOAD CSV (they seem to work only with MERGE and CREATE queries without LOAD CSV). I get the following error with CALL apoc.periodic.iterate(""): Neo.ClientError.Statement.SyntaxError: Invalid input 'f': expected whitespace, '.', node labels, '[', "=~", IN, STARTS, ENDS, CONTAINS, IS, '^', '*', '/', '%', '+', '-', '=', '~', "<>", "!=", '<', '>', "<=", ">=", AND, XOR, OR, ',' or ')' (line 2, column 29 (offset: 57))

  4. Increased the max heap size to 16G, as my laptop has 16GB of RAM. By the way, I am finding it difficult to write this post, as I tried running the query again just now with PROFILE and it has been running for an hour.
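For reference, a minimal sketch of that LIMIT trick, assuming the intent is to inspect the query plan on zero rows before committing to a full run (the file name is taken from the query above):

// Profile the statement with LIMIT 0: the plan and its operators are produced,
// but no rows flow through, so nothing is written to the database.
PROFILE
LOAD CSV WITH HEADERS FROM "file:///Transactions_with_risk_scores.csv" AS line
WITH line LIMIT 0
MERGE (transaction:Transaction {id: line.TRANSACTION_ID});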

Help needed to load this 5-million-row dataset. Any help would be highly appreciated. Thanks in advance! I am using Neo4j 3.5.1 on PC.

Recommended answer

  1. MOST IMPORTANT: Create an index/constraint on the key property (see the constraint example after this list).
  2. Do not set the max heap size to the full system RAM. Set it to 50% (see the configuration sketch after this list).
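For the first point, a minimal sketch of a uniqueness constraint on the MERGE key, in Neo4j 3.5 syntax to match the version mentioned in the question; the constraint also creates the backing index, so each MERGE becomes an index lookup instead of a label scan:

// Unique constraint on the key used by MERGE (transaction:Transaction {id: ...})
CREATE CONSTRAINT ON (t:Transaction) ASSERT t.id IS UNIQUE;

For the second point, a sketch of the corresponding neo4j.conf settings, assuming the 16GB laptop from the question (roughly 50% of system RAM):

# neo4j.conf: cap the heap at about half of the 16GB of system RAM
dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g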

Try ON CREATE SET instead of SET (see the sketch below).

You can also use apoc.periodic.iterate to load the data, but USING PERIODIC COMMIT is also fine. (A sketch follows after the note below.)

NOTE: If you use apoc.periodic.iterate to MERGE nodes/relationships with the parameter parallel=true, it fails with a NullPointerException, so use it carefully.
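A sketch of the non-parallel apoc.periodic.iterate form, abbreviated here to a single SET property. Note that both statements are passed as quoted strings; the questioner's 'Invalid input f' syntax error looks like what the parser produces when the inner LOAD CSV statement is not quoted as a string, though that is an inference from the error alone:

CALL apoc.periodic.iterate(
  'LOAD CSV WITH HEADERS FROM "file:///Transactions_with_risk_scores.csv" AS line RETURN line',
  'MERGE (transaction:Transaction {id: line.TRANSACTION_ID})
   SET transaction.base_currency = line.BASE_CURRENCY',
  {batchSize: 1000, parallel: false}  // parallel:false because of the NullPointerException note above
);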

Questioner edit: Removing DISTINCT in the 3rd line for the Transaction node and re-running the query worked! (See the corrected opening below.)
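A plausible explanation for the hang: WITH DISTINCT line has to hold every distinct row in memory to deduplicate, which for roughly 5 million rows of about 60 columns approaches the entire file in heap, and it stops rows from streaming straight into the periodic commits. The corrected opening of the query looks like this (the SET clause continues exactly as in the original):

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Transactions_with_risk_scores.csv" AS line
WITH line, SPLIT(line.VALUE_DATE, "/") AS date
WHERE line.TRANSACTION_ID IS NOT NULL AND line.VALUE_DATE IS NOT NULL
MERGE (transaction:Transaction {id: line.TRANSACTION_ID})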

That concludes this look at the LOAD CSV query that was still not finished even after 12 hours; hopefully the recommended answer helps.
