I am trying to use apoc.periodic.iterate to reduce heap usage when running very large transactions against a Neo4j database. I've been following the advice given in this presentation. However, my results differ from those shown in the slides.

First, some notes on my setup:

  • Using Neo4j Desktop, graph version 4.0.3 Enterprise, with APOC 4.0.0.10
  • I'm calling queries using the .NET Neo4j Driver, version 4.0.1.
  • neo4j.conf values:
    • dbms.memory.heap.initial_size=2g
    • dbms.memory.heap.max_size=4g
    • dbms.memory.pagecache.size=2g

Here is the Cypher query I'm running:

CALL apoc.periodic.iterate(
"UNWIND $nodes AS newNodeObj RETURN newNodeObj",
"CREATE(n:MyNode)
SET n = newNodeObj",
{batchSize:2000, iterateList:true, parallel:false, params: { nodes: $nodes_in } }
)

And the line of C#:

var createNodesResCursor = await session.RunAsync(createNodesQueryString, new { nodes_in = nodeData });

where createNodesQueryString is the query above, and nodeData is a List<Dictionary<string, object>> where each Dictionary has just three entries: 2 strings, 1 long.

When I run this to create 1.3 million nodes, I observe (via JConsole) the heap usage climbing all the way to the 4 GB available and then bouncing between roughly 2.5 GB and 4 GB. Reducing the batch size makes no discernible difference, and raising dbms.memory.heap.max_size just causes the heap usage to climb to almost that new value. It's also really slow, taking 30+ minutes to create those 1.3 million nodes.

Does anyone have any idea what I may be doing wrong, or differently from the linked presentation? I understand that my query does a CREATE, whereas the presentation only updates an already loaded dataset, but I can't imagine that's the reason my heap usage is so high.

Thanks

cybersam

1 Answer

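My issue was that, although I was using apoc.periodic.iterate, I was still sending the entire 1.3 million-node data set to the database as a single parameter of the query! The batching inside apoc.periodic.iterate only happens after the server has already received and held the whole list.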

Modifying my code to do the batching myself on the client side, as follows, fixed both the heap usage and the slowness:

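  const int batchSize = 2000;

  // The query text never changes, so define it once outside the loop.
  const string createNodesQueryString = @"
    UNWIND $nodes_in AS newNodeObj
    CREATE (n:MyNode)
    SET n = newNodeObj";

  for (int count = 0; count < nodeData.Count; count += batchSize)
  {
    // The final batch may be smaller than batchSize.
    int length = Math.Min(batchSize, nodeData.Count - count);

    // Send only the current slice as the parameter. nodeData is already a List,
    // so GetRange can be called on it directly without copying it via ToList().
    var createNodesResCursor = await session.RunAsync(createNodesQueryString,
                                                new { nodes_in = nodeData.GetRange(count, length) });

    // Consume the result so this batch completes before the next one is sent.
    var createNodesResSummary = await createNodesResCursor.ConsumeAsync();
  }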
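
If the same slicing is needed in more than one place, it could be pulled out into a small helper. The following is only a sketch: the `Batching` class and the `InBatches` method are hypothetical names of my own, not part of the Neo4j driver, and it assumes the data is held in a `List<T>` as in the question.

```csharp
using System;
using System.Collections.Generic;

public static class Batching
{
    // Hypothetical helper (not part of the Neo4j driver): lazily yields
    // consecutive slices of at most batchSize items. The last slice may be shorter.
    // Because this is an iterator, the argument check runs when enumeration starts.
    public static IEnumerable<List<T>> InBatches<T>(List<T> source, int batchSize)
    {
        if (batchSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(batchSize));

        for (int start = 0; start < source.Count; start += batchSize)
        {
            int length = Math.Min(batchSize, source.Count - start);
            yield return source.GetRange(start, length);
        }
    }
}
```

The loop would then become `foreach (var batch in Batching.InBatches(nodeData, 2000))`, passing `new { nodes_in = batch }` to `session.RunAsync` and consuming the result as before. On .NET 6 or later, the built-in `Enumerable.Chunk` should serve the same purpose.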