hetnetpy

Speeding up the Neo4j import

Open veleritas opened this issue 9 years ago • 11 comments

Hi Daniel,

What was the reasoning behind importing the nodes and edges of the hetnet using the py2neo interface? I'm finding that the import process is quite slow even for small sized networks, and was wondering whether I should look into the batch CSV import that neo4j comes with.

From my experiments it seems like importing 20000 nodes and 22000 edges into neo4j with the current code takes roughly 45 minutes on an AWS instance with 8 cores and 32 GB RAM. At this speed it would basically take forever to load the entire network, so I'm wondering if I'm missing anything here.

Best, Toby

veleritas, Feb 01 '17 22:02

> At this speed it would basically take forever to load the entire network, so I'm wondering if I'm missing anything here.

Yeah it took ~10 hours.

> I'm finding that the import process is quite slow even for small sized networks, and was wondering whether I should look into the batch CSV import that neo4j comes with.

I personally didn't spend too much time optimizing because I didn't plan on running this Neo4j import step too often. If you're running it a lot, you may want to look into solutions.

The problem with the batch TSV import is that TSVs are bad at representing properties that only exist for a single node or relationship type. However, if you don't care about properties (besides name, which every node has), you could use the TSV import. Or perhaps you can make a TSV where missing values don't get written as properties. Or make several TSVs (one for each node and relationship type). Looking at the import tool doc, this could be the way to go.
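The "several TSVs" idea can be sketched roughly as follows. This is a minimal illustration, not code from hetio: the input layout (a list of dicts with `id`, `kind`, `name`, and `properties`) and the helper name `write_node_csvs` are assumptions made for the example. The `:ID` and `:LABEL` header fields follow the Neo4j import tool's CSV header convention. Because each node kind gets its own file, a column only exists where that kind actually uses the property.

```python
import csv
from collections import defaultdict
from pathlib import Path


def write_node_csvs(nodes, out_dir):
    """Write one CSV per node kind (hypothetical helper, not part of hetio).

    Each file only has columns for the properties seen on that kind, so a
    property used by one node type never becomes a column for another.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group nodes by kind so each kind gets its own file and header.
    by_kind = defaultdict(list)
    for node in nodes:
        by_kind[node["kind"]].append(node)

    paths = {}
    for kind, members in by_kind.items():
        # Union of property names seen on this kind, in a stable order.
        props = sorted({key for node in members for key in node["properties"]})
        path = out_dir / f"nodes_{kind.replace(' ', '_')}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            # "id:ID" and ":LABEL" are import-tool header fields.
            writer.writerow(["id:ID", "name"] + props + [":LABEL"])
            for node in members:
                row = [node["id"], node["name"]]
                # A node lacking a property of its kind gets a blank cell.
                row += [node["properties"].get(prop, "") for prop in props]
                row.append(kind)
                writer.writerow(row)
        paths[kind] = path
    return paths
```

Blank cells can still occur when two nodes of the same kind carry different properties, so it is worth checking how the importer treats empty fields before relying on this.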

Ah, now I remember another reason I didn't use the import tool. I don't think I was able to fully automate the import, i.e. there were network-specific commands that had to be written, which would decrease the versatility of the code. I wanted hetio to work for any hetnet, not just Hetionet. Not sure if it's now possible to use the import tool for any hetio network.

dhimmel, Feb 02 '17 16:02

If you want to use the import tool, the Hetionet TSVs could be a good place to start and get benchmarks.

dhimmel, Feb 02 '17 16:02

After some testing it turns out that it is actually much faster to import edges individually into neo4j when using py2neo version 3. As described in this link, it seems that py2neo version 3 uses subgraphs in order to make updates to neo4j.

Effect of batch size on import speed:

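| Imported object | Batch size | Import speed |
| --- | --- | --- |
| Node | 1 | ~100/s |
| Node | 10 | ~400/s |
| Node | 100 | ~500/s |
| Node | 200 | ~550/s |
| Node | 500 | ~530/s |
| Node | 1000 | ~500/s |
| Edge | 1 | ~120/s |
| Edge | 5 | ~90/s |
| Edge | 10 | ~100/s |
| Edge | 100 | ~7/s |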

Based on these results, it seems that updating neo4j with a subgraph containing multiple edges is actually slower than updating the graph one edge at a time. I suspect that this is because the underlying py2neo code converts the subgraph back into individual edges anyway, and therefore spends time making redundant calculations. All speed estimates are approximate, and testing was done on an AWS m4.2xlarge instance with EBS.
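A batch-size comparison like the one above can be reproduced with a small timing harness. The sketch below is an assumption about how such a measurement might be set up, not the code used for the table: `measure_throughput` and `batched` are hypothetical helpers, and `fake_write` is a stand-in for whatever function actually sends a batch to Neo4j.

```python
import time


def batched(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def measure_throughput(write_batch, items, batch_size):
    """Return items written per second when flushing in batches of `batch_size`.

    `write_batch` is any callable that accepts a list of items; in a real test
    it would wrap the database write for one batch.
    """
    start = time.perf_counter()
    for batch in batched(items, batch_size):
        write_batch(batch)
    elapsed = time.perf_counter() - start
    return len(items) / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    written = []

    def fake_write(batch):
        # Stand-in for a database write; replace with the real call.
        written.extend(batch)

    edges = list(range(1000))
    for size in (1, 5, 10, 100):
        written.clear()
        rate = measure_throughput(fake_write, edges, size)
        print(f"batch size {size}: ~{rate:.0f} items/s")
```

With a real writer, running each batch size several times and on a fresh database would make the numbers easier to compare.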

veleritas, Feb 15 '17 19:02

@veleritas your benchmarks are awesome. Let's address this after #6 is merged. The easy fix would be changing the default value for edge_queue to 10 or another small value.

But happy to consider a more dramatic code refactoring if you think it's worth it.

dhimmel, Feb 15 '17 19:02

@veleritas I changed the defaults in https://github.com/dhimmel/hetio/commit/d026d13685b819c415cbdf4b785b1dbdaca9cff1. I made you the commit author, since you did all of the hard work!

dhimmel, Mar 02 '17 21:03

I've been experimenting with the batch CSV import provided by Neo4j (version 3), and it seems so far that the batch import can be made to work with Rephetio v2.0. Current initial testing shows that a half scale Rephetio with 1.2 million edges and ~20k nodes can be imported into Neo4j in roughly 10 seconds.

veleritas, Mar 07 '17 21:03

> batch import can be made to work with Rephetio v2.0

@veleritas awesome, I'd be really interested in getting this feature implemented in hetio. Happy to review a pull request or help out in any way you see fit.

dhimmel, Mar 09 '17 20:03

At the moment the batch CSV import is implemented as a tack-on script to the integrate repository. It basically sidesteps the hetio export_neo4j() function completely.

Process:

  1. This script creates the CSV files needed after the integrate.ipynb script finishes.
  2. This script then downloads neo4j and makes the necessary configuration modifications to allow access from python with py2neo.
  3. Finally, a bash script loads everything into neo4j.
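The final loading step might look something like the sketch below, which only assembles the command line rather than running it. The actual bash script is not shown in this thread, so everything here is an assumption: `build_import_command` is a hypothetical helper, the file names are placeholders, and the `neo4j-admin import` flags (`--database`, `--nodes`, `--relationships`) reflect the Neo4j 3.x tooling as I understand it and should be checked against the installed version.

```python
import shlex


def build_import_command(node_files, relationship_files, database="graph.db"):
    """Assemble a `neo4j-admin import` command as an argument list.

    Hypothetical helper: flag names are assumed from Neo4j 3.x and may differ
    in other releases. Nothing is executed here.
    """
    command = ["neo4j-admin", "import", f"--database={database}"]
    # One flag per CSV file, nodes first and relationships after.
    command += [f"--nodes={path}" for path in node_files]
    command += [f"--relationships={path}" for path in relationship_files]
    return command


if __name__ == "__main__":
    # Placeholder file names for illustration only.
    cmd = build_import_command(
        ["nodes_Gene.csv", "nodes_Compound.csv"],
        ["edges_BINDS.csv"],
    )
    # Print a shell-safe version that could be pasted into a bash script.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Building the argument list in Python keeps the file names generated in step 1 and the import command in one place, which may help when the set of node and relationship types changes between networks.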

At the moment things seem to work just fine, and neo4j has had no complaints so far, but I haven't tested full compatibility with the entire network yet. I'm going to need some more time to figure out if the pipeline will work with the entire network before I'm ready to push anything back into hetio. This method also discards a lot of the metadata you put into the network, so I'm not sure if that's a concern for you.

veleritas, Mar 09 '17 20:03

> This method also discards a lot of the metadata you put into the network, so I'm not sure if that's a concern for you.

The main reason I avoided the CSV import is that I didn't see a way to losslessly export a graph (in its entirety) like hetio.export_neo4j(). Code in hetio should be general (work for more than just Hetionet v1.0). However, it's fine to have a lossy export that documents its limitations.

Let's revisit this at a later time when we know more. I would say, if you find yourself constantly copying and pasting the CSV import code, then it would make sense to move it upstream.

dhimmel, Mar 10 '17 18:03

Hi Daniel,

Just wanted to ask if we should be revisiting the Neo4j integration code issue. I've since switched over to using the built-in Neo4j CSV loader since it is so much faster, and I haven't had any problems from the loss of license metadata. It has also been working with the full network. Latest code is here. The CSV import method has also been easy to adapt to the matrix DWPC calculation method by @mmayers12.

Again, we've been discussing on our end that any changes we make to the project should be integrated back upstream if that makes sense, so let us know if you're interested in these changes, or if we need to tweak them further before you're willing to pull them upstream.

Toby

veleritas, Aug 02 '17 17:08

Hey @veleritas, I see two options for incorporating CSV Neo4j import functionality into hetio.

  1. Create a function like export_neo4j_via_csv, presumably in hetio.neo4j. This would not replace export_neo4j but instead add another route to Neo4j import for users who don't require edge properties to be maintained and would like the speed increase.

  2. Add a note in the export_neo4j or in a README that references your code for CSV import. The note would direct users looking for a speedup to check out CSV import.
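As a rough idea of what the lossy part of option 1 could involve, the sketch below writes relationships without their properties and reports which property names were dropped, so the limitation can be documented. It is not hetio code: the edge layout (dicts with `source`, `target`, `kind`, and optional `properties`) and the helper name `write_edges_without_properties` are assumptions for illustration. The `:START_ID`, `:END_ID`, and `:TYPE` header fields follow the Neo4j import tool's convention.

```python
import csv
import io


def write_edges_without_properties(edges, handle):
    """Write a relationship CSV that keeps only endpoints and type.

    Hypothetical helper, not part of hetio. Returns the sorted names of the
    edge properties that were discarded so callers can log or document them.
    """
    writer = csv.writer(handle)
    writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    dropped = set()
    for edge in edges:
        writer.writerow([edge["source"], edge["target"], edge["kind"]])
        # Record property names that this lossy export leaves out.
        dropped.update(edge.get("properties", {}))
    return sorted(dropped)


if __name__ == "__main__":
    # Made-up example edges for illustration only.
    example_edges = [
        {"source": "Compound::DB1", "target": "Gene::1", "kind": "BINDS",
         "properties": {"license": "CC0", "sources": "example"}},
        {"source": "Compound::DB1", "target": "Gene::2", "kind": "BINDS"},
    ]
    buffer = io.StringIO()
    lost = write_edges_without_properties(example_edges, buffer)
    print(buffer.getvalue())
    print("Dropped edge properties:", lost)
```

Returning the dropped names is one way to make the loss explicit to the caller instead of silently discarding metadata.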

If you're willing to do the work to submit a PR for option 1, then this is preferable. However, we need to make sure implementations in hetio are modular... so there may be some additional work needed to convert the code from your notebooks. Anyways, I'm happy to help with review and some implementation if needed. This would be a valuable feature, and it would be nice for you to not have to maintain an independent CSV import patch.

dhimmel, Aug 03 '17 16:08