Speeding up the Neo4j import
Hi Daniel,
What was the reasoning behind importing the nodes and edges of the hetnet using the py2neo interface? I'm finding that the import process is quite slow even for small sized networks, and was wondering whether I should look into the batch CSV import that neo4j comes with.
From my experiments it seems like importing 20000 nodes and 22000 edges into neo4j with the current code takes roughly 45 minutes on an AWS instance with 8 cores and 32 GB RAM. At this speed it would basically take forever to load the entire network, so I'm wondering if I'm missing anything here.
Best, Toby
> At this speed it would basically take forever to load the entire network, so I'm wondering if I'm missing anything here.
Yeah it took ~10 hours.
> I'm finding that the import process is quite slow even for small sized networks, and was wondering whether I should look into the batch CSV import that neo4j comes with.
I personally didn't spend too much time optimizing because I didn't plan on running this Neo4j import step too often. If you're running it a lot, you may want to look into solutions.
The problem with the batch TSV import is that TSVs are bad at representing properties that only exist for a single node or relationship type. However, if you don't care about properties (besides name, which every node has), you could use the TSV import. Or perhaps you can make a TSV where missing values don't get written as properties. Or make several TSVs (one for each node and relationship type). Looking at the import tool doc, this could be the way to go.
Ah now I remember another reason I didn't use the import tool. I don't think I was able to fully automate the import... i.e. there were network-specific commands that had to be written... therefore it would decrease the versatility of the code. I wanted hetio to work for any hetnet, not just Hetionet. Not sure if it's now possible to use the import tool for any hetio network.
If you want to use the import tool, the Hetionet TSVs could be a good place to start and get benchmarks.
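The "one file per node type" idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node dictionaries, the `nodes_to_csv_by_type` helper, and the example identifiers are hypothetical and not part of hetio. The header fields `:ID` and `:LABEL` follow the Neo4j import tool's CSV header convention. Whether the tool skips empty fields rather than storing them as properties should be checked against its documentation.

```python
import csv
import io
from collections import defaultdict


def nodes_to_csv_by_type(nodes):
    """Group nodes by type and return {node_type: CSV text}.

    Each table only gets the property columns that occur for that type,
    so a property specific to one type never becomes an empty column
    in another type's file.
    """
    by_type = defaultdict(list)
    for node in nodes:
        by_type[node["type"]].append(node)

    csv_by_type = {}
    for node_type, members in by_type.items():
        # Union of the property names seen on this node type only
        prop_names = sorted({key for n in members for key in n["properties"]})
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(["identifier:ID", "name", ":LABEL"] + prop_names)
        for n in members:
            row = [n["identifier"], n["name"], node_type]
            # A missing property is written as an empty field
            row += [n["properties"].get(p, "") for p in prop_names]
            writer.writerow(row)
        csv_by_type[node_type] = buffer.getvalue()
    return csv_by_type


# Hypothetical example data; identifiers are prefixed by type so they
# stay unique across files.
example_nodes = [
    {"type": "Gene", "identifier": "Gene::1", "name": "A1BG",
     "properties": {"chromosome": "19"}},
    {"type": "Gene", "identifier": "Gene::2", "name": "A2M",
     "properties": {}},
    {"type": "Disease", "identifier": "Disease::DOID:1",
     "name": "example disease", "properties": {"source": "DO"}},
]
tables = nodes_to_csv_by_type(example_nodes)
print(tables["Gene"])
```

The same grouping would apply to relationships, with one file per relationship type. Switching to tab-separated output only requires passing `delimiter="\t"` to `csv.writer`.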
After some testing it turns out that it is actually much faster to import edges individually into neo4j when using py2neo version 3. As described in this link, it seems that py2neo version 3 uses subgraphs in order to make updates to neo4j.
Effect of batch size on import speed:
| Imported object | Batch size | Import speed |
|---|---|---|
| Node | 1 | ~100/s |
| Node | 10 | ~400/s |
| Node | 100 | ~500/s |
| Node | 200 | ~550/s |
| Node | 500 | ~530/s |
| Node | 1000 | ~500/s |
| Edge | 1 | ~120/s |
| Edge | 5 | ~90/s |
| Edge | 10 | ~100/s |
| Edge | 100 | ~7/s |
Based on these results, it seems that updating neo4j with a subgraph containing multiple edges is actually slower than updating the graph one edge at a time. I suspect that this is because the underlying py2neo code converts the subgraph back into individual edges anyways, and therefore spends time making redundant calculations. All speed estimates are approximate, and testing was done on an AWS m4.2xlarge instance with EBS.
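For anyone wanting to repeat this kind of measurement, a minimal timing harness might look like the sketch below. It is not the code used for the table above: `batched` and `measure_rate` are hypothetical helpers, and `write_batch` is a stand-in for whatever function actually sends a batch to Neo4j (for example, one that builds a subgraph and creates it through py2neo). Here it just appends to a list, so the printed rates say nothing about Neo4j itself.

```python
import time
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def measure_rate(items, batch_size, write_batch):
    """Return items written per second for a given batch size."""
    items = list(items)
    start = time.perf_counter()
    for batch in batched(items, batch_size):
        write_batch(batch)
    elapsed = time.perf_counter() - start
    return len(items) / elapsed if elapsed > 0 else float("inf")


# Stand-in writer: replace written.extend with a real database call
written = []
for size in (1, 10, 100):
    written.clear()
    rate = measure_rate(range(250), size, written.extend)
    print(f"batch size {size}: {rate:.0f} items/s (stand-in writer)")
```

Running the same loop separately for nodes and edges, with a real writer, would produce a table like the one above.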
@veleritas your benchmarks are awesome. Let's address this after #6 is merged. The easy fix would be changing the default value for edge_queue to 10 or another small value.
But happy to consider a more dramatic code refactoring if you think it's worth it.
@veleritas I changed the defaults in https://github.com/dhimmel/hetio/commit/d026d13685b819c415cbdf4b785b1dbdaca9cff1. I made you the commit author, since you did all of the hard work!
I've been experimenting with the batch CSV import provided by Neo4j (version 3), and so far it seems that the batch import can be made to work with Rephetio v2.0. Initial testing shows that a half-scale Rephetio with 1.2 million edges and ~20k nodes can be imported into Neo4j in roughly 10 seconds.
> batch import can be made to work with Rephetio v2.0
@veleritas awesome, I'd be really interested in getting this feature implemented in hetio. Happy to review a pull request or help out in any way you see fit.
At the moment the batch CSV import is implemented as a tack-on script to the integrate repository. It basically sidesteps the hetio export_neo4j() function completely.
Process:
- This script creates the CSV files needed after the `integrate.ipynb` script finishes.
- This script then downloads neo4j and makes the necessary configuration modifications to allow access from python with `py2neo`.
- Finally, a bash script loads everything into neo4j.
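As a rough illustration of the first and last steps (not the actual scripts from the integrate repository), the sketch below writes a relationship table and assembles the import command without running it. The helper names, file names, and the example edge are hypothetical. The `:START_ID`, `:END_ID`, and `:TYPE` header fields and the `--into`, `--nodes`, and `--relationships` options are those of the `neo4j-import` tool shipped with Neo4j 3.x, as far as I know; check them against the installed version.

```python
import csv
import io


def edges_to_csv(edges):
    """Return CSV text for (source_id, target_id, rel_type) tuples.

    Only the endpoints and the relationship type are written, so any
    edge properties are discarded.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    for source_id, target_id, rel_type in edges:
        writer.writerow([source_id, target_id, rel_type])
    return buffer.getvalue()


def build_import_command(database_dir, node_files, edge_files,
                         tool="neo4j-import"):
    """Build the argument list for the import tool (it is not executed)."""
    command = [tool, "--into", database_dir]
    for path in node_files:
        command += ["--nodes", path]
    for path in edge_files:
        command += ["--relationships", path]
    return command


# Hypothetical example edge and file names
edge_csv = edges_to_csv([("Gene::1", "Disease::DOID:1", "ASSOCIATES_DaG")])
command = build_import_command(
    "graph.db",
    ["nodes_Gene.csv", "nodes_Disease.csv"],
    ["edges.csv"],
)
print(" ".join(command))
```

The resulting list could be passed to `subprocess.run` or written into a bash script. Because the command is built from the list of files, it does not have to be hand-written per network.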
At the moment things seem to work just fine, and neo4j has had no complaints so far, but I haven't tested full compatibility with the entire network yet. I'm going to need some more time to figure out if the pipeline will work with the entire network before I'm ready to push anything back into hetio. This method also discards a lot of the metadata you put into the network, so I'm not sure if that's a concern for you.
> This method also discards a lot of the metadata you put into the network, so I'm not sure if that's a concern for you.
The main reason I avoided the CSV import is that I didn't see a way to losslessly export a graph (in its entirety) like hetio.export_neo4j(). Code in hetio should be general (work for more than just Hetionet v1.0). However, it's fine to have a lossy export that documents its limitations.
Let's revisit this at a later time when we know more. I would say, if you find yourself constantly copying and pasting the CSV import code, then it would make sense to move it upstream.
Hi Daniel,
Just wanted to ask if we should be revisiting the Neo4j integration code issue. I've since switched over to using the built-in Neo4j CSV loader since it is so much faster, and haven't had any issues with the loss of license metadata. It's been working with the full network so far. Latest code is here. The CSV import method has also been easy to adapt to the matrix DWPC calculation method by @mmayers12.
Again we've been discussing on our end that any changes we make to the project should be integrated back upstream if it makes sense, so let us know if you're interested in these changes, or if we need to tweak it slightly further before you're willing to pull upstream.
Toby
Hey @veleritas, I see two options for incorporating CSV Neo4j import functionality into hetio.
- Create a function like `export_neo4j_via_csv`, presumably in `hetio.neo4j`. This would not replace `export_neo4j` but instead add another route to Neo4j import for users who don't require edge properties to be maintained and would like the speed increase.
- Add a note in the `export_neo4j` or in a README that references your code for CSV import. The note would direct users looking for a speedup to check out CSV import.
If you're willing to do the work to submit a PR for option 1, then this is preferable. However, we need to make sure implementations in hetio are modular... so there may be some additional work needed to convert the code from your notebooks. Anyways, I'm happy to help with review and some implementation if needed. This would be a valuable feature, and it would be nice for you to not have to maintain an independent CSV import patch.