Improving dataset loader and preprocess script
There are some severe speed issues with the preprocess and data loader scripts, which often make benchmarking a rather tedious process. I'll work on clearing up the technical debt here (mostly mine 😅).
Starting with the preprocess script, the current times are as follows (noting that I've already done some optimization on file loading). This is from the sx-stackoverflow dataset, cut off at 20M edges.
```
Namespace(dataset='sx-stackoverflow', base=10000000, percent_change=2.0, cutoff_time=20000000)
[CHECKPOINT]::FILE_PARSING_COMPLETED in 31.958895921707153s
[CHECKPOINT]::PREPROCESS_GRAPH in 661.6636250019073s
[CHECKPOINT]::JSON_DUMP in 95.33747601509094s
```
I tried tweaking this a bit more, but I think these numbers are acceptable, so I'm moving on.
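For context, those checkpoints are just wall-clock timings printed around each stage of the script. A minimal sketch of the kind of helper that produces them (the `checkpoint` context manager and the stage code in the comments are hypothetical; only the log format is taken from the output above):

```python
import time
from contextlib import contextmanager

@contextmanager
def checkpoint(name):
    # Prints elapsed wall time in the same [CHECKPOINT]::<NAME> format as the logs above.
    start = time.time()
    yield
    print(f"[CHECKPOINT]::{name} in {time.time() - start}s")

# Hypothetical usage around the three stages being timed:
# with checkpoint("FILE_PARSING_COMPLETED"):
#     edges = parse_edge_list(path)          # placeholder name
# with checkpoint("PREPROCESS_GRAPH"):
#     batches = build_update_batches(edges)  # placeholder name
# with checkpoint("JSON_DUMP"):
#     json.dump(batches, out_file)
```

This also makes it easy to see that PREPROCESS_GRAPH dominates (~84% of the total runtime), so that's where any further optimization effort would have to go.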
Okay, so I have identified a few repeated operations in the data loading pipeline. Right now the flow is:
Dataset (edge list) ----(Preprocessor)--> JsonData (add/delete list) ----(Data loader)--> GraphObj (edge list) ---(GraphObjConstructor)--> GPMA format (add/delete list)
I think we can drop the last two conversions and just pipe in the original edge list and add/delete lists directly. @nithinmanoj10, any objections to this approach?
I'll try making the changes to see if there is some dependency I missed.
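Roughly, the shortcut looks like this (a sketch only; the JSON schema and the GPMA entry-point names below are assumptions for illustration, not the actual API):

```python
import json

def load_updates(json_path):
    # Assumed schema: the preprocessor's JSON dump holds per-batch "add" and
    # "delete" edge lists. This is an assumption for illustration, not the
    # confirmed output format.
    with open(json_path) as f:
        data = json.load(f)
    return data["add"], data["delete"]

# Proposed flow: skip the GraphObj / GraphObjConstructor round-trip and hand
# the add/delete lists to the GPMA builder as-is. gpma_insert_edges and
# gpma_delete_edges are placeholder names for whatever the GPMA API exposes.
# add_batches, delete_batches = load_updates("sx-stackoverflow.json")
# for adds, deletes in zip(add_batches, delete_batches):
#     gpma_insert_edges(gpma, adds)
#     gpma_delete_edges(gpma, deletes)
```

The point is that the preprocessor already emits add/delete lists, so converting back to an edge list (GraphObj) only to reconstruct add/delete lists again (GraphObjConstructor) is pure overhead.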
In which file did you benchmark the preprocessing steps? @JoelMathewC
The benchmarks are from benchmarking/dataset/preprocessing/preprocess_temporal_data.py. I'll push my changes soon; I'm still running a few tests myself.