
Improving dataset loader and preprocess script

Open JoelMathewC opened this issue 2 years ago • 4 comments

There are some severe speed issues with the preprocess and data loader scripts, and this often makes benchmarking a rather tedious process. I'll work on clearing up the technical debt here (mostly mine 😅).

JoelMathewC avatar Jan 20 '24 10:01 JoelMathewC

Starting with the preprocess script, the current time taken is as follows (note that I've already done some optimization here for loading the file). This is from the sx-stackoverflow dataset with a cutoff at 20M edges.

Namespace(dataset='sx-stackoverflow', base=10000000, percent_change=2.0, cutoff_time=20000000)
[CHECKPOINT]::FILE_PARSING_COMPLETED in 31.958895921707153s
[CHECKPOINT]::PREPROCESS_GRAPH in 661.6636250019073s
[CHECKPOINT]::JSON_DUMP in 95.33747601509094s
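
For reference, here is a minimal sketch of how checkpoint logging like the above could be produced with plain `time.time()` calls; the `checkpoint` helper and the commented stage functions are illustrative assumptions, not the script's actual code:

```python
import time

_last_checkpoint = time.time()

def checkpoint(label: str) -> None:
    """Print wall-clock time elapsed since the previous checkpoint."""
    global _last_checkpoint
    now = time.time()
    print(f"[CHECKPOINT]::{label} in {now - _last_checkpoint}s")
    _last_checkpoint = now

# Placed after each major stage of the preprocess script, e.g.:
#   parse_file(...);        checkpoint("FILE_PARSING_COMPLETED")
#   preprocess_graph(...);  checkpoint("PREPROCESS_GRAPH")
#   json.dump(...);         checkpoint("JSON_DUMP")
```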

I tried tweaking this a bit, but I think the performance is acceptable, so I'm moving on.

JoelMathewC avatar Jan 20 '24 10:01 JoelMathewC

Okay, so I have identified a few repeated operations in the data loading pipeline. Right now the flow is:

Dataset (edge list) --(Preprocessor)--> JsonData (add/delete list) --(Data loader)--> GraphObj (edge list) --(GraphObjConstructor)--> GPMA format (add/delete list)

I think we can drop the last two conversions and just pipe the original edge list and add/delete list straight through. @nithinmanoj10, any objections to this approach?

I'll try making the changes to see if there is some dependency I missed.
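
To make the proposal concrete, here is a minimal sketch of deriving add/delete lists directly from consecutive edge-list snapshots; the names and data shapes are assumptions for illustration, not the actual STGraph API:

```python
from typing import List, Set, Tuple

Edge = Tuple[int, int]

def edge_deltas(
    snapshots: List[Set[Edge]],
) -> List[Tuple[List[Edge], List[Edge]]]:
    """For each consecutive pair of snapshots, compute the
    (additions, deletions) needed to move from one to the next."""
    deltas = []
    for prev, curr in zip(snapshots, snapshots[1:]):
        deltas.append((sorted(curr - prev), sorted(prev - curr)))
    return deltas

# The base snapshot's edge list plus these deltas would be everything
# the GPMA-format constructor needs, skipping the intermediate
# JsonData -> GraphObj -> add/delete-list round trip.
if __name__ == "__main__":
    snaps = [{(0, 1), (1, 2)}, {(1, 2), (2, 3)}]
    print(edge_deltas(snaps))  # [([(2, 3)], [(0, 1)])]
```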

JoelMathewC avatar Jan 21 '24 07:01 JoelMathewC

> Starting with the preprocess script, the current time taken is as follows (note that I've already done some optimization here for loading the file). This is from the sx-stackoverflow dataset with a cutoff at 20M edges.
>
> Namespace(dataset='sx-stackoverflow', base=10000000, percent_change=2.0, cutoff_time=20000000)
> [CHECKPOINT]::FILE_PARSING_COMPLETED in 31.958895921707153s
> [CHECKPOINT]::PREPROCESS_GRAPH in 661.6636250019073s
> [CHECKPOINT]::JSON_DUMP in 95.33747601509094s
>
> I tried tweaking this a bit, but I think the performance is acceptable, so I'm moving on.

In which file did you benchmark the preprocessing steps? @JoelMathewC

nithinmanoj10 avatar Jan 21 '24 12:01 nithinmanoj10

I'm running tests on benchmarking/dataset/preprocessing/preprocess_temporal_data.py. I'll push my changes soon; I'm still running a few tests myself.

JoelMathewC avatar Jan 22 '24 04:01 JoelMathewC