
Issues while preprocessing data

Open borismartirosyandenovo opened this issue 2 years ago • 6 comments

When I try to preprocess the data, I get this error: `Segmentation fault (core dumped)`. What could be causing it?

borismartirosyandenovo avatar Sep 25 '23 21:09 borismartirosyandenovo

This is usually a sign of running out of RAM, but I am unclear whether it is due to the subgraph_index function or a problem with the dataset itself. Could you provide more details, e.g. which dataset you are preprocessing, and whether the problem occurs when you use `data_generate.features_generating()` or `data_generate.dataset_creating(target_name=target)`?
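For reference, a minimal sketch of isolating the two stages (how `data_generate` is constructed depends on your setup, so that part is a stub; only `features_generating()` and `dataset_creating(target_name=...)` are the actual calls):

```python
import pandas as pd

def make_data_generate(df):
    # Placeholder, not an FFiNet API: construct your data_generate object here.
    raise NotImplementedError("build your FFiNet data_generate object from df")

df = pd.read_csv("molecules.csv")   # placeholder path; columns: smiles, energy
data_generate = make_data_generate(df)

print("stage 1: features_generating", flush=True)
data_generate.features_generating()   # a crash here implicates feature generation

print("stage 2: dataset_creating", flush=True)
data_generate.dataset_creating(target_name="energy")   # a crash here implicates dataset creation
```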

fate1997 avatar Sep 26 '23 11:09 fate1997

Hi, thank you for your response.

I am sorry, but I cannot share the data with you because it is confidential. I can say that the data is a pandas dataframe consisting of a `smiles` column and an `energy` column (the target, with continuous values). There are 70,000 data points in total. The problem occurs when I use the `data_generate.features_generating()` function (the data layout is sketched below). Thank you in advance!
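For concreteness, the layout looks something like this (values invented, since the real data is confidential):

```python
import pandas as pd

# Invented stand-in for the confidential dataset: a SMILES column plus a
# continuous target; the real dataframe has ~70,000 such rows.
df = pd.DataFrame({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O"],
    "energy": [-0.52, -1.13, -0.87],  # made-up values
})
```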

borismartirosyandenovo avatar Sep 26 '23 14:09 borismartirosyandenovo

Could you split the dataset into several smaller datasets and see whether the problem still occurs (see the sketch below)? I suspect this problem may be due to the high computational cost of the feature-generating step. To my knowledge, the subgraph_index computation and the hydrogen-bond donor/acceptor atom features take the longest time during processing, so you could also try skipping the hydrogen-bond-related features (this has little influence on performance).
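Something like this sketch could localize the failure (`run_preprocessing` is a placeholder for whatever wraps `features_generating()` in your setup):

```python
import numpy as np
import pandas as pd

def split_and_probe(df: pd.DataFrame, run_preprocessing, n_chunks: int = 10) -> None:
    """Run preprocessing chunk by chunk to localize a crash.

    run_preprocessing is a placeholder for your own wrapper around
    data_generate.features_generating(); it is not an FFiNet API.
    """
    for i, chunk in enumerate(np.array_split(df, n_chunks)):
        # flush so the last message survives if the process segfaults
        print(f"chunk {i}: rows {chunk.index[0]}..{chunk.index[-1]}", flush=True)
        run_preprocessing(chunk)  # if the process dies here, the bad rows are in this chunk
```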

fate1997 avatar Sep 26 '23 16:09 fate1997

Ok, will try, thank you!

borismartirosyandenovo avatar Sep 26 '23 16:09 borismartirosyandenovo

I have tried reducing the number of data points to 10K, but it happens again, and I don't think memory is the problem because I have 32 GB of RAM. Do you have any other suggestions?

borismartirosyandenovo avatar Sep 26 '23 16:09 borismartirosyandenovo

Could you reduce the number of data points to 1k, or even one hundred, and try again? (I understand you have plenty of memory, but I still suspect the problem is due to a memory explosion.) If the problem still occurs, you could send me the problematic data (the SMILES alone would be enough), and I could have a look. Meanwhile, you could also debug by commenting out the lines related to subgraph_index in the data_generating.py file and hydrogen_bond_features_generator in the atom_features.py file, to find the underlying cause of the problem. One way to narrow things down is sketched below.
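Since a segmentation fault kills the whole interpreter, `try/except` cannot catch it; one workaround is to probe each SMILES in a throwaway subprocess so the driver survives the crash. This sketch assumes a hypothetical helper script `probe_one.py` that runs feature generation on the single SMILES passed as `argv[1]` and exits 0 on success:

```python
import subprocess
import sys

import pandas as pd

df = pd.read_csv("molecules.csv")  # placeholder path; columns: smiles, energy

for i, smi in enumerate(df["smiles"]):
    # probe_one.py is a hypothetical helper you would write: it should run the
    # feature generation (e.g. features_generating()) on sys.argv[1] alone.
    result = subprocess.run([sys.executable, "probe_one.py", smi])
    if result.returncode != 0:  # a segfault shows up as a negative return code
        print(f"row {i} crashed (returncode={result.returncode}): {smi}")
```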

fate1997 avatar Sep 27 '23 03:09 fate1997