Issues while preprocessing data
While preprocessing the data, I get this error: Segmentation fault (core dumped). What could be causing it?
This usually indicates running out of RAM, but I am not sure whether it is caused by the subgraph_index function or by the dataset itself. Could you provide more details about the problem, e.g. which dataset you are preprocessing and whether the error occurs when you call data_generate.features_generating() or data_generate.dataset_creating(target_name=target)?
Hi, thank you for your response.
I am sorry, but I cannot share the data with you because it is confidential. I can say that the data is a pandas DataFrame with two columns, smiles and energy (a continuous-valued target), and about 70,000 data points in total. The problem occurs when I call the data_generate.features_generating() function. Thank you in advance!
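For reference, a minimal sketch of how the dataframe is structured (the values here are placeholders; the real data has about 70,000 rows and cannot be shared):

```python
import pandas as pd

# Placeholder rows only; the real dataframe is confidential.
df = pd.DataFrame({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O"],
    "energy": [-0.12, 0.45, -1.03],  # continuous-valued target
})

# The segmentation fault occurs during this step:
# data_generate.features_generating()
```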
Could you split the dataset into several smaller datasets and see whether the problem still occurs? I suspect this problem is due to the heavy computational cost of the feature generation process; to my knowledge, the subgraph_index step and the hydrogen_bond donor/acceptor features among the atom features take the longest to compute. You could also try not generating the hydrogen_bond related features (this has little influence on performance).
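For example, a rough sketch of splitting the dataframe into parts (this is not a utility from this repository; the file path and chunk size are placeholders, and the feature-generation call should be adapted to your own setup):

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical path; columns: smiles, energy

chunk_size = 1000
for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size]
    chunk.to_csv(f"data_part_{start // chunk_size}.csv", index=False)
    # Run feature generation on each part separately, e.g. by building the
    # generator from this chunk and calling data_generate.features_generating(),
    # and note which part (if any) triggers the crash.
```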
Ok, will try, thank you!
I have tried reducing the number of data points to 10K, but the crash happens again, and I don't think memory is the problem because I have 32 GB of RAM. Do you have any other suggestions?
Could you reduce the number of data points to 1k, or even one hundred, and try again? (I understand you have plenty of memory, but I still suspect the problem is a memory blow-up.) If the problem still occurs, you could send me the offending data (the smiles alone would be enough), and I could have a look. Meanwhile, you could also debug by commenting out the lines related to subgraph_index in the data_generating.py file and hydrogen_bond_features_generator in the atom_features.py file, to narrow down the underlying cause of this problem.
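Independently of this code, one quick sanity check (assuming RDKit is installed; this is not a step from this repository) is to look for SMILES that fail to parse, since a single malformed entry is a common trigger for crashes in native extension code:

```python
import pandas as pd
from rdkit import Chem

df = pd.read_csv("data.csv")  # hypothetical path; columns: smiles, energy

# Collect rows whose SMILES RDKit cannot parse.
bad_rows = [(idx, smi) for idx, smi in df["smiles"].items()
            if Chem.MolFromSmiles(smi) is None]

print(f"{len(bad_rows)} unparsable SMILES")
for idx, smi in bad_rows[:20]:
    print(idx, smi)
```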