Calculations in 00.scf do not converge, so training cannot continue (RuntimeError: No system is avaliable)

Open yycx1111 opened this issue 10 months ago • 8 comments

I am using deepks for multi-label training. According to the log.data in 00.scf of each iteration, fewer and fewer SCF calculations converge, until training can no longer continue. In iter.02, none of the configurations in 00.scf converges, which causes an error in 01.train of iter.02. Do you have any suggestions for setting and tuning the training parameters?

This is the error message.

data_train/group.00 no system.raw, infer meta from data

data_train/group.00 reset batch size to 0

ignore empty dataset: data_train/group.00

Traceback (most recent call last):
  File "", line 198, in _run_module_as_main
  File "", line 88, in _run_code
  File "/home/zwj/miniconda3/envs/deepks/lib/python3.12/site-packages/deepks/model/train.py", line 303, in cli()
  File "/home/zwj/miniconda3/envs/deepks/lib/python3.12/site-packages/deepks/main.py", line 71, in train_cli
    main(**argdict)
  File "/home/zwj/miniconda3/envs/deepks/lib/python3.12/site-packages/deepks/model/train.py", line 270, in main
    g_reader = GroupReader(train_paths, **data_args)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zwj/miniconda3/envs/deepks/lib/python3.12/site-packages/deepks/model/reader.py", line 207, in __init__
    raise RuntimeError("No system is avaliable")
RuntimeError: No system is avaliable
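
The log lines before the traceback suggest that every frame in data_train/group.00 was dropped, most likely because none of them is flagged as converged. Below is a minimal sketch for checking this before launching 01.train. It assumes each group directory stores one convergence flag per frame in a file named conv.npy, which is an assumption about the data layout. The helpers `convergence_rate`, `load_flags` and `summarize` are hypothetical and not part of deepks-kit.

```python
from pathlib import Path


def convergence_rate(flags):
    """Return the fraction of truthy flags; 0.0 for an empty list."""
    flags = list(flags)
    if not flags:
        return 0.0
    return sum(bool(f) for f in flags) / len(flags)


def load_flags(group_dir, name="conv.npy"):
    """Load per-frame flags from <group_dir>/<name> (assumed file name)."""
    import numpy as np  # imported lazily so the other helpers need no numpy

    return np.load(Path(group_dir) / name).reshape(-1).tolist()


def summarize(rates):
    """Split {group: rate} into usable groups and groups with no converged frame."""
    usable = sorted(g for g, r in rates.items() if r > 0)
    empty = sorted(g for g, r in rates.items() if r == 0)
    return usable, empty


if __name__ == "__main__":
    # Hard-coded flags stand in for files loaded with load_flags().
    rates = {
        "group.00": convergence_rate([False, False, False]),
        "group.01": convergence_rate([True, False, True, True]),
    }
    usable, empty = summarize(rates)
    print("usable:", usable)
    print("no converged frames:", empty)
```

If every training group ends up in the second list, the reader has nothing left to load, which would be consistent with the error above.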

These are the log.data files of iter.init, iter.00, iter.01, and iter.02 (four screenshots).

yycx1111, Mar 19 '25 13:03

Have you tried the settings suggested in the official documentation?

Image

ErjieWu, Mar 28 '25 09:03

No, I had not used those settings before. I am now testing with the settings suggested in the official documentation and waiting for the results.

yycx1111, Mar 29 '25 01:03

After applying the settings from the official documentation in init_train, the following error occurs in 01.train. How can I resolve it so that training can start?

data_train/group.00 no system.raw, infer meta from data

data_test/group.01 no system.raw, infer meta from data

Traceback (most recent call last):
  File "", line 198, in _run_module_as_main
  File "", line 88, in _run_code
  File "/home/zwj/miniconda3/envs/deepks/lib/python3.12/site-packages/deepks/model/train.py", line 303, in cli()
  File "/home/zwj/miniconda3/envs/deepks/lib/python3.12/site-packages/deepks/main.py", line 71, in train_cli
    main(**argdict)
  File "/home/zwj/miniconda3/envs/deepks/lib/python3.12/site-packages/deepks/model/train.py", line 295, in main
    model = CorrNet(**model_args).double()
            ^^^^^^^^^^^^^^^^^^^^^
  File "/home/zwj/miniconda3/envs/deepks/lib/python3.12/site-packages/deepks/model/model.py", line 57, in warpper
    func(self, *args, **kwargs)
  File "/home/zwj/miniconda3/envs/deepks/lib/python3.12/site-packages/deepks/model/model.py", line 254, in __init__
    assert sum(self.shell_sec) == input_dim
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
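
The failing assertion compares the total length of the descriptor sections (`shell_sec`) with `input_dim`. One plausible reading is that the descriptors on disk were generated with a different projector basis than the one the model assumes. Below is a minimal sketch of that bookkeeping. It assumes that each radial function of angular momentum l contributes one block of 2l+1 descriptor values. The helper names and the example basis are hypothetical and not taken from deepks-kit.

```python
def shell_sections(basis):
    """Expand [(l, n_radial), ...] into one section of size 2l+1 per radial function."""
    return [2 * l + 1 for l, n_radial in basis for _ in range(n_radial)]


def check_input_dim(basis, input_dim):
    """Return (ok, expected), where expected is the descriptor length implied by basis."""
    expected = sum(shell_sections(basis))
    return expected == input_dim, expected


if __name__ == "__main__":
    # Hypothetical basis: 2 radial functions each for l = 0, 1, 2.
    basis = [(0, 2), (1, 2), (2, 2)]
    for dim in (18, 27):
        ok, expected = check_input_dim(basis, dim)
        print(f"input_dim={dim} expected={expected} match={ok}")
```

Comparing the last dimension of the stored descriptor arrays with the expected value in this way may help tell a basis mismatch apart from a problem in the model arguments.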

yycx1111, Mar 29 '25 05:03

After applying the settings from the official documentation, the SCF convergence rate improves in the first few iterations, but at iter.03 it drops to 0. Do you have any experience with or advice on this situation? Here is the log.data for each iteration (five screenshots): iter.init, iter.00, iter.01, iter.02, and iter.03.

yycx1111, Apr 01 '25 02:04

@yycx1111 You could try the following suggestions, in order:

  1. Try the latest code of deepks-kit. We recently fixed a bug in the band gap label.
  2. Lower the factors for force, stress and bandgap. I noticed that the energy error increases quite a lot after these labels are added to training.
  3. Change the parameters in scf_abacus.yaml, for example a smaller mixing_beta or a larger scf_nmax.
  4. Lower start_lr in params.yaml for iter.init and the later iterations.
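
Suggestions 2 to 4 all amount to scaling a few numbers in the configuration files. Below is a minimal sketch of applying such changes consistently, assuming the settings have already been loaded into nested dictionaries. Only `mixing_beta`, `scf_nmax` and `start_lr` are named in this thread. The factor key names, the nesting and all numeric values are placeholders to be checked against your own params.yaml and scf_abacus.yaml.

```python
import copy


def scale_values(config, factors):
    """Return a deep copy of config with every key listed in factors multiplied.

    Nested dictionaries are searched recursively; keys that are absent are ignored.
    """
    out = copy.deepcopy(config)

    def walk(node):
        for key, value in node.items():
            if isinstance(value, dict):
                walk(value)
            elif key in factors:
                scaled = value * factors[key]
                # Keep integer settings (e.g. iteration counts) as integers.
                node[key] = int(round(scaled)) if isinstance(value, int) else scaled

    walk(out)
    return out


if __name__ == "__main__":
    # Placeholder values and key layout, for illustration only.
    params = {
        "train_args": {"start_lr": 1e-4, "force_factor": 1.0, "stress_factor": 1.0},
        "scf_abacus": {"mixing_beta": 0.4, "scf_nmax": 50},
    }
    tuned = scale_values(
        params,
        {"start_lr": 0.5, "force_factor": 0.1, "stress_factor": 0.1,
         "mixing_beta": 0.5, "scf_nmax": 2},
    )
    print(tuned)
```

Changing one group of parameters at a time, in the order listed above, makes it easier to see which change actually affects the convergence rate.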

xuan112358, Apr 01 '25 13:04

Thank you for your advice. I will keep trying as you suggested. Is this the latest version of deepks?

deepks 0.2.dev338+gbf7175b pypi_0 pypi

yycx1111, Apr 01 '25 16:04

Oh! The bug is fixed in the develop branch of https://github.com/MCresearch/DeePKS-L/tree/develop. We recently updated the code in that repo.

xuan112358, Apr 02 '25 03:04

After updating the code, the SCF convergence rate for iter.init, iter.00, and iter.01 reaches 1, but the SCF results of the later iterations are still poor. My parameter tuning has not yet produced good results. While analysing the output files of 01.train, I have some questions:

  1. In the test.out file, which two energies do real_ene and pred_ene refer to, and why does real_ene differ between iterations?
  2. Is it possible to judge how 00.scf will behave in the next iteration from the output files of 01.train? The SCF calculation takes a long time, so adjusting the parameters and rerunning SCF each time is expensive. Do you have any suggestions for adjusting the parameters, or any reference to base the adjustments on?

yycx1111, Apr 15 '25 12:04