How to make a dataset for DeePKS training?

Open yycx1111 opened this issue 1 year ago • 1 comments

Details

I've read the tutorials on DeePKS training, but I still don't quite understand what it takes in practice to make a dataset for DeePKS training. Do I run an AIMD simulation with CP2K or VASP and then convert the resulting files to the training set format? Is there a more detailed tutorial on making a training set?

Have you read the online manual http://abacus.deepmodeling.com/en/latest/

[X] Yes, I have read the online manual.

Mar 21 '24 09:03 yycx1111

To make a dataset, you need to prepare at least two files: a system structure file and an energy file. We recommend the .npy format, which is easy to produce with the numpy.save() function. Follow the steps below to make your dataset.

Get the structures. Normally this is done by running an AIMD simulation. To make your model more robust, choose data points that cover as much as possible of the conformation space relevant to the problem you are studying.
Get the corresponding energies using a high-precision functional. Run an SCF calculation with the high-precision functional for each structure to obtain its energy, along with any other labels you care about (e.g. force, stress).
Change the format. The structure file (atom.npy) should be a .npy array with shape [nframes, natoms, 4]. Here nframes is the number of structures (data points) and natoms is the number of atoms per structure. The last dimension holds the nuclear charge followed by the xyz coordinates. The energy file (energy.npy) should have shape [nframes, 1]. Note that energies are in Hartree and lengths are in Bohr, so you may need to convert units.
Split the dataset. This step can be done together with the previous one. Randomly split the dataset into two parts, training data and test data, and prepare separate atom.npy and energy.npy files for each.
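The format change and the split can be sketched roughly as follows. This is only a minimal illustration, not the official DeePKS tooling. It assumes the upstream calculation reported coordinates in Angstrom and energies in eV, and that every frame has the same atoms in the same order. The function names, the train/test directory names, the 20% test fraction, and the synthetic water-like input are all hypothetical placeholders.

```python
import tempfile
from pathlib import Path

import numpy as np

# Standard conversion factors (CODATA 2018).
ANGSTROM_PER_BOHR = 0.529177210903
EV_PER_HARTREE = 27.211386245988


def build_dataset(charges, coords_angstrom, energies_ev):
    """Pack frames into the [nframes, natoms, 4] and [nframes, 1] layouts.

    charges:         (natoms,) nuclear charges, assumed identical in every frame
    coords_angstrom: (nframes, natoms, 3) coordinates, assumed to be in Angstrom
    energies_ev:     (nframes,) total energies, assumed to be in eV
    """
    coords_bohr = np.asarray(coords_angstrom, dtype=float) / ANGSTROM_PER_BOHR
    nframes, natoms, _ = coords_bohr.shape
    # Repeat the charges for every frame; they become column 0 of the last axis.
    z = np.asarray(charges, dtype=float)[None, :, None]
    z = np.broadcast_to(z, (nframes, natoms, 1))
    atom = np.concatenate([z, coords_bohr], axis=-1)
    energy = np.asarray(energies_ev, dtype=float).reshape(nframes, 1) / EV_PER_HARTREE
    return atom, energy


def split_dataset(atom, energy, test_fraction=0.2, seed=0):
    """Randomly split the frames into (train, test) pairs of (atom, energy)."""
    perm = np.random.default_rng(seed).permutation(atom.shape[0])
    ntest = max(1, int(round(test_fraction * len(perm))))
    test_idx, train_idx = perm[:ntest], perm[ntest:]
    return (atom[train_idx], energy[train_idx]), (atom[test_idx], energy[test_idx])


def save_split(out_dir, atom, energy):
    """Write atom.npy and energy.npy into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    np.save(out_dir / "atom.npy", atom)
    np.save(out_dir / "energy.npy", energy)


# Synthetic placeholder data: 10 frames of a 3-atom, water-like system.
rng = np.random.default_rng(0)
charges = [8, 1, 1]
coords = rng.normal(size=(10, 3, 3))                      # pretend Angstrom
energies_ev = rng.normal(loc=-2000.0, scale=0.1, size=10)  # pretend eV

atom, energy = build_dataset(charges, coords, energies_ev)
train, test = split_dataset(atom, energy)

# Written to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    save_split(Path(tmp) / "train", *train)
    save_split(Path(tmp) / "test", *test)
    reloaded = np.load(Path(tmp) / "train" / "atom.npy")
```

In a real workflow the synthetic arrays would be replaced by the coordinates and energies parsed from your own AIMD and SCF outputs, and the conversion factors would be dropped if those outputs are already in Bohr and Hartree.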

After these steps, you can go on to training and downstream tasks. Note that .npy is not the only option for the system structure file. In addition to energy, more property labels, such as force, can be added to the model. For more details, see the website: https://deepks-kit-qi.readthedocs.io/en/latest/label-preperation.html.
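If you add force labels, they presumably need the same unit treatment as the energies. The sketch below converts forces from eV/Angstrom to Hartree/Bohr, which is the unit implied by the Hartree and Bohr convention above. The [nframes, natoms, 3] layout and the function name are assumptions made for illustration, so check the linked documentation for the exact file name and shape that the training code expects.

```python
import numpy as np

# Standard conversion factors (CODATA 2018).
ANGSTROM_PER_BOHR = 0.529177210903
EV_PER_HARTREE = 27.211386245988


def forces_to_atomic_units(forces_ev_per_angstrom):
    """Convert forces from eV/Angstrom to Hartree/Bohr.

    The input is assumed to have shape (nframes, natoms, 3).
    """
    forces = np.asarray(forces_ev_per_angstrom, dtype=float)
    if forces.ndim != 3 or forces.shape[-1] != 3:
        raise ValueError("expected an array of shape (nframes, natoms, 3)")
    # eV -> Hartree divides by EV_PER_HARTREE;
    # per Angstrom -> per Bohr multiplies by ANGSTROM_PER_BOHR.
    return forces * ANGSTROM_PER_BOHR / EV_PER_HARTREE


# Synthetic placeholder forces: 2 frames, 3 atoms, all components 1 eV/Angstrom.
forces_au = forces_to_atomic_units(np.ones((2, 3, 3)))
```

The resulting array can be written with numpy.save() alongside the energy file, using the same train and test frame indices so that the labels stay aligned with the structures.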

Jul 15 '24 03:07 ErjieWu