RFC: Refactor DPGEN2 with a new design
Hi community,
This RFC proposes refactoring the DPGEN2 workflow with a new design based on DFlow.
A typical DPGEN2 configuration looks like this: https://github.com/deepmodeling/dpgen2/blob/master/examples/chno/input.json
IMHO there are some issues in the configuration:
- The context configuration (executor, container, etc.) is mixed with the configuration of the algorithm.
- It is hard to validate such a configuration with a tool like pydantic, which makes it error prone.
- Data files cannot carry their own configuration, which makes it hard to train different systems at the same time.
A suggested pseudo configuration design is shown below; it borrows some ideas from the ai2-kit project.
This configuration is meant to be more formal and easier to maintain.
```yaml
# executor configuration
executor:
  bohrium: ...

# dflow configuration for each software
dflow:
  python:
    container: ai2-kit/0.12.10
    python_cmd: python3
  deepmd:
    container: deepmd/2.7.1
    dp_cmd: dp
  lammps:
    container: deepmd/2.7.1
    lammps_cmd: lmp
  cp2k:
    container: cp2k/2023.1
    cp2k_cmd: mpirun cp2k.psmp

# declare file resources as datasets before using them,
# so that we can assign extra attributes to them
datasets:
  dpdata-Ni13Pd12:
    url: /path/to/data
    format: deepmd/npy
  sys-Ni13Pd12:
    url: /path/to/data
    includes: POSCAR*
    format: vasp
    attrs:
      # allow users to define system-wise configuration,
      # so that we can explore multiple types of systems in one iteration
      lammps:
        plumed_config: !load_text plumed.inp  # custom yaml tag to embed data from another file (sketched below)
      cp2k:
        input_template: !load_text cp2k.inp

workflow:
  general:
    type_map: [C, O, H]
    mass_map: [12, 16, 1]
    max_iters: 5

  train:
    deepmd:
      init_dataset: [dpdata-Ni13Pd12]
      input_template: !load_yaml deepmd.json  # custom yaml tag to embed data from another file (sketched below)

  explore:
    # instead of using `type: lammps` to select among different software,
    # give each software of the same stage its own dedicated entry,
    # so that we can validate each configuration item with pydantic,
    # which leads to a better code structure:
    # https://github.com/chenggroup/ai2-kit/blob/main/ai2_kit/workflow/cll_mlp.py#L163-L293
    lammps:
      nsteps: 10
      systems: [ sys-Ni13Pd12 ]  # reference datasets via their keys
      # support different variable combination strategies to avoid combinatorial
      # explosion (see the helper sketched after this configuration):
      # vars defined in `explore_vars` are combined with system files via Cartesian product;
      # vars defined in `broadcast_vars` are just broadcast over system files.
      # this design is useful when there are a lot of files
      explore_vars:
        TEMP: [330, 430, 530]
      broadcast_vars:
        LAMBDA_f: [0.0, 0.25, 0.5, 0.75, 1.0]
      template_vars:
        POST_INIT: |
          neighbor bin 2.0
      plumed_config: !load_text plumed.inp

  # a select stage isolated from explore, so that we can implement
  # more complex structure selection algorithms
  select:
    model_devi:
      decent_f: [0.12, 0.18]
      limit: 50

  label:
    cp2k:
      input_template: !load_text cp2k.inp

  next:
    # specify the configuration for the next iteration;
    # it will be merged with the current configuration into a new
    # configuration file for the next round
    config: !load_yaml iter-001.yml
```
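The `!load_text` / `!load_yaml` tags used above are not standard YAML; they would have to be registered as custom constructors. Below is a minimal sketch with PyYAML (the tag names come from the pseudo configuration above; `config.yml` is a placeholder path):

```python
import yaml

def _load_text(loader: yaml.SafeLoader, node: yaml.Node) -> str:
    # embed the raw content of another file as a string value
    with open(loader.construct_scalar(node)) as f:
        return f.read()

def _load_yaml(loader: yaml.SafeLoader, node: yaml.Node):
    # embed the parsed content of another YAML/JSON file as a nested object
    with open(loader.construct_scalar(node)) as f:
        return yaml.safe_load(f)

yaml.SafeLoader.add_constructor("!load_text", _load_text)
yaml.SafeLoader.add_constructor("!load_yaml", _load_yaml)

with open("config.yml") as f:  # placeholder path
    config = yaml.safe_load(f)
```

As for the `explore_vars` / `broadcast_vars` semantics, the following hypothetical helper (not existing dpgen2 or ai2-kit code) shows one plausible implementation: `explore_vars` multiply the task list via Cartesian product, while `broadcast_vars` are distributed over the already generated tasks without multiplying their count:

```python
from itertools import product

def build_tasks(systems, explore_vars, broadcast_vars):
    # Cartesian product: every system is combined with every explore_vars value
    keys = list(explore_vars)
    tasks = [
        {"system": s, **dict(zip(keys, values))}
        for s in systems
        for values in product(*explore_vars.values())
    ]
    # broadcast: values are spread over the existing tasks (cycling if there
    # are more tasks than values) instead of multiplying the task count
    for name, values in broadcast_vars.items():
        for i, task in enumerate(tasks):
            task[name] = values[i % len(values)]
    return tasks

tasks = build_tasks(
    systems=["sys-Ni13Pd12"],
    explore_vars={"TEMP": [330, 430, 530]},
    broadcast_vars={"LAMBDA_f": [0.0, 0.25, 0.5, 0.75, 1.0]},
)
# -> 3 tasks (one per TEMP), each carrying a single LAMBDA_f value, instead of
#    the 3 x 5 = 15 tasks a full Cartesian product would create
```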
The above configuration is easy to validate with pydantic, for example:
https://github.com/chenggroup/ai2-kit/blob/main/ai2_kit/workflow/cll_mlp.py#L32-L111
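For illustration, here is a minimal sketch (with assumed model names, not the actual ai2-kit classes linked above) of how the `explore` section could be validated:

```python
from typing import Dict, List, Optional
from pydantic import BaseModel

class LammpsConfig(BaseModel):
    nsteps: int
    systems: List[str]  # dataset keys defined in `datasets`
    explore_vars: Dict[str, list] = {}
    broadcast_vars: Dict[str, list] = {}
    template_vars: Dict[str, str] = {}
    plumed_config: Optional[str] = None

class ExploreConfig(BaseModel):
    # a dedicated entry per software instead of a `type` discriminator,
    # so each entry gets its own strongly typed model
    lammps: Optional[LammpsConfig] = None

# constructor validation works in both pydantic v1 and v2
cfg = ExploreConfig(lammps={"nsteps": 10, "systems": ["sys-Ni13Pd12"]})
```

A misspelled key or a wrong value type would raise a `ValidationError` with a precise path to the offending field, which is the kind of feedback hand-rolled checks tend to miss.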
I believe a better configuration design will lead to a better software design. I am posting my thoughts for the community to review, and any feedback would be appreciated.
For the first point, it is quite easy to put the machine-related configurations together or in a separate file; in my opinion, this is a minor point.

For the second point, dpgen2 uses dargs for configuration checks and validation as well as automatic documentation generation (please refer to https://docs.deepmodeling.com/projects/dargs/en/stable/), which can play a role similar to pydantic and supports some custom features (a small illustrative sketch follows the JSON example below). Is there anything that does not meet your requirements?

For the third point, dpgen2 supports multiple datasets using different LAMMPS input templates during the configuration exploration stage, e.g.

```json
"configurations": [
    {
        "type": "alloy",
        "lattice": ["fcc", 4.57],
        "replicate": [2, 2, 2],
        "numb_confs": 30,
        "concentration": [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
    },
    {
        "type": "file",
        "prefix": "/file/prefix",
        "files": ["relpath/to/confs/*"],
        "fmt": "deepmd/npy"
    }
],
"stages": [
    [
        {
            "_comment": "stage 0, task group 0",
            "type": "lmp-md",
            "ensemble": "nvt", "nsteps": 50, "temps": [50, 100], "trj_freq": 10,
            "conf_idx": [0], "n_sample": 3
        },
        {
            "_comment": "stage 0, task group 1",
            "type": "lmp-template",
            "lmp": "template.lammps", "plm": "template.plumed",
            "trj_freq": 10, "revisions": {"V_NSTEPS": [40], "V_TEMP": [150, 200]},
            "conf_idx": [0], "n_sample": 3
        }
    ],
    [
        {
            "_comment": "stage 1, task group 0",
            "type": "lmp-md",
            "ensemble": "npt", "nsteps": 50, "press": [1e0], "temps": [50, 100, 200], "trj_freq": 10,
            "conf_idx": [1], "n_sample": 3
        }
    ]
]
```
Here, you can use different LAMMPS template files for different `conf_idx`. Is there anything that does not meet your requirements?
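Regarding the dargs point above, a small illustrative sketch (a made-up schema, not dpgen2's actual argument definitions) of the check/normalize/documentation role it plays:

```python
from dargs import Argument

# hypothetical schema for one exploration task group, for illustration only
lmp_md = Argument("lmp-md", dict, sub_fields=[
    Argument("ensemble", str, doc="thermodynamic ensemble, e.g. nvt or npt"),
    Argument("nsteps", int, doc="number of MD steps"),
    Argument("temps", list, doc="temperatures to explore"),
    Argument("trj_freq", int, optional=True, default=10, doc="trajectory dump frequency"),
])

# fill in defaults and validate the structure, similar to pydantic validation
task = lmp_md.normalize_value({"ensemble": "nvt", "nsteps": 50, "temps": [50, 100]})
lmp_md.check_value(task)

# dargs can also generate documentation from the same schema
print(lmp_md.gen_doc())
```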
Hi @zjgemi, it is not only system-wise LAMMPS configuration that is required, but also system-wise CP2K configuration. You may check the details in the pseudo configuration.