[Feature Request] Add support for symbolic atom, bond, angle, dihedral and improper types
Summary
Right now, force field types in LAMMPS are numbers starting from 1. Allowing symbolic (text-based) atom types would make many tasks easier, such as merging data from multiple simulations, reading existing topology/parameter file formats from CHARMM or Amber, and writing complex inputs. On the other hand, some of the flexibility of LAMMPS stems from the sizes of its data structures being adjustable until the simulation box is created, at which point they are locked in. This creates conflicts, and they need a solution that remains acceptable when symbolic atom types are used.
Detailed Description
Given the pervasive nature of such a feature, it has to be implemented in the core of LAMMPS. Given the large body of existing work and tools, it also has to be implemented in a way that ensures backward compatibility. And it would need to be implemented incrementally: first add the facility to manage the symbol-to-type and reverse mappings, then gradually adapt styles and other code to use it.
A key goal would be the ability to directly read topology/parameter files from either CHARMM or Amber (parameter and psf files, and prmtop files, respectively), or to read a well-supported file format for which a well-supported converter tool exists that converts without loss of information (most tools lose some information, though). It should be possible, in some way, to auto-generate types for bonded interactions from atom types. There should also be a command to populate the force field type database from a LAMMPS input script.
The major constraint to consider is that one cannot change the number of types after the box is defined (outside of completely deleting the simulation instance with "clear"). A strategy to deal with that is to reserve more atom types than actually needed and have a "default" or "disabled" set of force field parameters (a NULL atom or bond type), so that the check for unset atom types does not fail because parameters are missing.
One strategy would be to allow setting symbolic types BEFORE the box is defined, e.g. through a command like species atom add OW. This would later allow either explicitly associating a numerical atom type with a symbolic type (species atom set type OW 1) or doing so automatically. When creating the box, the keyword AUTO could be used instead of the number of atom types, and the number would then be the number of entries in the table of symbolic types.
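A minimal sketch of the bookkeeping this implies, written in Python for brevity rather than in LAMMPS C++. The class name SymbolicTypes and all of its methods are hypothetical stand-ins for the proposed species atom add, species atom set type and AUTO behavior; none of this exists in LAMMPS.

```python
class SymbolicTypes:
    """Hypothetical two-way table between symbolic and numeric atom types."""

    NULL = "NULL"  # placeholder symbol for numeric types without a symbol

    def __init__(self):
        self._symbols = []    # symbols in the order they were added
        self._to_number = {}  # symbol -> numeric type (1-based)
        self._to_symbol = {}  # numeric type -> symbol
        self.ntypes = None    # fixed once the box has been created

    def add(self, symbol):
        # stands in for the proposed "species atom add OW"
        if self.ntypes is not None:
            raise RuntimeError("cannot add symbolic types after the box is defined")
        if symbol in self._symbols:
            raise ValueError(f"symbol {symbol!r} is already defined")
        self._symbols.append(symbol)

    def set_type(self, symbol, number):
        # stands in for the proposed "species atom set type OW 1"
        if symbol not in self._symbols:
            raise KeyError(symbol)
        if number < 1 or (self.ntypes is not None and number > self.ntypes):
            raise ValueError(f"numeric type {number} is out of range")
        if self._to_symbol.get(number, symbol) != symbol:
            raise ValueError(f"numeric type {number} is already taken")
        old = self._to_number.pop(symbol, None)
        if old is not None:
            del self._to_symbol[old]
        self._to_number[symbol] = number
        self._to_symbol[number] = symbol

    def create_box(self, ntypes="AUTO"):
        # with AUTO, the number of types is the number of table entries
        n = len(self._symbols) if ntypes == "AUTO" else int(ntypes)
        if n < len(self._symbols) or any(k > n for k in self._to_symbol):
            raise ValueError("not enough numeric types for the defined symbols")
        # give every symbol without an explicit number the next free one
        free = iter([k for k in range(1, n + 1) if k not in self._to_symbol])
        for symbol in self._symbols:
            if symbol not in self._to_number:
                self.set_type(symbol, next(free))
        self.ntypes = n
        return n

    def number(self, symbol):
        return self._to_number[symbol]

    def symbol(self, number):
        # reserved but unused numeric types report the NULL symbol
        return self._to_symbol.get(number, self.NULL)
```

Passing an explicit count larger than the number of symbols to create_box corresponds to reserving extra types: the unused numbers simply map to NULL.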
We could also have other things associated with those tables, e.g. force field parameters (species atom coeff OW lj 0.15535 3.166). In most force fields, the mixed terms are derived from mixing rules, but that could be overridden as well (e.g. to support NBFIX in CHARMM) with a specific command (species atom mixed OW OH lj 0.0 3.0).
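As a rough illustration of "mixing rule unless overridden", here is a small Python sketch. It assumes arithmetic (Lorentz-Berthelot) mixing purely as an example; the function name mixed_lj and the OH coefficients are made up for illustration, while the OW values and the OW/OH override are the ones from the example commands.

```python
import math

# per-type Lennard-Jones coefficients as (epsilon, sigma)
coeffs = {
    "OW": (0.15535, 3.166),  # from "species atom coeff OW lj 0.15535 3.166"
    "OH": (0.1, 3.0),        # hypothetical values, only for illustration
}

# explicit pair overrides, keyed by the unordered pair of symbols
overrides = {
    frozenset(("OW", "OH")): (0.0, 3.0),  # "species atom mixed OW OH lj 0.0 3.0"
}

def mixed_lj(coeffs, overrides, a, b):
    """Return (epsilon, sigma) for the pair a-b.

    An explicit override wins; otherwise fall back to arithmetic mixing
    (geometric mean of epsilon, arithmetic mean of sigma). The choice of
    mixing rule is an assumption of this sketch.
    """
    key = frozenset((a, b))
    if key in overrides:
        return overrides[key]
    eps_a, sig_a = coeffs[a]
    eps_b, sig_b = coeffs[b]
    return (math.sqrt(eps_a * eps_b), 0.5 * (sig_a + sig_b))
```

Keying the override table by an unordered pair means OW/OH and OH/OW resolve to the same entry.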
The equivalent would be done for bonds/angles/etc., which would allow reading a CHARMM parameter file (except for cross terms, but that is a special case anyway).
To support data and molecule files with symbolic instead of numeric types, there could be commands like species data some.data and species molecule some.molecule that would extract symbolic type and coefficient info from data and molecule files as available. Similarly, more readers for extracting data could be implemented: species charmm xxx.par, species amber xxx.prmtop, species psf xxx.psf, etc.
As soon as the box is created, the database of force field types is constrained in which numeric types can be associated with symbolic types.
To be backward compatible, and to handle numeric types and create_box with an explicit number of atom types, the database would always contain a NULL type, and by default numerical atom types would be associated with that type. This would require different "sanity checks": where you currently get an error that pair coefficients are missing, those checks have to be extended to check against the mapping to symbols, too.
For the commands that otherwise require numerical (atom) types, we can accept strings (if present in the type database) and just have the database return the number that a symbol is associated with.
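A sketch of what that lookup could look like, in Python. The function name resolve_type is hypothetical, and the sketch assumes that symbols are never purely numeric, so that a digit-only argument can always be read as a plain numeric type.

```python
def resolve_type(arg, symbol_to_number, ntypes):
    """Turn a type argument (numeric or symbolic) into a numeric type.

    arg              -- the word from the input script, as a string
    symbol_to_number -- dict mapping symbols to numeric types
    ntypes           -- number of types fixed when the box was created
    """
    if arg.isdigit():
        # plain numeric type: behaves exactly as before
        number = int(arg)
        if not 1 <= number <= ntypes:
            raise ValueError(f"numeric type {number} is out of range 1..{ntypes}")
        return number
    if arg not in symbol_to_number:
        raise ValueError(f"unknown symbolic type {arg!r}")
    return symbol_to_number[arg]
```

With this, existing scripts that only use numbers take the first branch and are unaffected.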
The symbols could then also be enabled on output and e.g. replace the type to element mapping in dump files, or the write_data command can be instructed to replace the numbers with strings.
in case it helps (one way or another), recall the related effort implemented here #1004
@sjplimp @akohlmey I expect that the sticking points in getting this started successfully will be 1) the format of the involved files and 2) how/where to modify the code that supports symbolic reading and writing. In previous attempts, I modified the 'write_data' routines of every pair style to support both numeric and symbolic types, and added a flag to the existing 'data' commands to specify whether or not to use symbolic types. Perhaps it would be better to add separate routines for a 'species write_data' command? At least this would more obviously not impact any existing behavior.
Below, I propose a format for the 'species data' file. It resembles a force field file, but the atom, bond, etc. types are also numbered (arbitrarily), to allow for a more condensed listing of bonds, angles, etc.
this is LAMMPS species data file

44 atoms
11 atom types
42 bonds
15 bond types
74 angles
29 angle types

Pair Coeffs # lj/class2/coul/cut

1 cp 0.054 4.01
2 hc 0.12 3.81
3 ct 0.054 4.01
...

Bond Coeffs # class2

1 cp hc 1.101 345 -691.89 844.6
2 cp ct 1.53 299.67 -501.77 679.81
...

Angle Coeffs # class2

1 cp ct hc 107.66 39.641 -12.921 -2.4318
...

Atoms # full

1 1 1 0.02 12.288168 0.738732 4.374280
...

Bonds

1 1 1 5
2 1 1 4
...

Angles

1 1 5 1 4
2 2 7 1 5
3 2 8 1 5
...
In other words, rather than a single string, bonds are a string doublet, angles are a string triplet, etc. 'Species data' and 'species molecule' commands will identify the same types based on these sets of strings. This type of identification will be invaluable when automatically typing certain force field interactions during a simulation. Once a format is agreed upon, it will be possible to add this read/write symbolic types capability without modifying many existing lines of code (mostly just adding lines), and github will of course allow for keeping up to date with concurrent changes to these files.
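To illustrate how types could be identified from such sets of strings, here is a small Python sketch. It assumes that a bond or angle read in reverse order is the same type (cp-hc equals hc-cp); the helper names canonical and lookup are hypothetical, and the entries are the ones from the example file above.

```python
def canonical(names):
    """Order-independent key: a tuple of names or its reverse, whichever sorts first."""
    forward = tuple(names)
    return min(forward, forward[::-1])

# tables built from the "Bond Coeffs" and "Angle Coeffs" sections of the example
bond_types = {
    canonical(("cp", "hc")): 1,
    canonical(("cp", "ct")): 2,
}
angle_types = {
    canonical(("cp", "ct", "hc")): 1,
}

def lookup(table, *atom_types):
    """Find the numeric bonded type for a doublet/triplet of atom type names."""
    return table[canonical(atom_types)]
```

For example, lookup(bond_types, "hc", "cp") and lookup(bond_types, "cp", "hc") both resolve to bond type 1. Impropers would need a different canonical form, since their atom order is not simply reversible.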
your proposal is not sufficient. you are implicitly requiring that the types for bonded interactions are derived from the constituent atom types. the LAMMPS code is by construction more general and thus it has to allow to override this.
if we introduce a new format, we should not use something that looks similar to but is not compatible with regular data files. there are two main reasons: first off, people will expect those to be compatible and be confused when tools expecting numbers choke or crash on the strings. second - and most important - this is a horrible file format (same as the data file format) for parsing with external tools (well, the parser in LAMMPS is needlessly convoluted as well), since it is not self-descriptive: its specific format depends on details not encoded in the file itself, so additional information is required or has to be guessed to be able to parse it correctly.
Are you referring to my suggestion to use sets of strings for types? If so, this does not make it any less general than using a single string. If it must be one or the other, using sets of strings will cleanly solve the problem of 'auto-generating types for bonded interactions from atomic types' in a way that is compatible with the way the majority of force fields are formatted (in which case being forced to choose a single arbitrary symbol for bond types adds unnecessary complexity).
it sounds like you want to use this self-descriptive file format for regular data files as well. if so, that sounds like a somewhat separate issue, which would prolong or prevent adding support for symbolic types if a simpler text file is never supported in this case. assuming that regular data files will continue to be supported, or perhaps be very slowly phased out, are we going to have to settle for two formats (for regular data files)? regarding 'species data' files, it may actually be helpful that they are formatted similarly to regular data files, in terms of adapting existing tools that read regular data files.
if you want to have the force field type automatically inferred from the constituent atom types, it should be clearly flagged as such, e.g. with a type called AUTO. having atom types replicated in the bonded interaction types makes manually editing such a data file a nightmare. you also need to consider that many force fields make use of wildcards (e.g. the outer atoms of a dihedral) or may have multiple entries for the same bonded interaction. having a single string makes for a simple way for inter-format compatibility, i.e. the number that is currently used is just a special case of a string. there are many pitfalls with trying to get this off the ground and we have to take into account all the unexpected things that people will try to do once pandora's box is open. the problem with such changes in a code like LAMMPS is that you will have to live with them for a very long time.
it is near impossible to change well established formats and features in LAMMPS in a way that is not backward compatible. thus there is only a minimal chance to have that replaced, but then again, there is no need to do that since one can devise a new command, if desired (it is only more work).
that said, in my experience, a file format that is similar looking but not compatible is worse than one that looks different. it does not simplify supporting it but makes it harder and makes it more difficult to convey to users that it is different and not compatible. and i will object to any new file format that inherits the problems of existing ones (which doesn't mean that they won't get added, since I am not the person making the final decision)
right, but how to make an inference, other than knowing which two atom types correspond to that bond type?
in most cases, I suspect bonded interactions will be named after the constituent atom types, anyway. coincidentally, using wildcards for the outer atoms of a dihedral is precisely what brought me back to this topic. for the use case i think you are describing, an asterisk could be used within, or as the entirety of, one of the strings, in the set of strings. for my use case, this format provides a simple way to automate the re-assignment of dihedrals
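A rough Python sketch of the asterisk idea for dihedrals, using the standard fnmatch module for the per-string matching. The function names, the rule of preferring the pattern with the fewest asterisks, and the labels "generic" and "specific" are assumptions made for illustration, not anything taken from an actual force field file.

```python
from fnmatch import fnmatchcase

def matches(pattern, atoms):
    """True if the atom type names fit the pattern, read forward or backward.

    Each pattern entry may contain '*', either as the whole string or
    inside it. Note that fnmatch also treats '?' and '[...]' as special.
    """
    atoms = tuple(atoms)

    def one_way(seq):
        return len(seq) == len(pattern) and all(
            fnmatchcase(name, pat) for name, pat in zip(seq, pattern)
        )

    return one_way(atoms) or one_way(atoms[::-1])

def find_dihedral_type(table, atoms):
    """Pick the matching entry with the fewest asterisks (earliest entry on ties).

    table is a list of (pattern, label) pairs.
    """
    hits = [
        (sum(p.count("*") for p in pattern), index, label)
        for index, (pattern, label) in enumerate(table)
        if matches(pattern, atoms)
    ]
    if not hits:
        raise KeyError(f"no dihedral type matches {tuple(atoms)}")
    return min(hits)[2]

# hypothetical table: one wildcard entry and one fully specified entry
dihedral_table = [
    (("*", "ct", "ct", "*"), "generic"),
    (("hc", "ct", "ct", "hc"), "specific"),
]
```

Here hc-ct-ct-hc resolves to the fully specified entry, while cp-ct-ct-hc falls back to the wildcard one.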
Okay. as a mitigating effort, it could be interesting to encourage or force a specific extension when reading/writing these new 'species' data files
@akohlmey @jrgissing Three high-level Qs. Just considering atom types. And focusing on what would be most useful for users.
- Which commands would benefit most from allowing use of strings in place of numeric types?
- Is the goal to define one mapping of numbers <-> strings, then enable use of either flavor of atom type anywhere in the input script, all input files, all output files? Or something less than that?
- Is it important to allow the user to define the mapping, or could it be auto-generated? Is it OK to require the final mapping to define consecutive numbers, or do you want arbitrary numbers (gaps between them)?
- Which commands would benefit most from allowing use of strings in place of numeric types?
Generally speaking, using strings in the read_data command will help any user who wishes for more human-readability, reusability and interoperability of their data files (and molecule templates). Secondly, any command that inputs atom types would be more generally valid, for a given force field.
- Is the goal to define one mapping of numbers <-> strings, then enable use of either flavor of atom type anywhere in the input script, all input files, all output files? Or something less than that?
Initially, I would shoot for just input/output files. Regarding the input script, the corresponding numeric types are retained explicitly, in the format above (at least when using a single data file, see below). Secondly, I would consider supporting strings in the input script, starting with commands where they are most helpful.
- Is it important to allow the user to define the mapping, or it could be auto generated? Is it OK to require the final mapping to define consecutive numbers, or do you want arbitrary numbers (gaps between them)?
Above, I suppose I am proposing to have users define the numbers <-> strings mapping. This may be important for backwards compatibility, and does not limit the usefulness of the feature, unless they must also do so when reading multiple data files. In this special case, (numeric) types in the second data file would be overridden, and if necessary the final mapping could be obtained by writing out a 'species data' file. Yes, I think the final mapping would certainly have to be consecutive numbers.
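A small Python sketch of the multiple-data-file case: the second file's numeric types are overridden so that the final mapping stays consecutive. The function name merge_type_maps is hypothetical, and the sketch assumes the first file's mapping is already consecutive from 1.

```python
def merge_type_maps(first, second):
    """Merge two symbol -> number maps into one consecutive mapping.

    Returns (merged, renumber), where renumber maps each numeric type
    used in the second file to its number in the merged mapping.
    """
    merged = dict(first)  # the first file keeps its numbers
    renumber = {}
    # walk the second file's types in their original numeric order
    for symbol, old_number in sorted(second.items(), key=lambda item: item[1]):
        if symbol not in merged:
            merged[symbol] = len(merged) + 1  # next consecutive number
        renumber[old_number] = merged[symbol]
    return merged, renumber
```

For instance, merging {"cp": 1, "hc": 2} with a second file that uses {"hc": 1, "ct": 2} keeps cp and hc as 1 and 2, appends ct as 3, and yields the renumbering 1 -> 2, 2 -> 3 for atoms read from the second file.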
as already stated in the previous discussions, I disagree with the approach by @jrgissing and have outlined my proposed strategy at the top of the issue. I believe his proposal will limit what can be done or will require significant rewrites to fully realize the potential of using symbolic types in LAMMPS.
it is difficult for me to comment/compare, without knowing key aspects of your proposal, such as the format of the data file
after looking at several tools that read/write lammps data files, it seems trivial to support the new 'species data' file format above, and this new format could be detected automatically
i consider it a mistake to look at this feature from the perspective of the file format and build the rest around it. for me the core is the facility that processes symbolic types, associates them with force field parameters, and handles everything related to that. that needs to be done first, and it needs to be done in a way that allows symbolic types to be used in multiple places in LAMMPS, for multiple current and future features, and in a more general way than what you are currently looking at. i have no interest in establishing something that is only a partial solution and that may interfere with the primary motivation for supporting symbolic types. i have discussed this with @sjplimp offline and if you want to move forward without my consent, you need to discuss it with him. he has the final decision. whether what you propose is easy to implement or not is irrelevant to me. i have nothing else to say for now.