General API for preparing IOData object before dumping

Open tovrstra opened this issue 5 years ago • 1 comments

Motivation

At the moment, it is mostly assumed that an IOData instance contains all the right attributes in the right form before it is passed to a dump_one call. Some file formats (WFN, WFX and potentially also FCHK) modify the IOData object to make it compatible with the format. Typical modifications include:

  • WFX/WFN: converting basis to Cartesian functions (with transformation of the MO coeffs)
  • WFX/WFN: decontracting the basis (with transformation of the MO coeffs)
  • FCHK: converting (natural) orbitals to a density matrix, and recontracting basis sets.
  • In general (possibly too far-fetched): reverse-engineering contractions from WFN/WFX files.

This is problematic for several reasons:

  • It makes dump_one functions long and complicated.
  • Users may not be aware of the conversion taking place, which may result in loss of information.
  • It may sometimes be of interest to disable conversions, e.g. when they are optional or when the user does not want any conversions (and prefers an exception to be raised instead). The latter is typical when dealing with conversions of large data sets, where data preservation is desirable and unintended loss of information due to conversion is not wanted.
  • Some of the current conversions introduce redundant data, which results in inefficient use of storage.

See also:

  • https://github.com/theochem/iodata/pull/164#discussion_r467966558
  • https://github.com/theochem/iodata/pull/252#pullrequestreview-2099358284

Proposal

  • Add an optional prepare_dump function to the file format modules. If present, it takes an IOData instance as argument and returns a potentially modified one. The given IOData instance is not modified. An option allow_changes=False should be added to allow disabling any conversion. If this flag is set to False and the file cannot be written without conversion, an exception is raised. If this flag is set to True, a warning is emitted when a conversion is applied.

  • The dump_one and dump_many functions in the file format modules call the new prepare_dump function before dumping.

  • Add an allow_changes=False option to the dump_one and dump_many functions in the file format modules. This is passed on to the prepare_dump function. The dump functions return the potentially modified IOData instance(s).

  • Add an allow_changes=False option to the dump_one and dump_many functions in the module iodata.api. This is passed on to the dump functions of the selected file format. These dump functions also return the potentially modified IOData instance(s).

  • Factor out some of the reusable utility functions to modify the IOData object, e.g. manipulations of basis set and corresponding changes to MO coefficients.

  • Add an option to the script iodata-convert to enable or disable modifications before dumping.

  • Add a basic sanity check to dump_one and dump_many that verifies required attributes are not None before creating a file. Without such a check, a missing attribute raises an error only after the output file has been opened, which may overwrite the output with an empty file. That is never useful and may occasionally lead to data loss. This type of pre-flight check could be added to prepare_dump, but it is better to write one general implementation for all file formats, so it is always checked.
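The proposal can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the IOData implementation: the Data class, its basis_kind attribute, the REQUIRED tuple and the toy output format are all hypothetical stand-ins, and DumpError and DumpWarning are defined locally. It shows the three behaviours described above: the input is never modified, allow_changes=False raises instead of converting, and required attributes are checked before the output file is created.

```python
import copy
import warnings
from dataclasses import dataclass
from typing import Optional


class DumpError(Exception):
    """Raised when data cannot be written in the requested format."""


class DumpWarning(Warning):
    """Emitted when data is converted before it is written."""


@dataclass
class Data:
    """Hypothetical stand-in for IOData, reduced to two attributes."""

    atnums: Optional[list] = None
    basis_kind: str = "pure"  # "pure" or "cartesian" (illustrative only)


# Attributes that the toy format below cannot do without (hypothetical).
REQUIRED = ("atnums",)


def prepare_dump(data, allow_changes=False):
    """Return data in a form that the toy format can store."""
    if data.basis_kind == "cartesian":
        # No conversion needed: hand back the input object itself.
        return data
    if not allow_changes:
        raise DumpError("Cartesian functions required; use allow_changes=True.")
    warnings.warn("Converting basis to Cartesian functions.", DumpWarning, stacklevel=2)
    result = copy.copy(data)  # shallow copy, so the caller's object stays untouched
    result.basis_kind = "cartesian"
    return result


def dump_one(data, filename, *, allow_changes=False):
    """Check, prepare and write data; return the possibly modified object."""
    # Pre-flight check: runs before the file is opened, so a failure
    # can never leave an empty or truncated output file behind.
    missing = [name for name in REQUIRED if getattr(data, name) is None]
    if missing:
        raise DumpError(f"Missing required attributes: {missing}")
    data = prepare_dump(data, allow_changes)
    with open(filename, "w") as fh:
        fh.write(f"{data.basis_kind} {data.atnums}\n")
    return data
```

Returning the prepared object from dump_one lets the caller inspect exactly what was written, which matters when a conversion was applied silently apart from the warning.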

TODO list

  • [x] Validate that required fields are present in dump_one and dump_many. In dump_one, this can be done before even creating the file. In dump_many this check is possible on the first frame before creating a file, not for later ones. See #337
  • [x] Implement a light version of the prepare_dump API and use it for validity checks: JSON, FCHK, WFX, WFN, Molden, Molekel. This does not add any actual conversion yet. (These will be implemented in later pull requests.) See #344
  • [x] Split FileFormatError into LoadError and DumpError and update formats modules to use these consistently. The current use of exceptions in the formats module is not consistent. The same applies to FileFormatWarning. See #345
  • [x] Replace lit.error by raising LoadError directly and extend LoadError with the logic in lit.error. Update contributor guide accordingly. See #348
  • [x] Replace lit.warn by direct warnings.warn() using an improved LoadWarning class that contains the logic now implemented in lit.warn(). Also extend DumpWarning with a file or filename argument, like LoadWarning. Do not use lit.warn. Explain LoadWarning in the contributor guide. See #349
  • [x] Extend DumpError with a file or filename argument, like LoadError. Directly subclass exceptions from Exception instead of ValueError. (The latter improves testing when using pytest.raises.) See #349
  • [x] Convert MOs to unrestricted if format does not support occs_aminusb: WFN, WFX, Molden, Molekel. #352
  • [x] Replace catch-all constructs like warnings.catch_warnings() in unit tests by more specific warnings. See #353
  • [x] Make optional arguments in iodata.api mandatory keywords, by inserting *, in the argument list. See #355
  • [x] Rename argument iodata in iodata.api to data for consistency with the rest of the code. See #356
  • [x] Turn Shell attributes into arrays (now lists) with converter functions, in analogy to IOData attributes. #371
  • [x] Move convert_* functions from basis and orbitals to convert module, which becomes the lower-level analog of the prepare module. It does similar things, but without the context of dumping data to files. #372
  • [x] Write a prepare function to segment a basis before dumping. This can be used by the following formats: Molden, WFN, WFX, FCHK (except for SP shell), Molekel. There should be an option to leave SP shells in place while segmenting all others. At the same time, fix the third point in #256. See #373
  • [ ] Write a prepare function to sort shells by center. (Molden and Molekel assume this.)
  • [ ] Decontract basis and convert MOs in prepare_dump for WFN and WFX. This would also fix #258.
  • [ ] Convert MOs to Cartesian basis in prepare_dump for WFN and WFX. This would also fix #259.
  • [ ] Make all convert functions consistent: when no changes are needed, they return references to the input objects.
  • [x] Add --allow-changes option to command-line interface. See #374
  • [ ] Extend the contributor guide with the following:
    • Explain how to raise errors and warnings in dump_one
    • Explain how to write a prepare_dump function, and how to raise errors and warnings
  • [x] Split getting started into four pages: command-line, loading, dumping and input writing, as also mentioned in #210. See #351
  • [ ] Expand getting started page on command-line usage to illustrate the --allow-changes option.
  • [ ] Expand getting started page on dumping, to illustrate the allow_changes keyword argument.

tovrstra, Aug 23 '20 09:08

Another example of required conversion is discussed in #252: many formats do not support restricted orbitals with "unrestricted occupation numbers". In this case, the orbitals need to be converted to unrestricted form to be able to write a file.
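A minimal sketch of such a conversion, in plain Python without NumPy. The function name to_unrestricted and the data layout (one list of coefficients per orbital) are assumptions made for illustration, not the IOData API. It assumes that occs holds the total (alpha plus beta) occupation of each spatial orbital and occs_aminusb the alpha-minus-beta difference, so the spin-resolved occupations are (occs + occs_aminusb)/2 and (occs - occs_aminusb)/2.

```python
def to_unrestricted(coeffs, occs, occs_aminusb):
    """Split restricted orbitals into separate alpha and beta sets.

    coeffs: one list of basis coefficients per orbital (hypothetical layout).
    occs: total occupation (alpha + beta) of each orbital.
    occs_aminusb: alpha-minus-beta occupation of each orbital.
    Returns (coeffs_ab, occs_ab) with all alpha orbitals first, then beta.
    """
    if not (len(coeffs) == len(occs) == len(occs_aminusb)):
        raise ValueError("coeffs, occs and occs_aminusb must have equal length.")
    # Spin-resolved occupations follow from the sum and the difference.
    occs_a = [(n + d) / 2 for n, d in zip(occs, occs_aminusb)]
    occs_b = [(n - d) / 2 for n, d in zip(occs, occs_aminusb)]
    # Both spin channels share the same spatial orbitals; copy them so
    # the result does not alias the input.
    coeffs_a = [list(orb) for orb in coeffs]
    coeffs_b = [list(orb) for orb in coeffs]
    return coeffs_a + coeffs_b, occs_a + occs_b


# Three orbitals in a two-function basis: doubly occupied, singly occupied
# (alpha only) and empty.
coeffs_ab, occs_ab = to_unrestricted(
    coeffs=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    occs=[2.0, 1.0, 0.0],
    occs_aminusb=[0.0, 1.0, 0.0],
)
```

The conversion doubles the stored coefficients, which is the kind of redundancy the motivation above mentions, so it makes sense to apply it only when the target format requires it.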

tovrstra, Jun 06 '24 15:06