cclib adding metadata dictionary to cjsonwriter

Here is a quick addition to the cjson writer which would allow the metadata dictionary to be output as its own metadata object. This is being discussed as a desired feature in both #1105 and #1143. I haven't used cjson in my own work so this solution may not be in the right format. Just wanted to discuss a possible approach to incorporating this.

Jul 24 '22 18:07 amandadumi

Codecov Report

Patch coverage: 65.00% and project coverage change: +0.01 :tada:

Comparison is base (d533158) 87.79% compared to head (a077cf7) 87.80%.

:exclamation: Current head a077cf7 differs from pull request most recent head a7f2547. Consider uploading reports for the commit a7f2547 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1148      +/-   ##
==========================================
+ Coverage   87.79%   87.80%   +0.01%     
==========================================
  Files          64       69       +5     
  Lines       13435    13535     +100     
==========================================
+ Hits        11795    11885      +90     
- Misses       1640     1650      +10

Impacted Files	Coverage Δ
cclib/parser/data.py	`80.80% <ø> (ø)`
cclib/io/cjsonwriter.py	`89.44% <65.00%> (-4.18%)`	:arrow_down:

... and 6 files with indirect coverage changes

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

Jul 24 '22 18:07 codecov[bot]

This looks fine to me, but we (probably long ago) have veered off the CJSON format. I used this snippet on the result of ccwrite cjson data/ORCA/basicORCA5.0/dvb_rocis.out and it looks ok; the metadata block is appropriately ignored. ORCA dvb_ir.out causes it to segfault, though it does that even on output from before this PR.

Unless we can do some sort of standardization with Marcus (who is definitely still using modified CJSON) and Geoff (who is reworking spectroscopy representation among other things I don't recall), we should have even just a comment in docs that this is "cclib-extended Chemical JSON".

Jul 26 '22 04:07 berquist

Let's talk over Slack or on the Avogadro forum. I think adding the metadata block is a good idea.

I'm not sure if Marcus is doing anything with CJSON right now, it seems as if Brookhaven is keeping him occupied, as well as Tomviz.

Jul 26 '22 15:07 ghutchis

I'd certainly like to see something in which reading in the metadata will help configure the input generator GUI (e.g., load an MP2 calculation, and it shows an MP2 calculation by default).

Jul 26 '22 15:07 ghutchis

Here's a more recent reference point on the cjson code: https://github.com/OpenChemistry/avogadro-cclib/blob/main/cclibScript.py

I added some support for metadata here: https://github.com/OpenChemistry/avogadro-cclib/blob/4a2524fb0c17ad72890a9c8fbe915e421ffa0171/cclibScript.py#L198

But IIRC I didn't add anything on the Avogadro side to parse it yet.

Suggestions welcome .. in particular, I'd like to add total charge and spin multiplicity soon, probably under "properties" to properly roundtrip files (e.g., for bond perception, etc.)

Jul 26 '22 17:07 ghutchis

I'm very open to standardization, so there can be something relatively well defined when Avogadro 2.0 is released.

One relevant example from the chemicaljson repository (https://github.com/OpenChemistry/chemicaljson/blob/master/examples/water_psi4_oc.cjson)

has:

"inputParameters": {
    "basis": "6-31g",
    "functional": "b3lyp",
    "task": "energy",
    "theory": "dft"
  },

This seems like an extremely good idea to enable users to "round trip" calculations. A few other useful metadata for the inputParameters

[ ] dispersion correction type (D30, D3BJ, D4, ..)
[ ] solvation model(?)

The "task" key is obviously not in metadata but most should be fairly easy to deduce from cclib:

[ ] energy (single point = default)
[ ] optimization (multiple optimization steps exist)
[ ] frequencies (vibfreqs exist)
[ ] transition state
[ ] excitations (etenergies exist)

I'm also open to adding more "task" options.

These all seem like they'd be useful to pull out into the "inputParameters" key.. to be used by the Python input generator scripts.

The other metadata .. I'm not sure whether it's better to have something under "properties" or as you've proposed having a new key "metadata."

Aug 03 '22 20:08 ghutchis

I went through the pull requests (#1143, #1142, #1141) and collected the data that they would want to introduce into the cjson, initially proposed as additions to a metadata section. As @ghutchis mentioned, many of these would fit in inputParameters field. There are a few that I am unclear on where they should be placed, I've indicated this with (?) following the data. Specifically, information about hardware related information.

basis_set (already exists inputParameters) calc_type (already exists in inputParameters, but would be renamed to theory) dispersion (add to inputParameters) functional (inputParameters) grid (?, this is a string describing the grid type used ) keywords_line (add to inputParameters) memory (?) processors (?) qm_program (maybe inputParameters) run_date (?) solvation (could be added to inputParameters)

Aug 24 '22 01:08 amandadumi

Most of these seem useful for inputParameters. Way back when there was the JSON planning meeting, I suggested that the format needed to be good enough to work both as input and output - since many users might want a format as part of a workflow (e.g., post-calculation analysis, localized MOs, etc.). Besides that, most output formats include the input parameters.

We also discussed that "extensions" were probably useful / needed for specific program options. (For example, I'm not sure that Grid5 from Orca corresponds to anything specific in other programs.)

Let me clean up the CJSON repo, post the draft schema and I'll submit a patch here based on the avogadro-cclib reader.

I think my only question is the run_date which feels much more like metadata to me, not inputParameters or properties.

Aug 24 '22 02:08 ghutchis

This sounds like a plan, thank you. I'll work on placing these keywords in the right place then.

I do agree that run_date does fit best in metadata.

Aug 27 '22 23:08 amandadumi

Hi No pressure just bumping this. Have you already addressed Eric's review locally?

May 30 '23 23:05 shivupa