adding metadata dictionary to cjsonwriter
Here is a quick addition to the cjson writer which would allow the metadata dictionary to be output as its own metadata object. This is being discussed as a desired feature in both #1105 and #1143. I haven't used cjson in my own work so this solution may not be in the right format. Just wanted to discuss a possible approach to incorporating this.
Codecov Report
Patch coverage: 65.00% and project coverage change: +0.01 :tada:
Comparison is base (
d533158) 87.79% compared to head (a077cf7) 87.80%.
:exclamation: Current head a077cf7 differs from pull request most recent head a7f2547. Consider uploading reports for the commit a7f2547 to get more accurate results
Additional details and impacted files
@@ Coverage Diff @@
## master #1148 +/- ##
==========================================
+ Coverage 87.79% 87.80% +0.01%
==========================================
Files 64 69 +5
Lines 13435 13535 +100
==========================================
+ Hits 11795 11885 +90
- Misses 1640 1650 +10
| Impacted Files | Coverage Δ | |
|---|---|---|
| cclib/parser/data.py | 80.80% <ø> (ø) |
|
| cclib/io/cjsonwriter.py | 89.44% <65.00%> (-4.18%) |
:arrow_down: |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
This looks fine to me, but we (probably long ago) have veered off the CJSON format. I used this snippet on the result of ccwrite cjson data/ORCA/basicORCA5.0/dvb_rocis.out and it looks ok; the metadata block is appropriately ignored. ORCA dvb_ir.out causes it to segfault, though it does that even on output from before this PR.
Unless we can do some sort of standardization with Marcus (who is definitely still using modified CJSON) and Geoff (who is reworking spectroscopy representation among other things I don't recall), we should have even just a comment in docs that this is "cclib-extended Chemical JSON".
Let's talk over Slack or on the Avogadro forum. I think adding the metadata block is a good idea.
I'm not sure if Marcus is doing anything with CJSON right now, it seems as if Brookhaven is keeping him occupied, as well as Tomviz.
I'd certainly like to see something in which reading in the metadata will help configure the input generator GUI (e.g., load an MP2 calculation, and it shows an MP2 calculation by default).
Here's a more recent reference point on the cjson code:
https://github.com/OpenChemistry/avogadro-cclib/blob/main/cclibScript.py
I added some support for metadata here: https://github.com/OpenChemistry/avogadro-cclib/blob/4a2524fb0c17ad72890a9c8fbe915e421ffa0171/cclibScript.py#L198
But IIRC I didn't add anything on the Avogadro side to parse it yet.
Suggestions welcome .. in particular, I'd like to add total charge and spin multiplicity soon, probably under "properties" to properly roundtrip files (e.g., for bond perception, etc.)
I'm very open to standardization, so there can be something relatively well defined when Avogadro 2.0 is released.
One relevant example from the chemicaljson repository (https://github.com/OpenChemistry/chemicaljson/blob/master/examples/water_psi4_oc.cjson)
has:
"inputParameters": {
"basis": "6-31g",
"functional": "b3lyp",
"task": "energy",
"theory": "dft"
},
This seems like an extremely good idea to enable users to "round trip" calculations. A few other useful metadata for the inputParameters
- [ ] dispersion correction type (D30, D3BJ, D4, ..)
- [ ] solvation model(?)
The "task" key is obviously not in metadata but most should be fairly easy to deduce from cclib:
- [ ] energy (single point = default)
- [ ] optimization (multiple optimization steps exist)
- [ ] frequencies (
vibfreqsexist) - [ ] transition state
- [ ] excitations (
etenergiesexist)
I'm also open to adding more "task" options.
These all seem like they'd be useful to pull out into the "inputParameters" key.. to be used by the Python input generator scripts.
The other metadata .. I'm not sure whether it's better to have something under "properties" or as you've proposed having a new key "metadata."
I went through the pull requests (#1143, #1142, #1141) and collected the data that they would want to introduce into the cjson, initially proposed as additions to a metadata section. As @ghutchis mentioned, many of these would fit in inputParameters field. There are a few that I am unclear on where they should be placed, I've indicated this with (?) following the data. Specifically, information about hardware related information.
basis_set (already exists inputParameters) calc_type (already exists in inputParameters, but would be renamed to theory) dispersion (add to inputParameters) functional (inputParameters) grid (?, this is a string describing the grid type used ) keywords_line (add to inputParameters) memory (?) processors (?) qm_program (maybe inputParameters) run_date (?) solvation (could be added to inputParameters)
Most of these seem useful for inputParameters. Way back when there was the JSON planning meeting, I suggested that the format needed to be good enough to work both as input and output - since many users might want a format as part of a workflow (e.g., post-calculation analysis, localized MOs, etc.). Besides that, most output formats include the input parameters.
We also discussed that "extensions" were probably useful / needed for specific program options. (For example, I'm not sure that Grid5 from Orca corresponds to anything specific in other programs.)
Let me clean up the CJSON repo, post the draft schema and I'll submit a patch here based on the avogadro-cclib reader.
I think my only question is the run_date which feels much more like metadata to me, not inputParameters or properties.
This sounds like a plan, thank you. I'll work on placing these keywords in the right place then.
I do agree that run_date does fit best in metadata.
Hi No pressure just bumping this. Have you already addressed Eric's review locally?