Hard-coded tables in CMOR
Various tables are included in the software itself, which forces the user to upgrade CMOR whenever those tables are updated. In particular, the following:
$ cd cmor/
$ find . -name '*.json' | grep -v Test
./Lib/CV_experiments.json
./Lib/experiments_id.json
./LibCV/PrePARE/out_names_tests.json
Could I suggest moving them to a separate repo so that they can be updated independently? (Possibly added to https://github.com/PCMDI/cmip6-cmor-tables, although I don't know enough to be sure whether this is appropriate or not.)
Thanks.
@alaniwi thanks for commenting. We are actually in the process of cleaning up repos (see PCMDI/xml-cmor3-database#52) and @mauzey1 has also flagged that CMOR should be carrying the primary Tables repo (PCMDI/cmip6-cmor-tables) along for the ride in conda installations, see #529.
Where, other than in tests, are the files referenced in https://github.com/PCMDI/cmor/issues/536#issue-482220704 actually used? In order to configure and use CMOR3, a user first needs to clone and configure a Table subdir for their use (currently either cmip6-cmor-tables, input4mips-cmor-tables or obs4mips-cmor-tables).
@durack1 Issue #529 is about making a git submodule of cmip6-cmor-tables inside the CMOR repo, not including it in the conda installation.
@alaniwi The file out_names_tests.json is used by PrePARE to find a file's variable by looking at its "out name", which is a shortened version of the variable name used in the output file name. out_names_tests.json is not part of the CMIP6 tables. The other two files appear not to be used by CMOR/PrePARE and will probably be removed. @durack1 and @taylor13, do you know the purpose of experiments_id.json and CV_experiments.json?
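For illustration only (the file name below is made up), the out name is simply the leading facet of a CMIP6-style file name, so a dataset whose table entry is ta27 can be written out as plain ta:

# Hypothetical example of <out_name>_<table_id>_<source_id>_... file naming
filename = "ta_6hrPlevPt_EXAMPLE-MODEL_piControl_r1i1p1f1_gn_185001010600-185912312100.nc"
out_name, table_id = filename.split("_")[:2]
print(out_name, table_id)  # -> ta 6hrPlevPt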
@mauzey1 good question, those two experiments files are not up to date and, as long as they're not being used by the code, should also be purged as part of a repo cleanup. This repo should be for the CMOR software (along with its internal tests), not for also hosting table files.
@durack1 The specific issue I encountered was related to PrePARE using the out_names_tests.json file.
@durack1 Should out_names_tests.json be a part of the CMIP6 CMOR tables repo? If a change were to happen in the CMIP6 tables that would require changing out_names_tests.json, then we would only need to update the table repo instead of both the tables and CMOR.
@mauzey1 it is not clear to me what the out_names_tests.json file actually contains. It appears to be a look-up table of sorts for the contents of cmip6-cmor-tables; if that is true, and the file is not CMOR-specific, then moving it to the cmip6-cmor-tables repo makes complete sense to me.
@durack1 It is CMOR-specific, since the index values in each table-out_name entry correspond to functions in PrePARE that test which variable name to use.
I am thinking we might not even need out_names_tests.json. We could look up the out_name entries in the table identified by the file name, and then determine which test is needed to pick a variable name.
@durack1 I wanted to get back to this issue of hard-coded tables in CMOR/PrePARE. out_names_tests.json is the last of these files.
As I stated previously, we shouldn't need this file, since the out_name attributes of the variable entries in the tables should handle it. The current version of PrePARE gets the table name and variable out_name from the dataset's file name, looks up which variable property to check in out_names_tests.json, and then uses that check to determine which variable name it should be.
An example would be a dataset for ta27 in the 6hrPlevPt table. PrePARE would get the out_name ta and the table name 6hrPlevPt from the dataset's file name, concatenate them into 6hrPlevPt_ta, and use that as a key into out_names_tests.json. It would then see that it has to determine whether the variable name is ta27 or ta7h, using the has_27_pressure_levels and has_7_pressure_levels functions respectively.
https://github.com/PCMDI/cmor/blob/c805fe0fcc509bf4ebed2024e193a009d791f9df/LibCV/PrePARE/out_names_tests.json#L27-L30
https://github.com/PCMDI/cmor/blob/c805fe0fcc509bf4ebed2024e193a009d791f9df/LibCV/PrePARE/PrePARE.py#L340-L348
https://github.com/PCMDI/cmor/blob/c805fe0fcc509bf4ebed2024e193a009d791f9df/LibCV/PrePARE/PrePARE.py#L408-L434
PrePARE will determine that the dataset should have the variable name ta27 if it has 27 pressure levels.
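As a rough Python sketch of that flow (not the actual PrePARE code; the dictionary below is a simplified stand-in for the relevant out_names_tests.json entry, and the test functions are only illustrative):

# Simplified stand-in for the "6hrPlevPt_ta" entry in out_names_tests.json;
# the real file maps each table/out_name key to the tests PrePARE should run.
out_names_tests = {
    "6hrPlevPt_ta": {
        "ta27": "has_27_pressure_levels",
        "ta7h": "has_7_pressure_levels",
    },
}

def has_27_pressure_levels(variable):
    # Illustrative check: true when the data carry a 27-level pressure axis.
    return variable.get("plev_size") == 27

def has_7_pressure_levels(variable):
    return variable.get("plev_size") == 7

TESTS = {
    "has_27_pressure_levels": has_27_pressure_levels,
    "has_7_pressure_levels": has_7_pressure_levels,
}

def resolve_variable_name(table_id, out_name, variable):
    # Pick the full variable name (e.g. ta27) from the file's out name (ta).
    key = "{}_{}".format(table_id, out_name)
    for candidate, test_name in out_names_tests.get(key, {}).items():
        if TESTS[test_name](variable):
            return candidate
    return out_name

# A file named ta_6hrPlevPt_... holding 27 pressure levels resolves to ta27.
print(resolve_variable_name("6hrPlevPt", "ta", {"plev_size": 27}))  # -> ta27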
I propose a different method of validating the file name and variable name. First, use the variable and table names from the file to look up the variable entry in the table, for example ta27 in the 6hrPlevPt table:
"ta27": {
"frequency": "6hrPt",
"modeling_realm": "atmos",
"standard_name": "air_temperature",
"units": "K",
"cell_methods": "area: mean time: point",
"cell_measures": "area: areacella",
"long_name": "Air Temperature",
"comment": "Air Temperature",
"dimensions": "longitude latitude plev27 time1",
"out_name": "ta",
"type": "real",
"positive": "",
"valid_min": "",
"valid_max": "",
"ok_min_mean_abs": "",
"ok_max_mean_abs": ""
}
From there we can get the out_name attribute to validate the name used in the file, and we can also perform the has_27_pressure_levels check because plev27 is present in the dimensions list. We could do similar checks with plev4 and plev7h. The is_climatology check amounts to finding -clim in the file name and Clim at the end of the variable name, has_land_in_cell_methods amounts to finding land in the variable name and in the cell_methods attribute, and has_3_dimensions checks whether the variable has 3 dimensions. We might not even need these checks if the CMOR CV already handles them.
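Here is a minimal sketch of that proposed check, treating the parsed table as a plain dictionary and assuming the file's coordinate identifiers can be matched against the table's dimension names (a simplification of what PrePARE would actually have to do):

# "table" stands in for a parsed CMIP6 table JSON; only the fields needed
# for this example are included.
table = {
    "variable_entry": {
        "ta27": {"out_name": "ta",
                 "dimensions": "longitude latitude plev27 time1"},
        "ta7h": {"out_name": "ta",
                 "dimensions": "longitude latitude plev7h time1"},
    }
}

def resolve_from_table(table, file_out_name, file_dims):
    # Find the table entry whose out_name matches the file name and whose
    # dimensions are all present in the file (illustrative logic only).
    for var_name, entry in table["variable_entry"].items():
        dims = set(entry["dimensions"].split())
        if entry["out_name"] == file_out_name and dims <= file_dims:
            return var_name
    return None

# A file written out as "ta" on (time1, plev27, latitude, longitude)
# resolves to the table entry ta27, with no separate look-up file needed.
print(resolve_from_table(table, "ta",
                         {"time1", "plev27", "latitude", "longitude"}))  # -> ta27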
One issue when it comes to validating this feature is the lack of files from ESGF that have an out name that differs from their true variable name. Going through the out names list and searching for variables on esgf-node.llnl.gov, I've only found one dataset: CMIP6.CMIP.CNRM-CERFACS.CNRM-CM6-1.piControl.r1i1p1f2.Omon.ficeberg2d.gn
One odd thing about this dataset is that the variable name ficeberg2d is the same as the name in the file name, rather than ficeberg as out_names_tests.json would suggest.
Soon we should be moving to a slightly different way of uniquely naming variables, so that there won't be multiple "in names" in a table with the same "out name". In fact the names will be unique across all tables (although the variable may still be divided up and hosted by different tables). I'm not sure we should try to clean up things until that new approach has been agreed.