Feature Request -

Open ssmithClimate opened this issue 6 years ago • 9 comments

It would be useful to be able to add data to gcamdata without explicitly hard-coding every input file, as we need to do now. This would help, for example, where there is a large number of input files with different metadata (e.g., air pollution standards for new electric plants for various countries), or where we might want to re-generate GCAM XMLs for alternative data sources (e.g., HDD and CDD and/or ag crop yield changes from a specific climate model).

The idea is to have a specific folder for a particular type of data, where data files could be added and would then be picked up by the system without changing any code. We do this now in CEDS, and it's proved to be a very useful way to add specific data to the system without changing code. The system scans the specified "drop folder" when it runs, compiles all the relevant data, and incorporates it into the processing sequence. This also has the advantage that, if different pieces of data come from different sources, each data file can have its own metadata documenting the source of that specific part of the data.

We have two specific projects where this would be useful in the very near term. One is the updated vintage-based emissions controls for electric power plants introduced at the 1/13/19 GCAM modeling group meeting. A second is work we need to do over the next couple of months to incorporate air pollutant emissions into GCAM-USA. Having this concept in place would greatly simplify coding for both of these. Input files could then be separated by data source (and perhaps emission species) to keep the input modular, and could more easily be updated. For example, for the US, we could put in a file with the new source performance standards for new electric power plants (or other sectors) that are part of US regulations, and document the source of that. We could put in a different file with the new standards for China, and the source for that information. A third file might contain assumptions from GAINS for other regions/countries. That makes the system flexible: we can add new data anytime as it becomes available, and users could also easily tailor those assumptions for their specific studies.

Here are two potential ways we might do this.

I. One is to have a chunk that just processes the files in a specific drop folder and produces an output file (e.g. user_emissions_controls.csv) that is simply a compilation of the information in the individual files. That file would then be the specified input for the chunk that actually does the data processing from there. One issue here is how to trigger an XML re-build if new data is added.

II. Another approach could be to dig into the file dependency mechanisms in gcamdata and make it possible to specify a folder as an input. That would make it easier for gcamdata to know it should look for new files in that folder and trigger a re-build of any chunks with that folder as an input. This would also be cleaner, being a "one step" solution (thanks @bpbond for e-mail comments).
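
As a rough illustration of both options, here is a minimal sketch in Python (not gcamdata's R, and not an existing gcamdata API). The function names `compile_drop_folder` and `folder_fingerprint` and the added `source_file` column are hypothetical. The first function compiles every CSV in a drop folder into one file, as in (I); the second computes a fingerprint of the folder that changes whenever a file is added or edited, which is one possible way to decide that a re-build is needed, as in (II). It assumes all files in a folder share the same header.

```python
import csv
import hashlib
from pathlib import Path


def compile_drop_folder(folder, out_path):
    """Row-bind every CSV in `folder` into `out_path`, tagging rows with their source file."""
    files = sorted(Path(folder).glob("*.csv"))
    header = None
    rows = []
    for path in files:
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle)
            if header is None:
                header = reader.fieldnames
            elif reader.fieldnames != header:
                raise ValueError(f"{path.name} does not match the header of {files[0].name}")
            for row in reader:
                # Hypothetical extra column recording where each row came from.
                row["source_file"] = path.name
                rows.append(row)
    if header is None:
        raise ValueError(f"no CSV data found in {folder}")
    with Path(out_path).open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(header) + ["source_file"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def folder_fingerprint(folder):
    """Hash the names and contents of all CSVs, so any added or edited file changes the result."""
    digest = hashlib.sha256()
    for path in sorted(Path(folder).glob("*.csv")):
        digest.update(path.name.encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

The compiled file should be written outside the drop folder, otherwise it would itself be picked up on the next scan. A build step could store the fingerprint from the last run and re-run the dependent chunks only when it differs.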

Comments? @pkyle @pralitp @brinday

ssmithClimate avatar May 17 '19 16:05 ssmithClimate

(I have a bad habit of accidentally hitting return too early...)

ssmithClimate avatar May 17 '19 16:05 ssmithClimate

Re (II):

Under the hood, in find_csv_file() the code uses system.file() to find the file a chunk has requested; this can handle requests for folders as well, so no problem there.

We'd need to refactor load_csv_files() a bit (specifically lines 66-80) to check (dir.exists()) whether the object being requested is a file or folder, and if the latter, get all csv files within it, adding to the object metadata at each step.

We're assuming that all files in a folder will have the same structure and so can be sensibly row-bound?
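
To make the file-or-folder logic concrete, here is a minimal sketch in Python rather than R. It is not the actual load_csv_files() code; `read_one` and `load_input` are hypothetical names, and it assumes each file carries its metadata in leading `#` comment lines. If the requested path is a folder, every CSV in it is read and row-bound, with each file's metadata kept under its own name; files with a different set of columns are rejected.

```python
import csv
from pathlib import Path


def read_one(path):
    """Return (metadata_lines, rows) for a CSV whose leading '#' lines hold metadata."""
    with Path(path).open(newline="") as handle:
        lines = handle.read().splitlines()
    metadata = [line[1:].strip() for line in lines if line.startswith("#")]
    body = [line for line in lines if line and not line.startswith("#")]
    return metadata, list(csv.DictReader(body))


def load_input(path):
    """Load a single CSV, or row-bind every CSV in a folder, keeping per-file metadata."""
    path = Path(path)
    files = sorted(path.glob("*.csv")) if path.is_dir() else [path]
    rows, metadata = [], {}
    for file in files:
        file_metadata, file_rows = read_one(file)
        # Row-binding only makes sense if every file has the same columns.
        if rows and file_rows and file_rows[0].keys() != rows[0].keys():
            raise ValueError(f"{file.name} has different columns from earlier files")
        metadata[file.name] = file_metadata
        rows.extend(file_rows)
    return rows, metadata
```

A chunk would then see the same combined rows whether it asked for one file or for a folder, and the metadata dictionary records which source each piece came from.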

bpbond avatar May 17 '19 16:05 bpbond

A quick response: Yes, we have been aware of the need for some new capabilities to better facilitate generating / running sensitivities: #1080

Generally, I don't think we want user sensitivities to be generated inside of gcamdata, because it is harder to track what are just the Core assumptions vs. your own. (The exception is a capability that needs to be added to the Core, although even then, just creating every permutation is increasingly infeasible.) Instead, we would want users to "shim" (#12) their new data into the processing from outside of the package. Ideally we could also come up with some way of identifying the alternative scenarios and how they were generated, perhaps going as far as having gcamdata produce the GCAM configuration file itself.

pralitp avatar May 17 '19 16:05 pralitp

Requiring that files have the same structure is reasonable. If we have cases where the file structure might be slightly different, there could just be different folders. For example, we have some emission controls that are specified relative to GDP/capita, and others that are specified by year. Those could be two different drop folders named appropriately: pollutant-em-control-gdp and pollutant-em-control-year.

ssmithClimate avatar May 17 '19 17:05 ssmithClimate

Note that in the applications I mentioned above these would be core model inputs.

ssmithClimate avatar May 17 '19 17:05 ssmithClimate

I responded above to this from a very narrow technical viewpoint, just thinking about our previous email exchange. But the more I read back through this post, the more complicated and less feasible it seems. Aside, possibly, from the narrow use case of a folder of structurally identical files that are periodically updated by dropping in a new file, the capabilities described above would greatly complicate the gcamdata package, and run counter to many of its fundamental design principles (e.g. explicitly specified inputs are the norm).

@pralitp's note about the 'shim' capability seems spot on to me; I also agree that we need better flexibility in a number of areas.

So, if Pralit and I are misunderstanding something, Steve, absolutely let's chat in person. But otherwise it's tough to see this capability (particularly in the more expansive form you outline above) being a priority for gcamdata.

bpbond avatar May 17 '19 18:05 bpbond

Pralit and I talked. I'll put together some example input files to help clarify what might be needed.

ssmithClimate avatar May 17 '19 19:05 ssmithClimate

Attached are some example files. These are just for one species (SO2) and one sector, but hopefully they help illustrate the capability that would be useful.

GCAMdata_input_Examples.zip

ssmithClimate avatar May 20 '19 22:05 ssmithClimate

This seems related to something I discussed with Ben at the 2018 GCAM workshop regarding automating modifications to the data system.

It would be great to be able to specify a location for modified source files to be used in place of the standard ones (e.g., same structure as the reference system but sparsely populated), and a destination for the generated XML directory. I could then generate all the XML into a new location without having multiple copies of all the input files, and could easily manage just the modified files in my project repo.

An additional improvement would be to be able to build just the subset of the XML dependent on a given set of CSV and/or R files, i.e., those found in the "modified sources" directory.
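
A minimal sketch of what I mean, in Python and purely illustrative (nothing here is an existing gcamdata feature; `resolve_input` and `modified_inputs` are hypothetical names). Each input is looked up in the sparse "modified sources" tree first and falls back to the reference tree; the list of overridden files is what a partial build would use to work out which XML to regenerate.

```python
from pathlib import Path


def resolve_input(relative_path, modified_dir, reference_dir):
    """Prefer a file in the sparse modified tree; otherwise fall back to the reference tree."""
    for root in (modified_dir, reference_dir):
        candidate = Path(root) / relative_path
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(relative_path)


def modified_inputs(modified_dir):
    """List the relative paths of the CSV inputs that the modified tree overrides."""
    root = Path(modified_dir)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.csv"))
```

Because the modified tree mirrors the reference layout, only the changed files need to live in the project repo.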

rjplevin avatar Jan 20 '20 22:01 rjplevin