
Support for deserialization of file types into Array, DataFrame

Open sdrobert opened this issue 4 years ago • 7 comments

This PR follows #382, #470 as well as discussion with @jbednar and @jlstevens.

param currently models deserialization routines after serialization routines. This is fine if you're only ever deserializing something that has previously been automatically serialized. However, my use case (and I believe that of plenty of others) focuses on reading in hand-written config files for e.g. specifying neural network parameters. In such cases, writing a Series, Array, or DataFrame could involve dumping a giant list of numbers by hand into the configuration file.

A better solution would be to specify a path to a data file in the configuration, which, when deserialized, is quietly parsed into the value. The reference to the file can be thrown away and just the value stored in the parameter. The type of file, and thus the routine for parsing it, is inferred from the file extension. In the deserialize() method of the relevant Parameter classes, before interpreting the value as arguments to a constructor (either an ndarray or a DataFrame), we first check whether it matches a file on disk, then whether its extension is one we recognize, and finally, if Numpy or Pandas has a routine to read that extension, we call it.
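The dispatch described above might be sketched roughly like this; note that the names ARRAY_READERS and deserialize_array are illustrative stand-ins, not the PR's actual internals:

```python
import os

import numpy as np

# Hypothetical mapping from recognized extensions to reader routines.
ARRAY_READERS = {
    ".npy": np.load,
    ".txt": np.loadtxt,
}

def deserialize_array(value):
    """If `value` names an existing file with a recognized extension,
    read the array from disk; otherwise treat `value` as literal data."""
    if isinstance(value, str) and os.path.isfile(value):
        ext = os.path.splitext(value)[1].lower()
        reader = ARRAY_READERS.get(ext)
        if reader is not None:
            return reader(value)
    return np.asarray(value)
```

A literal list of numbers still works as before; only strings that name existing files with known extensions take the new path.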

This is not a bulletproof solution - some files will have misleading or no extensions, or might require non-standard arguments to parse - but it should yield correct results in most situations. The user can always overload or subclass the relevant Parameters if she needs a special method of deserialization.
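As a rough illustration of that subclassing escape hatch (the class names below are hypothetical stand-ins, not param's API, and the base class is a minimal stub rather than a real Parameter):

```python
import pandas as pd

class DataFrameParam:
    """Hypothetical stand-in for a DataFrame parameter with the
    extension-based deserialization described above."""
    @classmethod
    def deserialize(cls, value):
        if isinstance(value, str) and value.endswith(".csv"):
            return pd.read_csv(value)
        return pd.DataFrame(value)

class SemicolonCSV(DataFrameParam):
    """A user subclass handling .csv files that are actually
    semicolon-delimited, which default parsing would misread."""
    @classmethod
    def deserialize(cls, value):
        if isinstance(value, str) and value.endswith(".csv"):
            return pd.read_csv(value, sep=";")
        return super().deserialize(value)
```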

I have so far handled the easiest file types: those with both read and write routines in either numpy or pandas. Some require extra dependencies; where those dependencies were easy to install, I added guards in the test file and added the dependencies to the test environment in tox.ini. For numpy, these are

  • .npy (Numpy archive)
  • .txt[.gz] (Numpy text file)

For pandas, these are

  • .csv (comma-separated)
  • .dta (stata)
  • .feather
  • .json
  • .ods (OpenOffice sheet)
  • .parquet
  • .pkl (pickle)
  • .tsv (tab-separated)
  • .xls{m,x} (Excel sheet)
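One way to picture the pandas side is an extension-to-reader table (illustrative only; the PR's actual structure may differ, and several of these readers need optional dependencies such as pyarrow or odfpy):

```python
import pandas as pd

# Hypothetical mapping from the extensions above to pandas readers.
PANDAS_READERS = {
    ".csv": pd.read_csv,
    ".dta": pd.read_stata,
    ".feather": pd.read_feather,
    ".json": pd.read_json,
    ".ods": pd.read_excel,
    ".parquet": pd.read_parquet,
    ".pkl": pd.read_pickle,
    ".tsv": lambda path, **kw: pd.read_csv(path, sep="\t", **kw),
    ".xlsm": pd.read_excel,
    ".xlsx": pd.read_excel,
}
```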

Pandas has a lot more I/O routines that are trivial to add but harder to test. The two glaring absences are:

  • .hdf5
  • .xls (Pre-2007 Excel)

Pandas does have a write routine for ".xls", but its backend has been deprecated. HDF5 support relies on PyTables, which installs easily via Conda but ships only a source distribution on PyPI; building that source distribution requires access to the HDF5 headers.

I hope this is a good starting point for this functionality.

Thank you for your time, Sean

sdrobert avatar May 28 '21 22:05 sdrobert

Nice! I like this a lot. Are you also anticipating implementing the other direction and adding serialization?

philippjfr avatar May 29 '21 21:05 philippjfr

@philippjfr In my use case there isn't a need to serialize back to file. @jbednar and @jlstevens bandied the idea back and forth and I think came to the conclusion that, whereas deserialization can be performed transparently from file to value, serialization would need some mechanism to tether the Parameter value to a specific file/encoding. This PR as-is makes no changes to param's API (except perhaps removing the ability to deserialize an existing path as an array of characters - which was probably not intended in the first place).

sdrobert avatar May 30 '21 22:05 sdrobert

Right; at this point it only covers deserialization, and fully supporting transparent roundtripping (ensuring the file type doesn't change in the process) sounds complicated. Still, I think we'll be able to review and merge this and add serialization later. I'd guess the filetype won't be preserved, e.g. a .csv might turn into .parquet, which is what Intake does for caching, but that seems ok to me.

jbednar avatar May 31 '21 04:05 jbednar

Looks great!

This is not a bulletproof solution - some files will have misleading or no extensions, or might require non-standard arguments to parse - but it should yield correct results in most situations.

I realize this is still WIP but I've made one comment that I think would help users when the file fails to load: essentially, it would be nice to state which extensions are supported for that type.

jlstevens avatar May 31 '21 09:05 jlstevens

Thanks for the review jlstevens. Letting users know what file types are supported is a really good idea.

I also wanted to mention a quiet bug that I'm glossing over right now in case you want me to handle it differently. Python 2.7 supports only up to Pandas 0.24. In that version, pandas.read_excel did not support .ods files. I am currently just skipping the .ods test for Python 2.7 and the package will incorrectly report the ability to handle .ods files. A more correct solution would involve checking the Pandas version and excluding the file type appropriately, or nixing the type altogether. That said, it's Python 2.7. I'm sure there are also minimum version requirements to Pandas (pre 1.0) and Numpy that I've overlooked as well.
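A version gate along these lines could make the reported support accurate (a sketch; supported_pandas_extensions is a made-up name, and the 0.25 cutoff reflects when read_excel gained ODS support):

```python
import pandas as pd

def supported_pandas_extensions():
    """Report supported extensions, dropping .ods on pandas
    versions that predate read_excel's ODS support (0.25)."""
    exts = {".csv", ".tsv", ".json", ".pkl", ".ods"}
    major, minor = (int(part) for part in pd.__version__.split(".")[:2])
    if (major, minor) < (0, 25):
        exts.discard(".ods")
    return exts
```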

sdrobert avatar May 31 '21 16:05 sdrobert

I am currently just skipping the .ods test for Python 2.7 and the package will incorrectly report the ability to handle .ods files.

I think we can just mention this in the release notes. Even though param will probably support 2.7 for a while, many of our downstream projects are now switching to Python 3. At this point, it isn't critical if there are a few holes in the Python 2 support.

jlstevens avatar Jun 02 '21 19:06 jlstevens

I think I've addressed all the comments above that I wasn't blocked on. I'm waiting on a consensus about the global list of extensions and the Python 3.6/Pandas 1.1 stuff.

sdrobert avatar Jun 09 '21 22:06 sdrobert

@jlstevens do you think you'll have time to review this for 2.0 or prefer to postpone that?

maximlt avatar Apr 05 '23 10:04 maximlt

I think this probably should go in param 2.0 but it would be good to have a little time to test this change as well.

If you could fix the merge conflict, I'll do a quick review then merge.

jlstevens avatar Apr 05 '23 10:04 jlstevens

@jlstevens I fixed the merge conflict.

maximlt avatar Apr 16 '23 20:04 maximlt

@maximlt Sorry for the trouble! It's been a busy few weeks.

sdrobert avatar Apr 19 '23 21:04 sdrobert

@maximlt Sorry for the trouble! It's been a busy few weeks.

Oh no worries :) We're pushing to get Param 2.0 out, hence the recent movement on this PR and others.

maximlt avatar Apr 21 '23 08:04 maximlt

@jlstevens fixed the conflicts that were introduced recently after a few big merged PRs.

maximlt avatar May 04 '23 00:05 maximlt

@sdrobert , even after all this time, I'm still missing a key bit of the intended use case and motivation, because there still aren't any examples of actual usage. As best I can tell, since there is no serialization implemented, what this PR will address is someone who writes their own JSON file and wants to specify a filename rather than the actual contents of the DataFrame or Array. Can you give us any example of an actual JSON file that would be used in this way? I consider JSON to be a read-only format, and would never edit it by hand since that just leads to file-format errors, but I understand that editing JSON can be feasible for some people in an editor like VSCode that has better support than what I use. Is that really the intended use case? Directly authoring JSON? If so we need to include an example in the docs of doing that, or no one will ever use this functionality.

jbednar avatar May 16 '23 22:05 jbednar

@jbednar, to answer your immediate question, my use case has always been machine learning. You can find an example here in my supplementary library to param (N.B. this is not a plug; I would rather all the functionality exist in param so I could sunset my library). This example does not contain any arrays or data frames, but may easily and plausibly be augmented to include one, e.g.

{
  "training": {
    "lr": 1e-05,
    "max_epochs": 10,
    "model_regex": "model-{epoch:05d}.pkl"
  },
  "model": {
    "activations": "relu",
    "layers": [
      "conv",
      "conv",
      "fc"
    ],
    "mean": "mean.npy",
    "std": "std.npy"
  }
}

Here, mean.npy and std.npy point to files where, e.g., feature means and standard deviations reside. When deserializing from file, the mean and std params are populated with the contents of those files. This configuration can be written by hand; any other method of specifying it would be tedious or impossible.
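Concretely, the intended flow might look like this (a sketch, assuming the config above is saved as config.json alongside mean.npy and std.npy; load_model_config is a hypothetical name, not part of param):

```python
import json

import numpy as np

def load_model_config(path):
    """Read a JSON config, transparently replacing .npy file
    references with the arrays they point to."""
    with open(path) as f:
        cfg = json.load(f)
    model = cfg["model"]
    for key, value in model.items():
        if isinstance(value, str) and value.endswith(".npy"):
            model[key] = np.load(value)  # resolved relative to cwd
    return cfg
```

After loading, cfg["model"]["mean"] is an ndarray rather than a filename, which is exactly the transparency this PR aims for.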

DataFrame parameters could provide an easy means of dynamically specifying training sets for smaller ML tasks, e.g. for scikit-learn routines. They could similarly be used to script visualization routines for, e.g., Seaborn or possibly HoloViz, avoiding notebooks.

I'm not really sure how to answer the other question about modifying JSON by hand. I agree that JSON is unwieldy, which is why I have also implemented YAML (de)serialization in my repo and am willing to make a PR for it here. Based on the framing of the question, however, it seems the JSON (de)serialization mechanism in param may be intended solely for machine consumption? There doesn't appear to be a standard means of getting parameters into a Parameterized instance beyond manipulating them programmatically after instantiation. This is a broader question than that of just arrays and data frames. If the team doesn't see much value in ingesting hand-written configurations overall, this PR isn't going to do much, and I'm sorry for the noise in that case.

sdrobert avatar May 16 '23 23:05 sdrobert

That's perfect, thanks! Indeed, we do see much use for a declarative YAML spec, and now it makes sense. This PR is only half of what's needed, which is what was confusing me! Ok, now we can move forward. Thanks.

jbednar avatar May 17 '23 02:05 jbednar

Given that you've reviewed this extensively @jbednar, I'll let you decide whether you want to hit merge.

maximlt avatar May 25 '23 20:05 maximlt