Enhance update_platform to support more use cases and produce more consistent results
The use case that drove the development and testing of update_platform was a single set of Saildrone data files from the Pacific Hake survey, with one EK80 raw file with no Platform data at all and an associated external data file.
I illustrate and document some of the issues in this notebook.
General comments about the current implementation of update_platform
- It fails with the use case of a single fixed location created as an xarray Dataset on the fly with no other variables
- It handles
latitude,longitude,pitch,roll,vertical_offsetandwater_level - It currently adds a
historyattribute only totime1.history. Here's an example: "2022-06-25 21:32:53.788634 +00:00. Added from external platform data". I think this same attribute string should also be added to all variables inserted byupdate_platform -
time1is used for all inserted variables. It is the only dimension assigned to these variables. That pattern predates echopype 0.6.0, and for variables other thanlatitudeandlongitudethe resulting Platform group ends up being inconsistent with a Platform group generated usingopen_rawwhere the platform data are found in the raw file. Specifically, echopype 0.6.0 specifies the following dimensionality:pitch: (channel, time2) roll: (channel, time2), vertical_offset: (channel, time2) water_level: (channel, time3) - It sets all missing target variables to nan except
water_level, which is set to 0. I didn't find any valid reason for thewater_levelassignment of 0, and I believe the only reason was to prevent the failure oftests/echodata/test_echodata.py::test_update_platformas currently implemented.
Some open questions about the fixed-location use case, such as from a mooring
- What should the timestamp on the location be? eg, the first or last
ping_timevalue? When using a single timestamp (time1will have a length of 1), the assumption is that it applies to the entire dataset (all ping times). - Would it be clearer to replicate the fixed location across more than one timestamp? For example, the first and last
ping_time?
Issues unrelated to update_platform per se
Comparing the outcome of update_platform for AZFP and EK60 raw files that don't have the variables handled by update_platform but do have other Platform variables exposed some remaining differences in how variables with empty values are being handled by the different set_groups_* classes. These differences are illustrated in my sample notebook, and they involve issues such as:
- Whether a variable is created when empty (eg,
latitudeandlongitudeare not created inset_groups_azfp, but they are inset_groups_ek60andset_groups_ek80) - Default values of 0 or
nan - Assigned dimensions
Also, set_groups_ek60 does not add the 3 empty global platform_* attributes, whereas set_groups_ek80 and set_groups_azfp do.
Finally, set_groups_azfp is not assigning any attributes to the AZFP-specific variables tilt_x and tilt_y .
See PR #741, which addressed a few of the tasks from this issue. In particular, the simple fixed-location use case where a single lat-lon coordinate is passed to update_platform. The choice of which timestamp to associate with it was left to the user.
Some open questions about the fixed-location use case, such as from a mooring
- What should the timestamp on the location be? eg, the first or last
ping_timevalue? When using a single timestamp (time1will have a length of 1), the assumption is that it applies to the entire dataset (all ping times).- Would it be clearer to replicate the fixed location across more than one timestamp? For example, the first and last
ping_time?
I think it will be clearer to use just the first ping_time for this. I am thinking of the case when we combine multiple files, say 10, seeing 10 identical lat/lon with the time corresponding to the first ping_time of each file seems more intuitive. We can perhaps add in the comment that the lat/lon are added through update_platform and use the first ping_time as the timestamp.
I think it will be clearer to use just the first
ping_timefor this. I am thinking of the case when we combine multiple files, say 10, seeing 10 identical lat/lon with the time corresponding to the firstping_timeof each file seems more intuitive.
I hadn't thought of that case, and it's a good point. Right now this decision (whether to use a single point with the first ping_time or two points) is external to update_platform, but I'm thinking that we could build in -- now or down the road -- a special handling of the fixed-location case such that the user could pass a single lat,lon coordinate; then update_platform would implement the assumption (eg, single point timestamped to the first ping_time).
We can perhaps add in the
commentthat the lat/lon are added through update_platform and use the first ping_time as the timestamp.
In this issue I've included a suggestion to add the history attribute to every variable that's updated (inserted/rewritten) by update_platform. That would take care of the first part of your suggestion. The second part, the use of the first ping_time as the timestamp, could be added there too (for simplicity); or as a comment attribute.
Overall, I like your suggestions and reasoning.
For version 0.6.1 (short time frame), I think we can focus largely on the fixed-location use case and the issues and clean ups I've already addressed in PR #741. We can leave the other comments I've raised here to 0.6.2.
Just wanted to note that this is related to #769.
Now #1009 supersede #769.
I've gone over this issue, related PR's, and the outcome of update_platform in an example. I think everything here has been addressed. The only exception is the issue now tracked in #1009; we'll discuss and track its progress there and can close this issue.
There is one usability improvement mentioned here that is not implemented:
I think it will be clearer to use just the first
ping_timefor this. I am thinking of the case when we combine multiple files, say 10, seeing 10 identical lat/lon with the time corresponding to the firstping_timeof each file seems more intuitive.I hadn't thought of that case, and it's a good point. Right now this decision (whether to use a single point with the first ping_time or two points) is external to
update_platform, but I'm thinking that we could build in -- now or down the road -- a special handling of the fixed-location case such that the user could pass a single lat,lon coordinate; thenupdate_platformwould implement the assumption (eg, single point timestamped to the first ping_time).
This would be nice, but let's create a new issue solely for it. We can then address it in a later release.
Now that I've opened #1126, we close this issue.