echopype icon indicating copy to clipboard operation
echopype copied to clipboard

Enhance update_platform to support more use cases and produce more consistent results

Open emiliom opened this issue 3 years ago • 4 comments

The use case that drove the development and testing of update_platform was a single set of Saildrone data files from the Pacific Hake survey, with one EK80 raw file with no Platform data at all and an associated external data file.

I illustrate and document some of the issues in this notebook.

General comments about the current implementation of update_platform

  • It fails with the use case of a single fixed location created as an xarray Dataset on the fly with no other variables
  • It handles latitude, longitude, pitch, roll, vertical_offset and water_level
  • It currently adds a history attribute only to time1.history. Here's an example: "2022-06-25 21:32:53.788634 +00:00. Added from external platform data". I think this same attribute string should also be added to all variables inserted by update_platform
  • time1 is used for all inserted variables. It is the only dimension assigned to these variables. That pattern predates echopype 0.6.0, and for variables other than latitude and longitude the resulting Platform group ends up being inconsistent with a Platform group generated using open_raw where the platform data are found in the raw file. Specifically, echopype 0.6.0 specifies the following dimensionality:
    pitch: (channel, time2)
    roll: (channel, time2),
    vertical_offset: (channel, time2)
    water_level: (channel, time3)
    
  • It sets all missing target variables to nan except water_level, which is set to 0. I didn't find any valid reason for the water_level assignment of 0, and I believe the only reason was to prevent the failure of tests/echodata/test_echodata.py::test_update_platform as currently implemented.

Some open questions about the fixed-location use case, such as from a mooring

  • What should the timestamp on the location be? eg, the first or last ping_time value? When using a single timestamp (time1 will have a length of 1), the assumption is that it applies to the entire dataset (all ping times).
  • Would it be clearer to replicate the fixed location across more than one timestamp? For example, the first and last ping_time?

Issues unrelated to update_platform per se

Comparing the outcome of update_platform for AZFP and EK60 raw files that don't have the variables handled by update_platform but do have other Platform variables exposed some remaining differences in how variables with empty values are being handled by the different set_groups_* classes. These differences are illustrated in my sample notebook, and they involve issues such as:

  • Whether a variable is created when empty (eg, latitude and longitude are not created in set_groups_azfp, but they are in set_groups_ek60 and set_groups_ek80)
  • Default values of 0 or nan
  • Assigned dimensions

Also, set_groups_ek60 does not add the 3 empty global platform_* attributes, whereas set_groups_ek80 and set_groups_azfp do.

Finally, set_groups_azfp is not assigning any attributes to the AZFP-specific variables tilt_x and tilt_y .

emiliom avatar Jun 25 '22 21:06 emiliom

See PR #741, which addressed a few of the tasks from this issue. In particular, the simple fixed-location use case where a single lat-lon coordinate is passed to update_platform. The choice of which timestamp to associate with it was left to the user.

emiliom avatar Jun 26 '22 20:06 emiliom

Some open questions about the fixed-location use case, such as from a mooring

  • What should the timestamp on the location be? eg, the first or last ping_time value? When using a single timestamp (time1 will have a length of 1), the assumption is that it applies to the entire dataset (all ping times).
  • Would it be clearer to replicate the fixed location across more than one timestamp? For example, the first and last ping_time?

I think it will be clearer to use just the first ping_time for this. I am thinking of the case when we combine multiple files, say 10, seeing 10 identical lat/lon with the time corresponding to the first ping_time of each file seems more intuitive. We can perhaps add in the comment that the lat/lon are added through update_platform and use the first ping_time as the timestamp.

leewujung avatar Jun 29 '22 15:06 leewujung

I think it will be clearer to use just the first ping_time for this. I am thinking of the case when we combine multiple files, say 10, seeing 10 identical lat/lon with the time corresponding to the first ping_time of each file seems more intuitive.

I hadn't thought of that case, and it's a good point. Right now this decision (whether to use a single point with the first ping_time or two points) is external to update_platform, but I'm thinking that we could build in -- now or down the road -- a special handling of the fixed-location case such that the user could pass a single lat,lon coordinate; then update_platform would implement the assumption (eg, single point timestamped to the first ping_time).

We can perhaps add in the comment that the lat/lon are added through update_platform and use the first ping_time as the timestamp.

In this issue I've included a suggestion to add the history attribute to every variable that's updated (inserted/rewritten) by update_platform. That would take care of the first part of your suggestion. The second part, the use of the first ping_time as the timestamp, could be added there too (for simplicity); or as a comment attribute.

Overall, I like your suggestions and reasoning.

emiliom avatar Jun 29 '22 16:06 emiliom

For version 0.6.1 (short time frame), I think we can focus largely on the fixed-location use case and the issues and clean ups I've already addressed in PR #741. We can leave the other comments I've raised here to 0.6.2.

emiliom avatar Jun 29 '22 16:06 emiliom

Just wanted to note that this is related to #769.

leewujung avatar Mar 09 '23 05:03 leewujung

Now #1009 supersede #769.

leewujung avatar May 17 '23 23:05 leewujung

I've gone over this issue, related PR's, and the outcome of update_platform in an example. I think everything here has been addressed. The only exception is the issue now tracked in #1009; we'll discuss and track its progress there and can close this issue.

There is one usability improvement mentioned here that is not implemented:

I think it will be clearer to use just the first ping_time for this. I am thinking of the case when we combine multiple files, say 10, seeing 10 identical lat/lon with the time corresponding to the first ping_time of each file seems more intuitive.

I hadn't thought of that case, and it's a good point. Right now this decision (whether to use a single point with the first ping_time or two points) is external to update_platform, but I'm thinking that we could build in -- now or down the road -- a special handling of the fixed-location case such that the user could pass a single lat,lon coordinate; then update_platform would implement the assumption (eg, single point timestamped to the first ping_time).

This would be nice, but let's create a new issue solely for it. We can then address it in a later release.

emiliom avatar Aug 03 '23 22:08 emiliom

Now that I've opened #1126, we close this issue.

emiliom avatar Aug 16 '23 23:08 emiliom