[dataset]: SE US MBON FL Keys Zooplankton Abundance Data Review

Open sebastiandig opened this issue 3 years ago • 19 comments

Dataset Name

MBON Florida Keys National Marine Sanctuary Zooplankton

Link to DwC Data Files

https://github.com/USF-IMARS/zoo-taxonomy-to-darwin-core/tree/master/data

Scripts Link

https://github.com/sebastiandig/obis_zooplankton_setup

Link to "raw" Data Files

No response

Describe your dataset and any specific requests.

Hi IOOS,

I have a zooplankton dataset that is ~80% of the way converted to DarwinCore. This dataset was sampled at 3 sites in the Florida Keys National Marine Sanctuary and 2 sites along a transect from a river mouth in the Everglades. The cruises run from mid 2017 to late 2020. The sites along the Florida Keys usually have 3 mesh sizes, 64 um, 200 um and 500 um, while the Everglades sites only have 64 um. Most of the analyzed data is for the 200 um and 500 um mesh. More data is being logged, but is not finalized yet. I have some scripts to ingest all the new data and add it to the current DarwinCore formatted data (linked in form above).

I am seeking approval/suggestions on the format of the event, occurrence and MoF files (linked in form above).

I would like feedback on:

  1. The column headers names
  2. Columns to add/subtract
  3. Order of columns
  4. Anything else to do before official submission

Best, Sebastian D.

sebastiandig avatar Nov 23 '22 18:11 sebastiandig

Sebastian great start on this dataset. Here are some things I found that need to be addressed:

  1. catalogNumber belongs in the occurrence table and it should correspond to the label on the vial that's being stored. It was hard for me to assess what this number is since it's just one value in the event table. Also basisOfRecord, recordedBy, and recordedByID should be dropped from the event table and only included in the occurrence table. There should be no overlap between fields in the event table and fields in the occurrence table.
  2. You should drop parentEventID unless you will have information that will only be included at the parent level and not repeated in the child events. In this example you only have one event so it should not have a parentEventID. If you were planning to have a parent event and child event relationship in this table then in this example you would need to add another row that would be the parent event and it would not have anything in parentEventID and the eventID = "IMaRS_MBON_zooplankton". Overall I find parent and child events to make things pretty confusing so it would probably be easiest to just drop parentEventID.
  3. Since basisOfRecord = "PreservedSpecimen" the institutionCode should match to the Global Registry of Scientific Collections and you should also include collectionCode. I can help with adding an institution and/or collection to GRSciColl if needed.
  4. datasetName looks like a project name to me. I feel like you need to add something about zooplankton and depending on what's currently in this dataset and plans for the future for this dataset you might want to add place or other defining bits that make this dataset unique compared to other FL MBON datasets.
  5. I am puzzled by informationWithheld because this is not the type of information that is usually included there and especially because it's followed by a person's name and ID in recordedBy and recordedByID. I would suggest to drop.
  6. There is no term eventDateTime; eventDate is the place to record date and time together. I don't normally include eventTime since it's already part of eventDate. Also, is the event time that you have in UTC? I can't tell based on what you've provided. If not, you need to add the time zone to eventDate or convert to UTC.
  7. occurrenceID is not unique for each row in the data. There are 34 unique occurrenceIDs but 39 rows.
  8. I'm not sure what information you are providing in taxonID. I would recommend to drop since you have scientificNameID unless you have a strong use case to keep it.
  9. individualCount should only include whole numbers and it's not clear what these numbers represent. Maybe it's a count but it's a calculated count and I think you have that in the extended measurement or fact table so you can drop that one from the occurrence table. I would still keep organismQuantity and organismQuantityID and also have it in the eMoF.
  10. measurementUnitID is not in the occurrence extension so you need to drop that. It's only in the MoF extensions.
  11. establishmentMeans has a controlled vocabulary and it looks like we are unsure so probably best to drop this unless you can be certain. Also if you do use establishmentMeans then you are also supposed to include degreeOfEstablishment and pathway.
  12. You should drop datasetID, eventDate and any other fields that are already in the event core.
  13. It's georeferenceVerificationStatus
  14. You should drop datasetID from the eMoF table, that term doesn't exist in that extension.
  15. Any measurements or facts that are about the occurrences need to have occurrenceID added. I think it's only "Abundance" and "Count" that would need to have occurrenceID added but if there are any others that relate specifically to the occurrence then you'll need to add it there too.
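Item 6 above can be sketched in a few lines of Python. This is only an illustration, assuming the logged times are naive local timestamps in a `YYYY-MM-DD HH:MM` layout and that the UTC offset at the time of sampling is known; the function name, the input format, and the offset of -4 hours are assumptions, not details taken from the dataset.

```python
from datetime import datetime, timedelta, timezone


def to_utc_event_date(local_timestamp: str, utc_offset_hours: float) -> str:
    """Convert a naive local timestamp into an ISO 8601 UTC string for eventDate.

    local_timestamp is assumed to look like "2017-06-20 09:30" (hypothetical format).
    utc_offset_hours is the local offset from UTC at sampling time, e.g. -4.
    """
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    # Attach the known offset to the naive timestamp, then shift to UTC.
    local_dt = datetime.strptime(local_timestamp, "%Y-%m-%d %H:%M").replace(tzinfo=local_zone)
    return local_dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Hypothetical sampling times, assumed to be recorded at UTC-4.
morning_tow = to_utc_event_date("2017-06-20 09:30", -4)
late_tow = to_utc_event_date("2017-06-20 22:15", -4)
```

A fixed offset keeps the sketch dependency-free; if the offset changes with daylight saving time across cruises, a named time zone would be the safer choice.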

I wasn't able to fully review the event table since it was only one row. One thing to watch out for is that the eventID needs to be unique for each row in the event table. Terms that I would suggest adding are coordinateUncertaintyInMeters, minimumDepthInMeters, and maximumDepthInMeters.
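The uniqueness requirement for eventID (and for occurrenceID in item 7) can be checked before submission. The following is a minimal sketch, assuming each table has been read into a list of dictionaries keyed by column name (for example with `csv.DictReader`); the helper name and the sample identifiers are hypothetical.

```python
from collections import Counter


def duplicated_ids(rows, id_field):
    """Return a dict of identifier -> row count for identifiers used more than once."""
    counts = Counter(row[id_field] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


# Hypothetical occurrence rows; "occ-2" is repeated on purpose.
occurrence_rows = [
    {"occurrenceID": "occ-1", "scientificName": "Copepoda"},
    {"occurrenceID": "occ-2", "scientificName": "Chaetognatha"},
    {"occurrenceID": "occ-2", "scientificName": "Ostracoda"},
]
repeats = duplicated_ids(occurrence_rows, "occurrenceID")
```

An empty result means every row has its own identifier; the same call with `"eventID"` covers the event table.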

Great job! Abby

AbbyBenson avatar Nov 28 '22 23:11 AbbyBenson

Hi @albenson-usgs,

I finally found the time to edit the dataset. I was editing much of my code to fix some other issues, but your suggestions were faster to implement.

I am mostly done editing each of the sheets. @7yl4r and I went through your suggestions and made the changes. I have included 3 different events with their corresponding occurrence and MoF sheets here with file names example_update.csv.

@7yl4r and I have decided that it may be worthwhile to add a collectionCode since we may add more data as they become available. We think using institutionCode as USF should be okay. Would there be any advantage to having one specific to the USF College of Marine Science?

I have a couple more suggestions to implement such as coordinateUncertaintyInMeters, minimumDepthInMeters, and maximumDepthInMeters. I need to check other logs to find these. The only one I'm not certain of is coordinateUncertaintyInMeters, since I don't know how to determine it.

Other questions we have:

  1. With the catalogNumber, we have the vials of preserved specimens, but we don't currently have an official way of labeling them other than the cruise info such as cruise ID, location, and mesh size. Do you have recommendations for labeling these or should we not worry about it at the moment?
  2. individualCount was an average of 3 aliquots. Currently, we replaced with a total count of the 3 aliquots as addition. Does this work or should we keep it as an average and move it to the MoF only?
  3. datasetName has been updated to "MBON Florida Keys National Marine Sanctuary Zooplankton Net Tows (2017 - 2020)" which can have the years be dynamically updated with new data. Is this enough to be unique? The previous one that was submitted had the name "Time series of zooplankton abundance of the South Florida Program / Sanctuaries Marine Biodiversity Observation Network programs". Maybe a combination could work instead?
  4. Lastly, how should I order the columns of each sheet?
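The aliquot question in item 2 comes down to simple arithmetic: the sum of whole-number counts is always a whole number, while the average usually is not, which is why it conflicts with the whole-number rule for individualCount. A small sketch with made-up aliquot counts (the numbers and the helper name are hypothetical):

```python
def is_valid_individual_count(value) -> bool:
    """individualCount is expected to be a non-negative whole number."""
    return value >= 0 and float(value).is_integer()


# Hypothetical counts of one taxon in 3 aliquots of the same sample.
aliquot_counts = [12, 15, 14]

total_count = sum(aliquot_counts)                  # whole number
mean_count = total_count / len(aliquot_counts)     # generally fractional

total_ok = is_valid_individual_count(total_count)
mean_ok = is_valid_individual_count(mean_count)
```

Here the total passes the check and the mean does not, so a mean would need to live in a field that accepts fractions.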

Let me know what you think and I should be able to update this a bit faster.

Best, Sebastian D.

sebastiandig avatar Feb 27 '23 05:02 sebastiandig

Sebastian would you mind bringing these questions to the SMBD meeting next week (March 8 at 4 ET)? There's a couple of them that I think would benefit from discussion from the group if you're amenable to that?

AbbyBenson avatar Mar 02 '23 16:03 AbbyBenson

Sebastian is at sea this week; not sure if he will be back before the 8th so we may need to push back to the next SMBD meeting.

7yl4r avatar Mar 03 '23 17:03 7yl4r

Hi @albenson-usgs,

I will still be out at sea during that time, but we will be on our way back and close to shore. Depending on how much internet we have, I may join the meeting, but probably without my camera. If not, Tylar would be able to relay any information regarding this dataset.

Thanks for your continued help!

Best, Sebastian

sebastiandig avatar Mar 06 '23 22:03 sebastiandig

Hi @sebastiandig - sorry to take so long to get back to you on this. The data looks great! Just a few things to note and I'll follow that with answers to your questions.

  1. Should the information that's in recordedBy and recordedByID actually be in identifiedBy and identifiedByID? I missed this last time but I'm thinking this is the person that identified the specimens? Putting that info in identifiedBy terms means it will get picked up by Bionomia.
  2. eventID is required in the EMoF table. It's what connects the extension (EMoF) to the core (event).
  3. measurementType is missing for several measurements. I'm assuming you're working on those ones but just wanted to make a note.
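The link between the EMoF extension and the event core described in item 2 can be verified with a small referential check. This is a sketch only, assuming both tables are loaded as lists of dictionaries keyed by column name; the function name and the identifiers are hypothetical.

```python
def rows_missing_event(event_rows, emof_rows):
    """Return EMoF rows whose eventID is blank or not present in the event core."""
    known_event_ids = {row["eventID"] for row in event_rows}
    return [row for row in emof_rows if row.get("eventID", "") not in known_event_ids]


# Hypothetical tables: the second EMoF row has no eventID, the third points nowhere.
event_rows = [{"eventID": "cruise1-siteA-200um"}]
emof_rows = [
    {"eventID": "cruise1-siteA-200um", "measurementType": "Abundance"},
    {"eventID": "", "measurementType": "Count"},
    {"eventID": "cruise9-siteZ-64um", "measurementType": "Count"},
]
orphans = rows_missing_event(event_rows, emof_rows)
```

The same pattern, keyed on occurrenceID against the occurrence table, would cover measurements that describe a specific occurrence.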

Now answers to your questions.

Would there be any advantage to have one specific for USF College of Marine Science?

We might need to have a call to work through the details on this. There are already two instances for the University of South Florida: this one and this one. The second one is using the code you have in the data (USF). If you use that code your observations will be linked to that institution which is obviously only for the USF Herbarium (many records in GRSciColl came from the Index Herbariorum) at the moment. Realistically in my mind USF should have a generalized institution record and the herbarium should be its own collection and then your collection would be another separate collection. I'm not sure how easy this will all be to resolve. I can email the contact listed to try to work through this.

With the catalogNumber, we have the vials of preserved specimens, but we don't currently have an official way of labeling them other than the cruise info such as cruise ID, location, and mesh size. Do you have recommendations for labeling these or should we not worry about it at the moment?

I'm not sure what the best answer is here. I think the best thing would be to connect with a museum nearby to find out what they do. I would guess you want to label them sooner rather than later so the information is not forgotten/lost.

individualCount was an average of 3 aliquots. Currently, we replaced with a total count of the 3 aliquots as addition. Does this work or should we keep it as an average and move it to the MoF only?

We discussed this a bit at the last SMBD meeting and I don't think we came to consensus. Which one do you all use in your analyses? I would put that in organismQuantity and organismQuantityType and make sure to include the others in EMoF.

datasetName has been updated to "MBON Florida Keys National Marine Sanctuary Zooplankton Net Tows (2017 - 2020)" which can have the years be dynamically updated with new data. Is this enough to be unique? The previous one that was submitted had the name "Time series of zooplankton abundance of the South Florida Program / Sanctuaries Marine Biodiversity Observation Network programs". Maybe a combination could work instead?

This new one works well from my perspective but I would leave the years off since we would add more years in the future. I think "MBON Florida Keys National Marine Sanctuary Zooplankton Net Tows" is unique enough if that's an accurate description of the dataset.

Lastly, how should I order the columns of each sheet?

Order doesn't matter. The IPT will pick up the columns based on name no matter where they are in the spreadsheet.

AbbyBenson avatar Mar 21 '23 20:03 AbbyBenson

Hi Abby,

I think we're pretty close to completing this. Here is the link for the updated data set example for June 13, 2023. This is only 5 eventIDs. There are a total of 87, but they are formatted exactly the same.

  • [x] change recordedBy and recordedByID to identifiedBy and identifiedByID
  • [x] add eventID next to occurrenceID of the MoF
  • [x] datasetName made to match the GBIF collection name
  • [x] collectionCode and institutionCode updated and added
  • [ ] missing measurementType
  • [ ] catalogNumber, we removed it for now
  • [ ] individualCount, we left it as is because anything else would leave a fraction. For analysis, we tend to use density in individuals per cubic meter which was what I had originally, but doesn't seem to fit.

Best, Sebastian

sebastiandig avatar Jun 14 '23 01:06 sebastiandig

Everything looks great! No changes to suggest. Just one small clarification. You say "individualCount, we left it as is because anything else would leave a fraction. For analysis, we tend to use density in individuals per cubic meter which was what I had originally, but doesn't seem to fit." But I don't see individualCount in this dataset, only organismQuantity and organismQuantityType. You can put fractions in organismQuantity (or percentages or scales or categories) because it is not strict with the object class. It's just individualCount that is strict about that.

AbbyBenson avatar Jun 22 '23 19:06 AbbyBenson

Hi @albenson-usgs ,

Ah, okay that makes sense. I remember now that I changed individualCount to organismQuantity. Is individualCount a required field?

I think it might be better to change the current organismQuantityType to "individuals per cubic meter" since it makes more sense on our end to have a density. I will fix the values of organismQuantity as well. Would I still include this field in the eMoF or remove it?

Let me know if this makes sense and I will update this.

Best, Sebastian

sebastiandig avatar Jun 27 '23 20:06 sebastiandig

individualCount is not a required field :-) Your plan sounds like a good one to me! It's up to you if you want to include it in both places. OBIS recommends doing that so if you already have it documented in the eMoF, it's ok to leave it as you have it.
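The plan agreed here, a density in organismQuantity that is also documented in the eMoF, might look like the sketch below. The density formula (estimated individuals in the sample divided by volume of water filtered), the function name, the identifiers, and the unit wording are all assumptions for illustration; the thread does not state how the published values were calculated.

```python
def density_records(occurrence_id, event_id, individuals_in_sample, volume_filtered_m3):
    """Build matching occurrence fields and an eMoF row for a density value.

    Assumed formula: density = individuals in the whole sample / volume filtered (m^3).
    """
    density = individuals_in_sample / volume_filtered_m3
    occurrence_fields = {
        "occurrenceID": occurrence_id,
        "organismQuantity": density,  # fractions are acceptable in this field
        "organismQuantityType": "individuals per cubic meter",
    }
    emof_row = {
        "eventID": event_id,            # links the extension row to the event core
        "occurrenceID": occurrence_id,  # marks the measurement as occurrence-level
        "measurementType": "Abundance",
        "measurementValue": density,
        "measurementUnit": "individuals per cubic meter",
    }
    return occurrence_fields, emof_row


# Hypothetical values: 4100 individuals estimated in a tow that filtered 82 m^3.
occurrence_fields, emof_row = density_records("occ-1", "cruise1-siteA-200um", 4100, 82.0)
```

Generating both records from one function keeps the occurrence value and the eMoF value from drifting apart when the calculation is corrected later.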

AbbyBenson avatar Jun 27 '23 20:06 AbbyBenson

Hi @albenson-usgs and @7yl4r ,

Since the data is finally published, I think we can close this. Thanks for all your help.

Best, Sebastian

sebastiandig avatar Oct 12 '23 00:10 sebastiandig

That's great to hear!!!

Before closing, please include the links to the data on IPT, GBIF, OBIS, and the DOI.

Thanks!

MathewBiddle avatar Oct 12 '23 12:10 MathewBiddle

Good call Sebastian! Thanks for keeping on top of this.

  • OBIS-USA IPT: https://ipt-obis.gbif.us/resource?r=sfmbon_zooplankton
  • GBIF: https://www.gbif.org/dataset/ec0d2fe8-21b1-4ab1-8b91-67873e8ca912
  • DOI: https://doi.org/10.15468/buqg4u
  • OBIS: https://obis.org/dataset/afef5da2-614b-4208-aee6-c2413ed5ab76

AbbyBenson avatar Oct 12 '23 14:10 AbbyBenson

@sebastiandig has fixed an error with abundance calculations in this dataset and needs to update it in OBIS. @MathewBiddle : I think we need your help to push this to the IPT, correct?

event.csv occur.csv mof.csv

7yl4r avatar Sep 05 '25 21:09 7yl4r

Do you have log in credentials for the IPT https://ipt-obis.gbif.us/? If so, you should be able to upload the new data files and we can push out a new version. Did you add any additional columns or reorganize the data in any way?

MathewBiddle avatar Sep 08 '25 11:09 MathewBiddle

Yes, we have log in credentials. Ok, we will do that Matt, thank you!

cperaltab avatar Sep 18 '25 19:09 cperaltab

@MathewBiddle I was able to log in to the IPT but I don't have permission to edit the resource. We have the new version of the dataset ready to upload

cperaltab avatar Sep 29 '25 18:09 cperaltab

@cperaltab you should have permissions now. https://ipt-obis.gbif.us/manage/resource.do?r=sfmbon_zooplankton

Please check

MathewBiddle avatar Sep 30 '25 12:09 MathewBiddle

Thanks Matt, all good.

cperaltab avatar Sep 30 '25 22:09 cperaltab