[dataset]: SE US MBON FL Keys Zooplankton Abundance Data Review
Dataset Name
MBON Florida Keys National Marine Sanctuary Zooplankton
Link to DwC Data Files
https://github.com/USF-IMARS/zoo-taxonomy-to-darwin-core/tree/master/data
Scripts Link
https://github.com/sebastiandig/obis_zooplankton_setup
Link to "raw" Data Files
No response
Describe your dataset and any specific requests.
Hi IOOS,
I have a zooplankton dataset that is ~80% converted to DarwinCore. The dataset was sampled at 3 sites in the Florida Keys National Marine Sanctuary and 2 sites along a transect from a river mouth in the Everglades. The cruises run from mid-2017 to late 2020. The sites along the Florida Keys usually have 3 mesh sizes (64 µm, 200 µm, and 500 µm), while the Everglades sites only have 64 µm. Most of the analyzed data is for the 200 µm and 500 µm mesh. More data is being logged but is not yet finalized. I have some scripts to ingest all the new data and add it to the current DarwinCore-formatted data (linked in the form above).
I am seeking approval/suggestions on the format of the event, occurrence and MoF files (linked in form above).
I would like feedback on:
- The column headers names
- Columns to add/subtract
- Order of columns
- Anything else to do before official submission
Best, Sebastian D.
Sebastian, great start on this dataset. Here are some things I found that need to be addressed:
- `catalogNumber` belongs in the occurrence table and it should correspond to the label on the vial that's being stored. It was hard for me to assess what this number is since it's just one value in the event table. Also `basisOfRecord`, `recordedBy`, and `recordedByID` should be dropped from the event table and only included in the occurrence table. There should be no overlap between fields in the event table and fields in the occurrence table.
- You should drop `parentEventID` unless you will have information that will only be included at the parent level and not repeated in the child events. In this example you only have one event so it should not have a `parentEventID`. If you were planning to have a parent event and child event relationship in this table, then in this example you would need to add another row that would be the parent event, and it would not have anything in `parentEventID` and the `eventID` = "IMaRS_MBON_zooplankton". Overall I find parent and child events make things pretty confusing, so it would probably be easiest to just drop `parentEventID`.
- Since `basisOfRecord` = "PreservedSpecimen", the `institutionCode` should match the Global Registry of Scientific Collections and you should also include `collectionCode`. I can help with adding an institution and/or collection to GRSciColl if needed.
- `datasetName` looks like a project name to me. I feel like you need to add something about zooplankton, and depending on what's currently in this dataset and plans for the future for this dataset, you might want to add place or other defining bits that make this dataset unique compared to other FL MBON datasets.
- I am puzzled by `informationWithheld` because this is not the type of information that is usually included there, and especially because it's followed by a person's name and ID in `recordedBy` and `recordedByID`. I would suggest dropping it.
- There is no term `eventDateTime`; in fact `eventDate` is the place to record date and time together. I don't normally include `eventTime` since it's already part of `eventDate`. Further, is the event time that you have in UTC? I can't tell based on what you've provided. If not, you need to add the time zone to `eventDate` or convert to UTC.
- `occurrenceID` is not unique for each row in the data. There are 34 unique occurrenceIDs but 39 rows.
- I'm not sure what information you are providing in `taxonID`. I would recommend dropping it since you have `scientificNameID`, unless you have a strong use case to keep it.
- `individualCount` should only include whole numbers and it's not clear what these numbers represent. Maybe it's a count, but it's a calculated count and I think you have that in the extended measurement or fact table, so you can drop that one from the occurrence table. I would still keep `organismQuantity` and `organismQuantityType` and also have it in the eMoF.
- `measurementUnitID` is not in the occurrence extension so you need to drop that. It's only in the MoF extensions.
- `establishmentMeans` has a controlled vocabulary and it looks like we are unsure, so it's probably best to drop this unless you can be certain. Also, if you do use `establishmentMeans` then you are also supposed to include `degreeOfEstablishment` and `pathway`.
- You should drop `datasetID`, `eventDate` and any other fields that are already in the event core.
- It's `georeferenceVerificationStatus`
- You should drop `datasetID` from the eMoF table; that term doesn't exist in that extension.
- Any measurements or facts that are about the occurrences need to have `occurrenceID` added. I think it's only "Abundance" and "Count" that would need to have `occurrenceID` added, but if there are any others that relate specifically to the occurrence then you'll need to add it there too.
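For the `eventDate` point in the list above, here is a minimal sketch of folding a separate local date and time into a single UTC `eventDate`. The function name `to_utc_event_date`, the sample values, and the fixed UTC offset are assumptions for illustration only; a real conversion should use the actual time zone recorded in the cruise logs, including daylight saving changes.

```python
from datetime import datetime, timedelta, timezone


def to_utc_event_date(local_date: str, local_time: str, utc_offset_hours: float) -> str:
    """Combine a local date and time into one ISO 8601 eventDate string in UTC."""
    # Assumption: the offset from UTC is known and constant for this record.
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    local_dt = datetime.fromisoformat(f"{local_date}T{local_time}").replace(tzinfo=local_tz)
    return local_dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Hypothetical tow logged at 09:30 local time with a UTC-4 offset.
print(to_utc_event_date("2018-03-05", "09:30:00", -4))
```

Writing the result with a trailing `Z` makes the time zone explicit, so a separate `eventTime` column is not needed.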
I wasn't able to fully review the event table since it was only one row. One thing to watch out for is that the eventID needs to be unique for each row in the event table. Terms that I would suggest adding are coordinateUncertaintyInMeters, minimumDepthInMeters, and maximumDepthInMeters.
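The uniqueness and no-overlap checks described above could be sketched roughly as follows. This assumes each table has been read into a list of dictionaries (for example with `csv.DictReader`); the helper names `duplicated_ids` and `overlapping_fields` and the sample rows are hypothetical and are not taken from the linked scripts.

```python
from collections import Counter


def duplicated_ids(rows, id_field):
    """Return the IDs that appear on more than one row."""
    counts = Counter(row[id_field] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def overlapping_fields(event_rows, occurrence_rows, link_field="eventID"):
    """Return column names present in both tables, ignoring the linking field."""
    event_cols = set().union(*(row.keys() for row in event_rows))
    occurrence_cols = set().union(*(row.keys() for row in occurrence_rows))
    return sorted((event_cols & occurrence_cols) - {link_field})


# Hypothetical rows, for illustration only.
events = [{"eventID": "E1", "eventDate": "2018-03-05T13:30:00Z"}]
occurrences = [
    {"eventID": "E1", "occurrenceID": "E1:1", "eventDate": "2018-03-05T13:30:00Z"},
    {"eventID": "E1", "occurrenceID": "E1:1", "eventDate": "2018-03-05T13:30:00Z"},
]

print(duplicated_ids(events, "eventID"))             # eventID must be unique per event row
print(duplicated_ids(occurrences, "occurrenceID"))   # occurrenceID must be unique per occurrence row
print(overlapping_fields(events, occurrences))       # fields to drop from the occurrence table
```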
Great job! Abby
Hi @albenson-usgs,
I finally found the time to edit the dataset. I was editing much of my code to fix some other issues, but your suggestions were faster to implement.
I am mostly done editing each of the sheets. @7yl4r and I went through your suggestions and made changes within. I have included 3 different events with their corresponding occurrence and MoF sheets here with files names
@7yl4r and I have decided that it may be worthwhile to add a collectionCode, since we may add more data as they become available. We think using USF as the institutionCode should be okay. Would there be any advantage to having one specific to the USF College of Marine Science?
I have a few more suggestions to implement, such as coordinateUncertaintyInMeters, minimumDepthInMeters, and maximumDepthInMeters. I need to check other logs to find these. The only one I'm not certain of is how to determine coordinateUncertaintyInMeters.
Other questions we have:
- With the `catalogNumber`, we have the vials of preserved specimens, but we don't currently have an official way of labeling them other than the cruise info such as cruise ID, location, and mesh size. Do you have recommendations for labeling these or should we not worry about it at the moment?
- `individualCount` was an average of 3 aliquots. Currently, we replaced with a total count of the 3 aliquots as addition. Does this work or should we keep it as an average and move it to the MoF only?
- `datasetName` has been updated to "MBON Florida Keys National Marine Sanctuary Zooplankton Net Tows (2017 - 2020)" which can have the years be dynamically updated with new data. Is this enough to be unique? The previous one that was submitted had the name "Time series of zooplankton abundance of the South Florida Program / Sanctuaries Marine Biodiversity Observation Network programs". Maybe a combination could work instead?
- Lastly, how should I order the columns of each sheet?
Let me know what you think and I should be able to update this a bit faster.
Best, Sebastian D.
Sebastian would you mind bringing these questions to the SMBD meeting next week (March 8 at 4 ET)? There's a couple of them that I think would benefit from discussion from the group if you're amenable to that?
Sebastian is at sea this week; not sure if he will be back before the 8th so we may need to push back to the next SMBD meeting.
Hi @albenson-usgs,
I will still be out at sea during that time, but we will be on our way back and close to shore. Depending on how much internet we have, I may join the meeting, but probably without my camera. If not, Tylar will be able to relay any information regarding this dataset.
Thanks for your continued help!
Best, Sebastian
Hi @sebastiandig - sorry to take so long to get back to you on this. The data looks great! Just a few things to note and I'll follow that with answers to your questions.
- Should the information that's in `recordedBy` and `recordedByID` actually be in `identifiedBy` and `identifiedByID`? I missed this last time but I'm thinking this is the person that identified the specimens? Putting that info in identifiedBy terms means it will get picked up by Bionomia.
- eventID is required in the EMoF table. It's what connects the extension (EMoF) to the core (event).
- measurementType is missing for several measurements. I'm assuming you're working on those ones but just wanted to make a note.
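The last two points could be checked with something like the sketch below. It assumes the event and EMoF tables are already loaded as lists of dictionaries; the helper names and sample rows are hypothetical.

```python
def unlinked_emof_rows(event_rows, emof_rows):
    """Return indices of EMoF rows whose eventID is missing or not in the event core."""
    known_event_ids = {row["eventID"] for row in event_rows}
    return [
        index
        for index, row in enumerate(emof_rows)
        if row.get("eventID") not in known_event_ids
    ]


def rows_missing_measurement_type(emof_rows):
    """Return indices of EMoF rows with an empty or absent measurementType."""
    return [
        index
        for index, row in enumerate(emof_rows)
        if not (row.get("measurementType") or "").strip()
    ]


# Hypothetical rows, for illustration only.
events = [{"eventID": "E1"}]
emof = [
    {"eventID": "E1", "measurementType": "Abundance", "measurementValue": "12.5"},
    {"eventID": "", "measurementType": "Count", "measurementValue": "38"},
    {"eventID": "E1", "measurementType": "", "measurementValue": "200"},
]

print(unlinked_emof_rows(events, emof))       # rows that cannot be joined to the event core
print(rows_missing_measurement_type(emof))    # rows that still need a measurementType
```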
Now answers to your questions.
> Would there be any advantage to have one specific for USF College of Marine Science?
We might need to have a call to work through the details on this. There are already two instances for the University of South Florida: this one and this one. The second one is using the code you have in the data (USF). If you use that code, your observations will be linked to that institution, which at the moment is only for the USF Herbarium (many records in GRSciColl came from the Index Herbariorum). Realistically, in my mind USF should have a generalized institution record, the herbarium should be its own collection, and your collection would be another separate collection. I'm not sure how easy this will all be to resolve. I can email the contact listed to try to work through this.
> With the catalogNumber, we have the vials of preserved specimens, but we don't currently have an official way of labeling them other than the cruise info such as cruise ID, location, and mesh size. Do you have recommendations for labeling these or should we not worry about it at the moment?
I'm not sure what the best answer is here. I think the best thing would be to connect with a museum nearby to find out what they do. I would guess you want to label them sooner rather than later so the information is not forgotten/lost.
> individualCount was an average of 3 aliquots. Currently, we replaced with a total count of the 3 aliquots as addition. Does this work or should we keep it as an average and move it to the MoF only?
We discussed this a bit at the last SMBD meeting and I don't think we came to consensus. Which one do you all use in your analyses? I would put that in organismQuantity and organismQuantityType and make sure to include the others in EMoF.
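To make the trade-off concrete, here is a small sketch with made-up aliquot counts (the numbers and the helper name `summarize_aliquots` are hypothetical). The sum of the aliquots stays a whole number, while the average generally does not, which is why the average does not fit a field restricted to whole numbers.

```python
from statistics import mean


def summarize_aliquots(counts):
    """Return the total and the average of a list of aliquot counts."""
    return {"total": sum(counts), "average": mean(counts)}


# Hypothetical counts of one taxon in three aliquots.
summary = summarize_aliquots([12, 15, 11])
print(summary["total"])              # whole number
print(round(summary["average"], 2))  # fractional value
```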
> datasetName has been updated to "MBON Florida Keys National Marine Sanctuary Zooplankton Net Tows (2017 - 2020)" which can have the years be dynamically updated with new data. Is this enough to be unique? The previous one that was submitted had the name "Time series of zooplankton abundance of the South Florida Program / Sanctuaries Marine Biodiversity Observation Network programs". Maybe a combination could work instead?
This new one works well from my perspective but I would leave the years off since we would add more years in the future. I think "MBON Florida Keys National Marine Sanctuary Zooplankton Net Tows" is unique enough if that's an accurate description of the dataset.
> Lastly, how should I order the columns of each sheet?
Order doesn't matter. The IPT will pick up the columns based on name no matter where they are in the spreadsheet.
Hi Abby,
I think we're pretty close to completing this. Here is the link for the updated data set example for June 13, 2023. This is only 5 eventIDs. There are a total of 87, but they are formatted exactly the same.
- [x] change `recordedBy` and `recordedByID` to `identifiedBy` and `identifiedByID`
- [x] add `eventID` next to `occurrenceID` of the `MoF`
- [x] `datasetName` made to match the GBIF collection name
- [x] `collectionCode` and `institutionCode` updated and added
- [ ] missing `measurementType`
  - for this, currently, we filtered out missing terms because they don't seem to exist in the NERC Vocabulary Server
- [ ] `catalogNumber`, we removed it for now
- [ ] `individualCount`, we left it as is because anything else would leave a fraction. For analysis, we tend to use density in individuals per cubic meter, which was what I had originally, but it doesn't seem to fit.
Best, Sebastian
Everything looks great! No changes to suggest. Just one small clarification. You say "individualCount, we left it as is because anything else would leave a fraction. For analysis, we tend to use density in individuals per cubic meter which was what I had originally, but doesn't seem to fit." But I don't see individualCount in this dataset, only organismQuantity and organismQuantityType. You can put fractions in organismQuantity (or percentages or scales or categories) because it is not strict with the object class. It's just individualCount that is strict about that.
Hi @albenson-usgs ,
Ah, okay that makes sense. I remember now that I changed individualCount to organismQuantity. Is individualCount a required field?
I think it might be better to change the current organismQuantityType to "individuals per cubic meter" since it makes more sense on our end to have a density. I will fix the values of organismQuantity as well. Would I still include this field in the eMoF or remove it?
Let me know if this makes sense and I will update this.
Best, Sebastian
individualCount is not a required field :-) Your plan sounds like a good one to me! It's up to you if you want to include it in both places. OBIS recommends doing that so if you already have it documented in the eMoF, it's ok to leave it as you have it.
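Keeping the value in both places could look roughly like the sketch below, which copies the quantity from an occurrence row into a matching eMoF row. The function name, the sample IDs and values, and the choice of "Abundance" as `measurementType` are assumptions for illustration, not the published dataset's actual content.

```python
def occurrence_quantity_to_emof(occurrence, measurement_type, measurement_unit):
    """Build an eMoF row that repeats the occurrence's organismQuantity."""
    return {
        "eventID": occurrence["eventID"],
        "occurrenceID": occurrence["occurrenceID"],
        "measurementType": measurement_type,
        "measurementValue": occurrence["organismQuantity"],
        "measurementUnit": measurement_unit,
    }


# Hypothetical occurrence row; organismQuantity may hold a fraction.
occurrence = {
    "eventID": "E1",
    "occurrenceID": "E1:1",
    "organismQuantity": 12.5,
    "organismQuantityType": "individuals per cubic meter",
}

emof_row = occurrence_quantity_to_emof(
    occurrence,
    measurement_type="Abundance",
    measurement_unit=occurrence["organismQuantityType"],
)
print(emof_row)
```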
Hi @albenson-usgs and @7yl4r ,
Since the data is finally published, I think we can close this. Thanks for all your help.
Best, Sebastian
That's great to hear!!!
Before closing, please include the links to the data on IPT, GBIF, OBIS, and the DOI.
Thanks!
Good call Sebastian! Thanks for keeping on top of this.
- OBIS-USA IPT: https://ipt-obis.gbif.us/resource?r=sfmbon_zooplankton
- GBIF: https://www.gbif.org/dataset/ec0d2fe8-21b1-4ab1-8b91-67873e8ca912
- DOI: https://doi.org/10.15468/buqg4u
- OBIS: https://obis.org/dataset/afef5da2-614b-4208-aee6-c2413ed5ab76
@sebastiandig has fixed an error with abundance calculations in this dataset and needs to update it in OBIS. @MathewBiddle : I think we need your help to push this to the IPT, correct?
Do you have log in credentials for the IPT https://ipt-obis.gbif.us/? If so, you should be able to upload the new data files and we can push out a new version. Did you add any additional columns or reorganize the data in any way?
Yes, we have log in credentials. Ok, we will do that Matt, thank you!
@MathewBiddle I was able to log in to the IPT but I don't have permission to edit the resource. We have the new version of the dataset ready to upload
@cperaltab you should have permissions now. https://ipt-obis.gbif.us/manage/resource.do?r=sfmbon_zooplankton
Please check
Thanks Matt, all good.