[dataset]: DFO BioChem Plankton Data
Contact details
Dataset Title
2023 AZMP Zooplankton Data
Describe your dataset and any specific challenges or blockers you have or anticipate.
The data are available as raw Excel spreadsheets with the headers: mission, date, station, tow, gear ID, event ID, sample ID, depth, split, aliquot, taxa, stage, sex, count
The data are also available through an SQL database where there is additional metadata, but it would be preferred to load data from the raw spreadsheets.
The main hurdle to submission is formatting the data to meet OBIS requirements. An automated pipeline that makes formatting less resource-intensive would be ideal.
Info about "raw" Data Files.
No response
After discussion with the workshop leaders, I decided to focus on a smaller 2022 dataset for my first publication: BBMP2022_plankton.csv
I have an initial variable map:
| BioChem | DWC |
|---|---|
| MISSION_NAME | |
| MISSION_DESCRIPTOR | eventID |
| PROTOCOL | |
| START_DATE_EVENT | eventDate |
| START_DATE_HEAD | |
| COLLECTOR_STATION_NAME | |
| COLLECTOR_EVENT_ID | eventID |
| COLLECTOR_COMMENT_EVENT | eventRemarks |
| START_DEPTH | minimumDepthInMeters |
| END_DEPTH | maximumDepthInMeters |
| MESH_SIZE | measurementType: mesh size |
| COLLECTOR_SAMPLE_ID | eventID |
| COLLECTOR_HEADERS | |
| COLLECTOR_COMMENT_HEADERS | |
| MIN_SIEVE | measurementType: minimum sieve |
| MAX_SIEVE | measurementType: maximum sieve |
| SPLIT_FRACTION | measurementType: split fraction |
| NATIONAL_TAXONOMIC_SEQ | |
| COLLECTOR_TAXONOMIC_ID | verbatimIdentification |
| TAXONOMIC_NAME | |
| MODIFIER | |
| STAGE | |
| MOLT_NUMBER | measurementType: molt number |
| SEX | sex |
| COUNTS | individualCount |
| WET_WEIGHT | measurementType: wet weight |
| DRY_WEIGHT | measurementType: dry weight |
| COLLECTOR_COMMENT_GEN | |
| SOURCE | |
| CREATED_DATE | |
| PROD_CREATED_DATE | |
| MIN_LAT | decimalLatitude |
| MAX_LAT | |
| MIN_LON | decimalLongitude |
| LEADER | |
| PLATFORM | |
| START_DATE | eventDate |
| END_DATE | eventDate |
| PHASE_OF_DAYLIGHT | |
| SOUNDING | measurementType: total bottom depth |
| VOLUME | measurementType: volume |
| LARGE_PLANKTON_REMOVED | |
| COLLECTION_METHOD_NAME | measurementType: collection method |
| PROCEDURE_NAME | measurementType: procedure |
| VOLUME_METHOD_NAME | measurementType: volume method |
| HEADER_START_LAT | decimalLatitude |
| HEADER_END_LAT | |
| HEADER_START_LON | decimalLongitude |
| HEADER_END_LON | |
| HEADER_END_TIME | eventTime |
| HEADER_START_TIME | eventTime |
| HEADER_END_DATE | eventDate |
| LIFE_HISTORY_NAME | lifeStage |
| BEST_NODC7 | |
| EVENT_START_TIME | eventTime |
| EVENT_END_TIME | eventTime |
| EVENT_MIN_LON | decimalLongitude |
| EVENT_MAX_LON | |
| EVENT_MIN_LAT | decimalLatitude |
| EVENT_MAX_LAT | |
| EVENT_END_DATE | eventDate |
| UTC_OFFSET | |
| GEAR_TYPE | measurementType: gear type |
| GEAR_MODEL | measurementType: gear model |
| GEAR_SIZE | measurementType: gear size |
| TSN_ITIS | |
| AUTHORITY | |
| TSN | |
| APHIAID | identificationID |
| PRESERVATION_NAME | measurementType: preservation |
I need to confirm the measurementType names. I see that it is recommended "to use a controlled vocabulary", but I'm not sure which vocabulary would cover such specific metadata.
I also note that a lot of metadata is not being translated; this is for simplicity. BioChem currently includes WoRMS, TSN and BioChem identifications. Some other metadata, such as location, has multiple points in BioChem (start and end points), which will be reduced to a single point in Darwin Core.
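One simple way to reduce a start/end pair to a single point is to average the coordinates. The sketch below is in Python purely for illustration (the planned package is in R), and the function name `midpoint` and the sample coordinates are hypothetical. It assumes the two points are close together and do not straddle the antimeridian; a proper centroid calculation may be preferable.

```python
def midpoint(min_lat, max_lat, min_lon, max_lon):
    """Collapse a start/end coordinate pair into one point by averaging.

    Assumption: the two points are close together and do not cross the
    antimeridian. A missing end coordinate falls back to the start value.
    """
    lat = min_lat if max_lat is None else (min_lat + max_lat) / 2
    lon = min_lon if max_lon is None else (min_lon + max_lon) / 2
    return round(lat, 5), round(lon, 5)


# Hypothetical start and end positions, e.g. from MIN_LAT/MAX_LAT and MIN_LON/MAX_LON
print(midpoint(44.0, 45.0, -64.0, -63.0))
```

Rounding to five decimal places is an arbitrary choice here and should follow whatever precision the source positions actually support.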
My goal is to develop a simple R package to process this dataset, because of the volume of data I hope to eventually push through. This will make the process as reproducible and efficient as possible.
The steps of processing will be:
- [ ] pull from BioChem (linking to database and using standard SQL query)
- [ ] organize data into event, occurrence, emof tables
- [ ] translate column names
- [ ] fill additional columns based on existing data (connect to worrms to fill kingdom, phylum, genus, order, etc)
- [ ] check data with obistools
- [ ] export products for loading through IPT
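The "organize" and "translate column names" steps above could look roughly like the sketch below. It is written in Python for illustration only (the planned package is in R) and uses a small subset of the variable map. The function name `split_tables`, the occurrenceID scheme, and the sample rows are all hypothetical, and the measurementType labels are placeholders until a controlled vocabulary is chosen.

```python
# Subsets of the BioChem -> Darwin Core map; extend as the mapping is finalized.
EVENT_MAP = {
    "COLLECTOR_EVENT_ID": "eventID",
    "START_DATE_EVENT": "eventDate",
    "START_DEPTH": "minimumDepthInMeters",
    "END_DEPTH": "maximumDepthInMeters",
}
OCCURRENCE_MAP = {
    "COLLECTOR_TAXONOMIC_ID": "verbatimIdentification",
    "SEX": "sex",
    "COUNTS": "individualCount",
}
# Columns that become one eMoF row each (labels are placeholders).
MEASUREMENT_MAP = {"WET_WEIGHT": "wet weight", "DRY_WEIGHT": "dry weight"}


def split_tables(rows):
    """Split flat BioChem-style records into event, occurrence and eMoF rows."""
    events = {}        # keyed by eventID so each event appears only once
    occurrences = []
    emof = []
    for index, row in enumerate(rows, start=1):
        event_id = row["COLLECTOR_EVENT_ID"]
        if event_id not in events:
            events[event_id] = {dwc: row.get(src) for src, dwc in EVENT_MAP.items()}

        # Hypothetical occurrenceID scheme: event ID plus the row number.
        occurrence_id = f"{event_id}_{index}"
        occurrence = {"eventID": event_id, "occurrenceID": occurrence_id}
        for src, dwc in OCCURRENCE_MAP.items():
            occurrence[dwc] = row.get(src)
        occurrences.append(occurrence)

        for src, measurement_type in MEASUREMENT_MAP.items():
            value = row.get(src)
            if value not in (None, ""):  # skip empty measurements
                emof.append({
                    "eventID": event_id,
                    "occurrenceID": occurrence_id,
                    "measurementType": measurement_type,
                    "measurementValue": value,
                })
    return list(events.values()), occurrences, emof


# Made-up records: two taxa counted in the same event
sample_rows = [
    {"COLLECTOR_EVENT_ID": "E1", "START_DATE_EVENT": "2022-01-05",
     "START_DEPTH": "0", "END_DEPTH": "70",
     "COLLECTOR_TAXONOMIC_ID": "taxon A", "SEX": "", "COUNTS": "12",
     "WET_WEIGHT": "0.5", "DRY_WEIGHT": ""},
    {"COLLECTOR_EVENT_ID": "E1", "START_DATE_EVENT": "2022-01-05",
     "START_DEPTH": "0", "END_DEPTH": "70",
     "COLLECTOR_TAXONOMIC_ID": "taxon B", "SEX": "", "COUNTS": "3",
     "WET_WEIGHT": "", "DRY_WEIGHT": ""},
]
events, occurrences, emof = split_tables(sample_rows)
print(len(events), len(occurrences), len(emof))
```

Keeping the maps as plain lookup tables means the mapping can change without touching the splitting logic, which matters while the column map is still being refined.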
@EOGrady21 give us a shout if you need any help!
A more polished version of my column mapping:
| OBIS | BioChem | notes | tag |
|---|---|---|---|
| occurrenceID | generate programmatically | | occurrence |
| basisOfRecord | FILL PROGRAMMATICALLY [materialSample] | | occurrence |
| scientificName | FILL FROM APHIAID W WORRMS | | occurrence |
| scientificNameID | APHIAID | | occurrence |
| occurrenceStatus | present or absent | | occurrence |
| verbatimIdentification | COLLECTOR_TAXONOMIC_ID | | occurrence |
| sex | SEX | add dwciri:sex and standardize | occurrence |
| taxonRank | FILL FROM APHIAID W WORRMS | | occurrence |
| kingdom | FILL FROM APHIAID W WORRMS | | occurrence |
| phylum | FILL FROM APHIAID W WORRMS | | occurrence |
| class | FILL FROM APHIAID W WORRMS | | occurrence |
| order | FILL FROM APHIAID W WORRMS | | occurrence |
| family | FILL FROM APHIAID W WORRMS | | occurrence |
| genus | FILL FROM APHIAID W WORRMS | | occurrence |
| scientificNameAuthorship | FILL FROM APHIAID W WORRMS | | occurrence |
| lifeStage | LIFE_HISTORY_NAME | add dwciri:lifeStage and standardize | occurrence |
| eventID | MISSION_DESCRIPTOR | eventType: cruise | event |
| eventID | COLLECTOR_EVENT_ID | eventType: event | event |
| eventDate | START_DATE - END_DATE | eventType: cruise | event |
| eventDate | START_DATE_EVENT | eventType: event | event |
| eventTime | EVENT_START_TIME | be sure to format with UTC_OFFSET | event |
| decimalLatitude | FILL FROM MIN_LAT MAX_LAT USING OBISTOOLS::CALCULATE_CENTROID | | event |
| decimalLongitude | FILL FROM MIN_LON MAX_LON USING OBISTOOLS::CALCULATE_CENTROID | | event |
| geodeticDatum | WGS84 | | event |
| minimumDepthInMeters | START_DEPTH | | event |
| maximumDepthInMeters | END_DEPTH | | event |
| samplingProtocol | CONCATENATE GEAR_TYPE, GEAR_MODEL, GEAR_SIZE, PRESERVATION_NAME, COLLECTION_METHOD_NAME, PROCEDURE_NAME | | event |
| eventRemarks | COLLECTOR_COMMENT_EVENT | | event |
| individualCount | COUNTS | | emof |
| sampleSizeValue | VOLUME | needs sampleSizeUnit | emof |
| measurementValue | WET_WEIGHT | measurementType: Zooplankton wet weight biomass, measurementTypeID:SDN:P02::GP079 | emof |
| measurementValue | DRY_WEIGHT | measurementType: Zooplankton dry weight biomass per unit volume of the water column, measurementTypeID: SDN:P02::MSBD | emof |
I matched some of the measurements with P02 terms. Note the reduction in metadata: this came from discussion with SMEs about where most of the scientific value lies. The result is a more manageable map that still gives high-value information.
Next step, start coding a pipeline! :)
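One small piece of that pipeline could be the eventDate and eventTime formatting from the mapping above. The sketch below is in Python for illustration only (the planned package is in R). It assumes the dates and times have already been parsed and that UTC_OFFSET is a number of hours, which still needs to be confirmed against BioChem. The function names and sample values are hypothetical.

```python
from datetime import date, time, timedelta, timezone


def format_event_date(start, end=None):
    """Return a single ISO 8601 date, or a start/end interval (e.g. for a cruise)."""
    if end is None or end == start:
        return start.isoformat()
    return f"{start.isoformat()}/{end.isoformat()}"


def format_event_time(local_time, utc_offset_hours):
    """Attach a UTC offset (assumed to be given in hours) to a local time of day."""
    offset = timezone(timedelta(hours=utc_offset_hours))
    return local_time.replace(tzinfo=offset).isoformat()


# Made-up values standing in for START_DATE, END_DATE, EVENT_START_TIME, UTC_OFFSET
print(format_event_date(date(2022, 1, 5), date(2022, 12, 20)))  # 2022-01-05/2022-12-20
print(format_event_time(time(14, 30), -4))                      # 14:30:00-04:00
```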