[dataset]: DFO BioChem Plankton Data

Open EOGrady21 opened this issue 2 years ago • 9 comments

Contact details

[email protected]

Dataset Title

2023 AZMP Zooplankton Data

Describe your dataset and any specific challenges or blockers you have or anticipate.

The data are available in raw Excel format with the headers: mission, date, station, tow, gear ID, event ID, sample ID, depth, split, aliquot, taxa, stage, sex, count

The data are also available through an SQL database where there is additional metadata, but it would be preferred to load data from the raw spreadsheets.

The main hurdle to submission is formatting the data to meet OBIS requirements. Ideally, an automated pipeline would make formatting a less resource-intensive task.

Info about "raw" Data Files.

No response

EOGrady21 avatar Feb 01 '24 12:02 EOGrady21

After discussion with the workshop leaders, I decided to focus on a smaller 2022 dataset for my first publication: BBMP2022_plankton.csv.

I have an initial variable map:

| BioChem | DWC |
| --- | --- |
| MISSION_NAME | |
| MISSION_DESCRIPTOR | eventID |
| PROTOCOL | |
| START_DATE_EVENT | eventDate |
| START_DATE_HEAD | |
| COLLECTOR_STATION_NAME | |
| COLLECTOR_EVENT_ID | eventID |
| COLLECTOR_COMMENT_EVENT | eventRemarks |
| START_DEPTH | minimumDepthInMeters |
| END_DEPTH | maximumDepthInMeters |
| MESH_SIZE | measurementType: mesh size |
| COLLECTOR_SAMPLE_ID | eventID |
| COLLECTOR_HEADERS | |
| COLLECTOR_COMMENT_HEADERS | |
| MIN_SIEVE | measurementType: minimum sieve |
| MAX_SIEVE | measurementType: maximum sieve |
| SPLIT_FRACTION | measurementType: split fraction |
| NATIONAL_TAXONOMIC_SEQ | |
| COLLECTOR_TAXONOMIC_ID | verbatimIdentification |
| TAXONOMIC_NAME | |
| MODIFIER | |
| STAGE | |
| MOLT_NUMBER | measurementType: molt number |
| SEX | sex |
| COUNTS | individualCount |
| WET_WEIGHT | measurementType: wet weight |
| DRY_WEIGHT | measurementType: dry weight |
| COLLECTOR_COMMENT_GEN | |
| SOURCE | |
| CREATED_DATE | |
| PROD_CREATED_DATE | |
| MIN_LAT | decimalLatitude |
| MAX_LAT | |
| MIN_LON | decimalLongitude |
| LEADER | |
| PLATFORM | |
| START_DATE | eventDate |
| END_DATE | eventDate |
| PHASE_OF_DAYLIGHT | |
| SOUNDING | measurementType: total bottom depth |
| VOLUME | measurementType: volume |
| LARGE_PLANKTON_REMOVED | |
| COLLECTION_METHOD_NAME | measurementType: collection method |
| PROCEDURE_NAME | measurementType: procedure |
| VOLUME_METHOD_NAME | measurementType: volume method |
| HEADER_START_LAT | decimalLatitude |
| HEADER_END_LAT | |
| HEADER_START_LON | decimalLongitude |
| HEADER_END_LON | |
| HEADER_END_TIME | eventTime |
| HEADER_START_TIME | eventTime |
| HEADER_END_DATE | eventDate |
| LIFE_HISTORY_NAME | lifeStage |
| BEST_NODC7 | |
| EVENT_START_TIME | eventTime |
| EVENT_END_TIME | eventTime |
| EVENT_MIN_LON | decimalLongitude |
| EVENT_MAX_LON | |
| EVENT_MIN_LAT | decimalLatitude |
| EVENT_MAX_LAT | |
| EVENT_END_DATE | eventDate |
| UTC_OFFSET | |
| GEAR_TYPE | measurementType: gear type |
| GEAR_MODEL | measurementType: gear model |
| GEAR_SIZE | measurementType: gear size |
| TSN_ITIS | |
| AUTHORITY | |
| TSN | |
| APHIAID | identificationID |
| PRESERVATION_NAME | measurementType: preservation |

I need to confirm the measurementType names. I see that it is recommended "to use a controlled vocabulary", but I'm not sure which vocabulary would cover metadata this specific.

I also note that a lot of metadata is not being translated; this is for simplicity. BioChem currently includes WoRMS, TSN and BioChem identifications. Some other metadata, such as location, has multiple points in BioChem (start and end points), which will be reduced to a single point in Darwin Core.
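For illustration, here is a minimal sketch of how the variable map above could be applied to a CSV export. It is written in Python with the standard library only, purely as an example. `COLUMN_MAP` holds just a subset of the map, and where several BioChem columns point at the same Darwin Core term one was picked arbitrarily. The function names and the sample record are hypothetical, not real BioChem data.

```python
import csv
import io

# Subset of the BioChem -> Darwin Core map above. Where several BioChem
# columns map to the same term, one was picked arbitrarily for illustration.
COLUMN_MAP = {
    "COLLECTOR_EVENT_ID": "eventID",
    "START_DATE_EVENT": "eventDate",
    "START_DEPTH": "minimumDepthInMeters",
    "END_DEPTH": "maximumDepthInMeters",
    "COLLECTOR_TAXONOMIC_ID": "verbatimIdentification",
    "LIFE_HISTORY_NAME": "lifeStage",
    "SEX": "sex",
    "COUNTS": "individualCount",
}


def translate_row(row):
    """Rename mapped BioChem columns and drop the unmapped ones."""
    return {dwc: row[biochem] for biochem, dwc in COLUMN_MAP.items() if biochem in row}


def translate_csv(text):
    """Translate every record of a CSV export supplied as a string."""
    return [translate_row(row) for row in csv.DictReader(io.StringIO(text))]


# Made-up sample record; SOURCE is unmapped and is therefore dropped.
sample = "COLLECTOR_EVENT_ID,COUNTS,SOURCE\nEV001,12,lab\n"
translated = translate_csv(sample)
```

Dropping unmapped columns silently matches the "not being translated" note above, but a real pipeline would probably want to log which columns were discarded.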

EOGrady21 avatar Apr 24 '24 12:04 EOGrady21

My goal is to develop a simple R package to process this dataset, given the volume of data I hope to eventually push through. This will make the process as reproducible and efficient as possible.

The steps of processing will be:

  • [ ] pull from BioChem (linking to database and using standard SQL query)
  • [ ] organize data into event, occurrence, emof tables
  • [ ] translate column names
  • [ ] fill additional columns based on existing data (connect to worrms to fill kingdom, phylum, genus, order, etc)
  • [ ] check data with obistools
  • [ ] export products for loading through IPT
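The "organize data into tables" step could be sketched roughly as below (Python, standard library only, for illustration). The assumptions are that each input record is a flat dictionary keyed by BioChem column names, that `COLLECTOR_EVENT_ID` identifies the event, and that an occurrence ID can be built from the event ID plus a running row number. That ID scheme, `MEASUREMENT_COLUMNS`, and `split_tables` are hypothetical.

```python
# Hypothetical choice of occurrence-level measurement columns, using the
# working measurementType labels from the first variable map.
MEASUREMENT_COLUMNS = {
    "MOLT_NUMBER": "molt number",
    "WET_WEIGHT": "wet weight",
}


def split_tables(rows):
    """Split flat BioChem-style records into event, occurrence and eMoF lists."""
    events = {}  # keyed by eventID so each event is stored only once
    occurrences = []
    emof = []
    for number, row in enumerate(rows, start=1):
        event_id = row["COLLECTOR_EVENT_ID"]
        events.setdefault(event_id, {
            "eventID": event_id,
            "eventDate": row.get("START_DATE_EVENT", ""),
        })
        # Assumed ID scheme: event ID plus a running row number.
        occurrence_id = f"{event_id}_{number}"
        occurrences.append({
            "occurrenceID": occurrence_id,
            "eventID": event_id,
            "verbatimIdentification": row.get("COLLECTOR_TAXONOMIC_ID", ""),
            "individualCount": row.get("COUNTS", ""),
        })
        # One long-format eMoF row per non-empty measurement value.
        for column, measurement_type in MEASUREMENT_COLUMNS.items():
            value = row.get(column, "")
            if value != "":
                emof.append({
                    "eventID": event_id,
                    "occurrenceID": occurrence_id,
                    "measurementType": measurement_type,
                    "measurementValue": value,
                })
    return list(events.values()), occurrences, emof
```

The three returned lists correspond to the event, occurrence, and eMoF files that would later be checked and exported; the taxonomy lookup and validation steps are left out of this sketch.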

EOGrady21 avatar Apr 24 '24 12:04 EOGrady21

@EOGrady21 give us a shout if you need any help!

MathewBiddle avatar Apr 24 '24 12:04 MathewBiddle

A more polished version of my column mapping:

| OBIS | BioChem | notes | tag |
| --- | --- | --- | --- |
| occurrenceID | | generate programmatically | occurrence |
| basisOfRecord | | FILL PROGRAMMATICALLY [materialSample] | occurrence |
| scientificName | | FILL FROM APHIAID W WORRMS | occurrence |
| scientificNameID | APHIAID | | occurrence |
| occurrenceStatus | | present or absent | occurrence |
| verbatimIdentification | COLLECTOR_TAXONOMIC_ID | | occurrence |
| sex | SEX | add dwciri:sex and standardize | occurrence |
| taxonRank | | FILL FROM APHIAID W WORRMS | occurrence |
| kingdom | | FILL FROM APHIAID W WORRMS | occurrence |
| phylum | | FILL FROM APHIAID W WORRMS | occurrence |
| class | | FILL FROM APHIAID W WORRMS | occurrence |
| order | | FILL FROM APHIAID W WORRMS | occurrence |
| family | | FILL FROM APHIAID W WORRMS | occurrence |
| genus | | FILL FROM APHIAID W WORRMS | occurrence |
| scientificNameAuthorship | | FILL FROM APHIAID W WORRMS | occurrence |
| lifeStage | LIFE_HISTORY_NAME | add dwciri:lifeStage and standardize | occurrence |
| eventID | MISSION_DESCRIPTOR | eventType: cruise | event |
| eventID | COLLECTOR_EVENT_ID | eventType: event | event |
| eventDate | | START_DATE - END_DATE, eventType: cruise | event |
| eventDate | START_DATE_EVENT | eventType: event | event |
| eventTime | EVENT_START_TIME | be sure to format with UTC_OFFSET | event |
| decimalLatitude | | FILL FROM MIN_LAT MAX_LAT USING OBISTOOLS::CALCULATE_CENTROID | event |
| decimalLongitude | | FILL FROM MIN_LON MAX_LON USING OBISTOOLS::CALCULATE_CENTROID | event |
| geodeticDatum | | WGS84 | event |
| minimumDepthInMeters | START_DEPTH | | event |
| maximumDepthInMeters | END_DEPTH | | event |
| samplingProtocol | | CONCATENATE GEAR_TYPE, GEAR_MODEL, GEAR_SIZE, PRESERVATION, collection_method_name, procedure | event |
| eventRemarks | COLLECTOR_COMMENT_EVENT | | event |
| individualCount | COUNTS | | emof |
| sampleSizeValue | VOLUME | needs sampleSizeUnit | emof |
| measurementValue | WET_WEIGHT | measurementType: Zooplankton wet weight biomass, measurementTypeID: SDN:P02::GP079 | emof |
| measurementType: dry weight | DRY_WEIGHT | measurementType: Zooplankton dry weight biomass per unit volume of the water column, measurementTypeID: SDN:P02::MSBD | emof |
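Several notes in the table describe derived event fields. Below is a rough Python sketch (standard library only) of three of them, under stated assumptions. The midpoint is a plain bounding-box average that stands in for, and is not equivalent to, `obistools::calculate_centroid`. `UTC_OFFSET` is assumed to be a number of hours with local time = UTC + offset, and times are assumed to arrive as `HH:MM`. The `" | "` separator and all function names are hypothetical choices.

```python
from datetime import datetime, timedelta, timezone

# Columns concatenated into samplingProtocol, following the table's note;
# the exact BioChem column names used here are an assumption.
PROTOCOL_COLUMNS = (
    "GEAR_TYPE", "GEAR_MODEL", "GEAR_SIZE",
    "PRESERVATION_NAME", "COLLECTION_METHOD_NAME", "PROCEDURE_NAME",
)


def bounding_box_midpoint(min_lat, max_lat, min_lon, max_lon):
    """Average the bounds of a lat/lon box (ignores antimeridian crossings)."""
    return (min_lat + max_lat) / 2.0, (min_lon + max_lon) / 2.0


def event_time_with_offset(local_time, utc_offset_hours):
    """Format an 'HH:MM' local time as an ISO 8601 time with explicit offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    parsed = datetime.strptime(local_time, "%H:%M").replace(tzinfo=tz)
    return parsed.timetz().isoformat()


def sampling_protocol(row):
    """Join the non-empty protocol-related columns into one string."""
    parts = (row.get(name, "") for name in PROTOCOL_COLUMNS)
    return " | ".join(part.strip() for part in parts if part and part.strip())
```

For example, `event_time_with_offset("09:30", -4)` gives `09:30:00-04:00`, and `bounding_box_midpoint(44.0, 45.0, -64.0, -63.0)` gives `(44.5, -63.5)`.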

I matched some of the measurements with P02 terms. Note the reduction of metadata: this came from discussion with SMEs about where the majority of the scientific value lies, and the result is a more manageable map that still gives high-value information.
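As an illustration of the P02 matching, the two weight rows of the map could be turned into eMoF records along these lines (Python, standard library only). The measurementType labels and measurementTypeID strings are copied as written in the table. The function name and the occurrence ID are hypothetical, and no measurementUnit is set because the units are not stated here.

```python
# measurementType labels and IDs copied verbatim from the mapping table.
WEIGHT_MEASUREMENTS = {
    "WET_WEIGHT": (
        "Zooplankton wet weight biomass",
        "SDN:P02::GP079",
    ),
    "DRY_WEIGHT": (
        "Zooplankton dry weight biomass per unit volume of the water column",
        "SDN:P02::MSBD",
    ),
}


def weight_emof_rows(occurrence_id, row):
    """Build one eMoF record per non-empty weight column of a BioChem record."""
    records = []
    for column, (measurement_type, type_id) in WEIGHT_MEASUREMENTS.items():
        value = row.get(column)
        if value in (None, ""):
            continue  # skip weights that were not recorded
        records.append({
            "occurrenceID": occurrence_id,
            "measurementType": measurement_type,
            "measurementTypeID": type_id,
            "measurementValue": value,
        })
    return records
```

Keeping the vocabulary lookup in one dictionary means further measurements can be added without touching the function once their terms are confirmed.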

Next step, start coding a pipeline! :)

EOGrady21 avatar Apr 24 '24 15:04 EOGrady21