[dataset]: PMN

Open jennifermaucher-pmn opened this issue 4 months ago • 4 comments

Contact details

[email protected]

Dataset Title

PMN historical data

Describe your dataset and any specific challenges or blockers you have or anticipate.

I am working with phytoplankton monitoring and data are posted on ERDDAP. I am TOTALLY NEW to all of this so I am hoping not to drown too quickly :) Ideally I would like to link our data to OBIS and other relevant databases.

Info about "raw" Data Files.

https://www.ncei.noaa.gov/erddap/tabledap/bedi_PMN.html

jennifermaucher-pmn avatar Sep 25 '25 20:09 jennifermaucher-pmn

486k records!! Holy smokes. This is fantastic. Since the data are already accessible on ERDDAP, we can develop a small script to do the reformatting.

First and foremost, I would start by creating a mapping table of all the species (variable spec_name) to an appropriate WoRMS identifier (I like to use the Python package pyworms, https://pyworms.readthedocs.io/en/latest/, for this). The workshop will cover this in the data cleaning portion, but you can take a look ahead of time at the materials (https://ioos.github.io/bio_mobilization_workshop/03-data-cleaning.html#matching-your-scientific-names-to-worms) if you're interested in getting a head start.

If interested, here is a small python code snippet to check your species names against WoRMS. Since there are 737 unique species names in the column, it will take a bit to run.

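import pandas as pd
import pyworms

# Load the full table from ERDDAP (about 486k records, so this takes a while).
df = pd.read_csv('https://www.ncei.noaa.gov/erddap/tabledap/bedi_PMN.csvp')

# Check each unique, non-null species name against WoRMS.
for spec_name in df['spec_name'].dropna().unique():
  print(f'Searching for {spec_name}.')
  resp = pyworms.aphiaRecordsByMatchNames(spec_name)
  matches = resp[0]  # WoRMS records matched to this name

  if len(matches) == 1:
    # Double quotes outside so the single-quoted keys also parse before Python 3.12.
    print(f"   found {spec_name} = AphiaID {matches[0]['AphiaID']} {matches[0]['url']}")
  elif len(matches) == 0:
    print(f'   {spec_name} not found.')
  else:
    print(f'   found {len(matches)} matches for {spec_name}.')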
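
Since only the unique names are needed for the WoRMS check, one option is to ask ERDDAP for just that column instead of downloading all the records. The following is a minimal sketch, assuming the tabledap `&distinct()` filter is enabled for this dataset and that the `.csvp` response has a single header row. The helper names (`distinct_names_url`, `parse_names`, `fetch_distinct_names`) are made up for illustration.

```python
import csv
import io
import urllib.request

# Dataset URL from this issue, without the file-type suffix.
ERDDAP_BASE = "https://www.ncei.noaa.gov/erddap/tabledap/bedi_PMN"


def distinct_names_url(base=ERDDAP_BASE, variable="spec_name"):
    """Build a tabledap request for the unique values of one variable."""
    return f"{base}.csvp?{variable}&distinct()"


def parse_names(csv_text):
    """Return the non-empty values of the first column, skipping the header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return [row[0].strip() for row in rows[1:] if row and row[0].strip()]


def fetch_distinct_names(url=None):
    """Download the distinct names (needs network access; not run on import)."""
    with urllib.request.urlopen(url or distinct_names_url()) as resp:
        return parse_names(resp.read().decode("utf-8"))
```

Calling `fetch_distinct_names()` should return a plain list of names that can be looped over in place of `df['spec_name'].dropna().unique()`.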

For the ones that don't match 1:1 on WoRMS, we'll need to do some sleuthing. For example, Entomonesis matches to four records at WoRMS. We'll have to decide which one is the appropriate mapping.

https://www.marinespecies.org/aphia.php?p=taxlist&tid=-1&tName=Entomonesis&searchpar=0&tComp=begins&action=search&rSkips=0&marine=1&fossil=4
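
To keep track of which names still need sleuthing, the lookup results can be collected into a mapping table rather than printed. This is a rough sketch, not a finished workflow: the status labels (`matched`, `not_found`, `needs_review`) and the function names are made up for illustration, and `worms_lookup` assumes pyworms is installed and WoRMS is reachable.

```python
import csv

FIELDS = ["spec_name", "status", "n_matches", "AphiaID", "url"]


def classify_matches(name, records):
    """Turn the WoRMS records returned for one name into a mapping-table row."""
    records = records or []
    row = {"spec_name": name, "n_matches": len(records), "AphiaID": None, "url": None}
    if len(records) == 1:
        # Exactly one hit: accept it as the mapping.
        row.update(status="matched", AphiaID=records[0]["AphiaID"], url=records[0]["url"])
    elif not records:
        row["status"] = "not_found"
    else:
        # Several hits (like Entomonesis): leave the ID blank for a manual decision.
        row["status"] = "needs_review"
    return row


def build_mapping(names, lookup):
    """Apply a lookup function (name -> list of records) to every name."""
    return [classify_matches(name, lookup(name)) for name in names]


def worms_lookup(name):
    """Query WoRMS for one name (needs pyworms and network access)."""
    import pyworms  # imported lazily so the rest of the sketch runs without it
    return pyworms.aphiaRecordsByMatchNames([name])[0]


def write_mapping(rows, path):
    """Save the mapping table as CSV so the blanks can be filled in by hand."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

For example, `write_mapping(build_mapping(names, worms_lookup), "pmn_worms_mapping.csv")` would produce a file (the file name is arbitrary) where the `needs_review` and `not_found` rows are the ones to resolve manually.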

Once you have the appropriate WoRMS identifier for each species, the rest should be a straightforward change of the column names.
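
As a rough illustration of that renaming step, the sketch below maps one source row to Darwin Core occurrence terms. Only `spec_name` is confirmed in this thread; the other source column names (`latitude`, `longitude`, `time`) and the `basisOfRecord` and `occurrenceStatus` values are assumptions that would need to be checked against the actual dataset.

```python
# Source column -> Darwin Core term. Only spec_name is confirmed in this
# thread; the other source names are assumed for illustration.
COLUMN_MAP = {
    "spec_name": "scientificName",
    "latitude": "decimalLatitude",
    "longitude": "decimalLongitude",
    "time": "eventDate",
}


def to_occurrence(row, aphia_ids):
    """Rename one source row (a dict) to Darwin Core terms and attach the WoRMS LSID.

    aphia_ids maps spec_name -> AphiaID, i.e. the finished mapping table.
    """
    out = {COLUMN_MAP[key]: value for key, value in row.items() if key in COLUMN_MAP}
    aphia_id = aphia_ids.get(row.get("spec_name"))
    if aphia_id is not None:
        out["scientificNameID"] = f"urn:lsid:marinespecies.org:taxname:{aphia_id}"
    # Assumed values; confirm they describe how PMN samples are collected.
    out["basisOfRecord"] = "HumanObservation"
    out["occurrenceStatus"] = "present"
    return out
```

Columns that are not in `COLUMN_MAP` are dropped here on purpose, so anything else worth keeping would need its own entry.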

You could probably create an event for each sampling site and date pair, then add the abiotic information as extended measurement or fact. But, we can discuss that at the workshop.
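
One way that event and extended-measurement-or-fact layout could look is sketched below. The source column names (`site`, `time`, `water_temp`, `salinity`), the units, and the eventID pattern are all hypothetical, and the sketch assumes the abiotic values are identical on every row of the same site and date, so it reads them from the first row only.

```python
# Abiotic source column -> (measurementType, measurementUnit).
# Column names and units are hypothetical; replace with the real ones.
ABIOTIC = {
    "water_temp": ("water temperature", "degrees Celsius"),
    "salinity": ("salinity", "PSU"),
}


def build_events(rows, abiotic=ABIOTIC):
    """Create one event per (site, date) pair plus its abiotic eMoF rows."""
    events = {}
    emof = []
    for row in rows:
        date = row["time"][:10]  # keep the YYYY-MM-DD part of an ISO timestamp
        event_id = f"{row['site']}_{date}"
        if event_id in events:
            continue  # event already created from an earlier row
        events[event_id] = {
            "eventID": event_id,
            "eventDate": date,
            "locationID": row["site"],
        }
        for column, (mtype, unit) in abiotic.items():
            value = row.get(column)
            if value not in (None, ""):
                emof.append({
                    "eventID": event_id,
                    "measurementType": mtype,
                    "measurementValue": value,
                    "measurementUnit": unit,
                })
    return list(events.values()), emof
```

Each occurrence row would then carry the same `eventID` so the species records link back to their sampling event.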

MathewBiddle avatar Sep 26 '25 12:09 MathewBiddle

Thanks for the materials ahead of time, I'll need all the prep I can get! Like I said, I am a total newbie/old dog learning new tricks but I will try to get a jump on this. I took a Python class a few years ago but never did anything with it, so the code snippet is super helpful to get me back into that language.

Hopefully any gov't shutdown will be of short duration so I can actually work on it ahead of time. We are also about to get whacked by a hurricane next week as well here in Charleston. Perfect storm?!?!?

Thanks again!

Jen

-- Jennifer Maucher Fuquay, M.S., Program Coordinator, Phytoplankton Monitoring Network, NOAA Charleston Lab, 331 Fort Johnson Rd, Charleston, SC 29412. ANOTHER NEW PHONE #: (843) 560-9143 (they really mean it this time)

jennifermaucher-pmn avatar Sep 26 '25 13:09 jennifermaucher-pmn

Also, take a look at the Wilkinson Basin Zooplankton Timeseries dataset mobilization process: https://github.com/ioos/bio_data_guide/tree/main/datasets/WBTS_MBON

There are probably a lot of similarities between the two.

MathewBiddle avatar Sep 26 '25 19:09 MathewBiddle

Thank you!! I will!

jennifermaucher-pmn avatar Sep 26 '25 19:09 jennifermaucher-pmn