
THREDDS catalog harvesting support

Open epifanio opened this issue 12 years ago • 12 comments

It would be great to add to pycsw the ability to harvest a THREDDS catalog.

This enhancement can be used and tested during the OSGeo-OSSIM GSoC 2013 [ http://trac.osgeo.org/ossim/wiki/GSoC_2013_Ideas ].

A short example:

The main link is the catalog: http://geoport.whoi.edu/thredds/catalog.html

Its XML representation: http://geoport.whoi.edu/thredds/catalog.xml

That is the main tree, which references all the other sub-catalogs through the catalogRef tag.

See http://geoport.whoi.edu/thredds/catalog.xml, where each [catalogRef] points to another sub-catalog ...

... until a final dataset with no [catalogRef] is reached (the leaf).

So http://geoport.whoi.edu/thredds/global_bathy.xml is a leaf, and in its XML each dataset carries a urlPath.

Appending the string ?dataset=<urlPath> to the leaf .html URL gives the final link from which to retrieve the WMS GetCapabilities, the NetCDF link, etc.

That final link to the dataset can then be used to load the data with netcdf4-python: https://code.google.com/p/netcdf4-python/
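
For illustration, a minimal sketch of loading such a link with netcdf4-python (the OPeNDAP path below is hypothetical, just to show the call):

    # minimal sketch: open a THREDDS-served dataset over OPeNDAP with netcdf4-python;
    # the URL below is hypothetical; substitute the link obtained from the catalog
    from netCDF4 import Dataset

    nc = Dataset('http://geoport.whoi.edu/thredds/dodsC/some/dataset')  # hypothetical urlPath
    print(nc.variables.keys())
    nc.close()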

To deal with infinite loops/recursion, a maxrecords parameter can be used, along with a check that a catalog with the same URL cannot yield more than one result with the same metadata.
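
A rough sketch of that traversal using only the Python standard library is below; element and attribute names (catalogRef, xlink:href, urlPath) follow the InvCatalog spec, the namespace may be v1.0 or v1.1 depending on the server, and the walk_catalog helper is purely illustrative (not pycsw code):

    # illustrative sketch: walk a THREDDS catalog, following catalogRef elements
    # down to leaf datasets, with a visited set and a cap to guard against loops.
    # walk_catalog is a hypothetical helper, not part of pycsw or thredds_crawler.
    from urllib.request import urlopen
    from urllib.parse import urljoin
    from xml.etree import ElementTree as etree

    NS = {
        'thredds': 'http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0',  # or .../v1.1
        'xlink': 'http://www.w3.org/1999/xlink',
    }

    def walk_catalog(url, visited=None, maxrecords=1000):
        """Yield the urlPath of every leaf dataset reachable from a catalog URL."""
        if visited is None:
            visited = set()
        if url in visited or len(visited) >= maxrecords:  # loop / runaway-recursion guard
            return
        visited.add(url)
        tree = etree.parse(urlopen(url))
        # leaf datasets carry a urlPath attribute
        for ds in tree.iterfind('.//thredds:dataset[@urlPath]', NS):
            yield ds.attrib['urlPath']
        # catalogRef elements point (via xlink:href) to sub-catalogs
        for ref in tree.iterfind('.//thredds:catalogRef', NS):
            href = ref.get('{http://www.w3.org/1999/xlink}href')
            if href:
                for url_path in walk_catalog(urljoin(url, href), visited, maxrecords):
                    yield url_path

    for url_path in walk_catalog('http://geoport.whoi.edu/thredds/catalog.xml'):
        print(url_path)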

These might help in understanding THREDDS:

  • http://www.unidata.ucar.edu/projects/THREDDS/tech/catalog/v1.0.2/InvCatalogSpec.html
  • and a one-page overview: http://www.unidata.ucar.edu/publications/factsheets/2007sheets/threddsFactSheet-1.doc

epifanio avatar Jun 07 '13 17:06 epifanio

@epifanio thanks for details. @kwilcox @rsignell any comments/thoughts? Are there any Python THREDDS catalog parsers out there?

tomkralidis avatar Jun 08 '13 17:06 tomkralidis

I've written a simple one I use to pull out ISO files from THREDDS catalogs: https://github.com/asascience-open/glos_catalog/blob/master/pyiso/pyiso/collectors/thredds.py

I don't remember why I ended up using the THREDDS HTML pages over the XML catalog pages.

I'm new to pycsw... is there some level of metadata you are looking to get out of each "dataset" in THREDDS, or are you just looking for the service endpoints that THREDDS provides on top of each dataset?

kwilcox avatar Jun 08 '13 18:06 kwilcox

@kwilcox thanks for the info. In CSW we can harvest both data and services and link them by association, and have parent/child relationships. So in theory we can do both.

A first pass could be leaf datasets.

tomkralidis avatar Jun 09 '13 13:06 tomkralidis

@tomkralidis this might be a good start, comments welcome: https://github.com/asascience-open/thredds_crawler
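
For readers new to it, basic usage looks roughly like this (the dataset attributes id, name and services, and the structure of the service dicts, are assumptions based on the thredds_crawler README rather than a verified API reference):

    # minimal sketch of crawling a THREDDS catalog with thredds_crawler;
    # the dataset attributes used below are assumptions based on its README
    from thredds_crawler.crawl import Crawl

    c = Crawl('http://geoport.whoi.edu/thredds/catalog.xml')
    for d in c.datasets:
        print(d.id, d.name)
        for s in d.services:  # assumed: each service is a dict with 'name', 'service', 'url'
            print('  %s: %s' % (s.get('service'), s.get('url')))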

kwilcox avatar Jun 12 '13 13:06 kwilcox

FYI, from IRC:

[12:33] tomkralidis basically, this maps to, in pycsw, pycsw.server.harvest
[12:34] tomkralidis which basically does some checking/fetching then inserting/updating.
[12:34] tomkralidis the root of the action is in https://github.com/geopython/pycsw/blob/master/pycsw/metadata.py
[12:35] tomkralidis where, depending on the harvest type (WMS, WFS, WAF, another CSW, etc.), a dedicated parser is written.
[12:35] tomkralidis for example, https://github.com/geopython/pycsw/blob/master/pycsw/metadata.py#L623
[12:35] tomkralidis here pycsw uses OWSLib to fetch/parse the Capabilities into an OWSLib object (in this case SOS)
[12:36] tomkralidis and does all those _set() functions to map the OWSLib object properties to pycsw's internal model of things.
[12:37] tomkralidis so the pycsw.metadata._parse_* functions are the core functions which do the work.
[12:37] tomkralidis luckily, we have https://github.com/asascience-open/thredds_crawler which already does the parsing/etc. for us.
[12:37] tomkralidis so it's a matter of looping each object from kwilcox' lib and mapping into a pycsw rec. done.
[12:39] tomkralidis there is some dressing to do, of course (dependency on thredds_crawler, registering THREDDS in pycsw's harvesting model, etc., and documentation), but those are minor minor efforts.

tomkralidis avatar Nov 27 '13 18:11 tomkralidis

Higher level items to address:

  • add thredds_crawler==0.6-dev to https://github.com/geopython/pycsw/blob/master/setup.py#L119 to be fetched via setup.py / PyPI
  • add THREDDS namespace http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.1 to the end of the list in https://github.com/geopython/pycsw/blob/master/pycsw/server.py#L2345, this tells pycsw to allow THREDDS Harvesting
  • add THREDDS to the supported resource types in the docs at https://github.com/geopython/pycsw/blob/master/docs/transactions.rst
  • add to pycsw.metadata.parse_record (a rough sketch of _parse_thredds follows below):

    elif mtype == 'http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.1': # THREDDS
        LOGGER.debug('THREDDS Catalog detected, fetching via thredds_crawler')
        return _parse_thredds(context, repos, record, identifier)  # returns list of metadata objects
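
For concreteness, a rough sketch of what _parse_thredds might look like, modeled on the existing _parse_* functions in pycsw/metadata.py (it assumes it lives in that module, where _set and the repos/context objects are available; the thredds_crawler attribute names and the exact queryables populated here are assumptions, not a tested implementation):

    def _parse_thredds(context, repos, record, identifier):
        """Rough sketch: loop thredds_crawler's leaf datasets and map each one
        into a pycsw record, following the pattern of the existing _parse_* functions."""
        from thredds_crawler.crawl import Crawl

        recobjs = []
        for dataset in Crawl(record).datasets:  # record is the THREDDS catalog URL
            recobj = repos.dataset()
            _set(context, recobj, 'pycsw:Identifier', dataset.id)
            _set(context, recobj, 'pycsw:Typename', 'csw:Record')
            _set(context, recobj, 'pycsw:Schema', 'http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.1')
            _set(context, recobj, 'pycsw:MdSource', record)
            _set(context, recobj, 'pycsw:Type', 'dataset')
            _set(context, recobj, 'pycsw:Title', dataset.name)
            _set(context, recobj, 'pycsw:ParentIdentifier', identifier)
            recobjs.append(recobj)
        return recobjs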

tomkralidis avatar Nov 27 '13 23:11 tomkralidis

FYI test request would be:

<?xml version="1.0" encoding="UTF-8"?>
<Harvest xmlns="http://www.opengis.net/cat/csw/2.0.2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/cat/csw/2.0.2 http://schemas.opengis.net/csw/2.0.2/CSW-publication.xsd" service="CSW" version="2.0.2">
  <Source>http://geoport.whoi.edu/thredds/bathy_catalog.xml</Source>
  <ResourceType>http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0</ResourceType>
  <ResourceFormat>application/xml</ResourceFormat>
</Harvest>
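
A hedged sketch of issuing that request against a pycsw endpoint with the requests library (the endpoint URL and file name are hypothetical; adjust them to your deployment):

    # sketch: POST the Harvest document above to a pycsw CSW endpoint
    import requests

    harvest_xml = open('thredds_harvest.xml', 'rb').read()  # hypothetical file holding the XML above
    r = requests.post('http://localhost:8000/csw',           # hypothetical pycsw endpoint
                      data=harvest_xml,
                      headers={'Content-Type': 'application/xml'})
    print(r.status_code)
    print(r.text)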

tomkralidis avatar Dec 02 '13 02:12 tomkralidis

The pathway we have been promoting so far for searching THREDDS catalogs has been: (1) use the ncISO service on THREDDS, or a separate ncISO process pointed at a THREDDS catalog, to harvest ISO metadata into a catalog service with a database (GI-CAT, Geonetwork, Geoportal Server, CKAN);
(2) access the CSW services from those catalogs (GI-CAT, Geonetwork, Geoportal Server, CKAN).

Just to make sure I understand: is this enhancement issue about harvesting metadata from THREDDS catalogs into pyCSW using the ncISO service?

rsignell-usgs avatar Dec 02 '13 14:12 rsignell-usgs

@rsignell-usgs yes, this enhancement is about harvesting a THREDDS catalog endpoint into pycsw. The questions we're having now are around granularity/associations, and how much / how deep to harvest THREDDS catalogs, and with what approach.

tomkralidis avatar Dec 02 '13 15:12 tomkralidis

Thanks for the explanation. I think I get it now.

So are you harvesting the metadata using the ncISO service in THREDDS, or "rolling your own"?
If rolling your own, it would be great if the ACDD metadata (http://wiki.esipfed.org/index.php/Attribute_Convention_for_Data_Discovery_(ACDD)) conventions were used in the same way that ncISO uses them to create ISO metadata. Or maybe you are all over this already...

-Rich

rsignell-usgs avatar Dec 02 '13 16:12 rsignell-usgs

@rsignell-usgs thanks. Do all THREDDS catalogs have all datasets documented in ISO? If yes, then it's easier just to pick off each dataset as an ISO document (I'm not familiar with THREDDS catalogs).

tomkralidis avatar Dec 02 '13 16:12 tomkralidis

No, it's not guaranteed that a THREDDS catalog will have the ncISO service running that provides ISO metadata for each dataset. In addition, if there is an ncISO service, it may be an ancient version. So folks harvesting metadata from THREDDS catalogs usually run the latest ncISO (http://www.ngdc.noaa.gov/eds/tds/, https://geo-ide.noaa.gov/wiki/index.php?title=NcISO) to ensure the latest ACDD => ISO mapping.

rsignell-usgs avatar Dec 02 '13 17:12 rsignell-usgs