FSharp.Data

New SDMX TypeProvider

Open demonno opened this issue 7 years ago • 4 comments

Since several SDMX standard-based data sources have emerged recently, it would be useful to have a type provider supporting such data sources. The following describes the current status of the effort to create an SDMX TypeProvider. It is open to ideas and suggestions. I am very much looking forward to feedback from the FSharp.Data community on whether an SDMX type provider implementation would be a good fit for FSharp.Data.

There are many details to cover, so the following lists only the simplest examples and provides references below for further details in case someone is interested.

Motivation

The amount of data available over SDMX is growing, and the standard is a good fit for the type provider approach.

The goal

Implement the SdmxProvider, which will support the simplest cases as a first step.

Background

SDMX (Statistical Data and Metadata eXchange) gives a standardized way of exposing statistical databases as a web service, which provides all necessary metadata and extensive ways of querying the data. Currently, there are multiple implementations of the SDMX standard that can be accessed publicly.

Specification and WorldBank example

For simplicity, let's recall the already familiar WorldBank TypeProvider from FSharp.Data and replicate the same scenario using SDMX. Let's say we want to query annual agricultural land data for Germany.

WorldBank Provider

open FSharp.Data

let wb = WorldBankData.GetDataContext()
let data = wb.Countries.Germany.Indicators.``Agricultural land (sq. km)``

SDMX Specification

The following steps describe how the same data can be queried using the SDMX REST API.

Everything starts from the wsEntryPoint, which in the case of WorldBank is:

  • https://api.worldbank.org/v2/sdmx/rest/

There are two major parts to this process, metadata and data retrieval.

Metadata

  • Retrieve all dataflows - https://api.worldbank.org/v2/sdmx/rest/dataflow/all/all/latest/
  • We choose WDI - World Development Indicators
  • Retrieve all WDI-related metadata and datastructure information - https://api.worldbank.org/v2/sdmx/rest/datastructure/WB/WDI/1.0/?references=children
  • The previous step exposes information about the existing data dimensions. In this case, there are 3 dimensions:
    • Frequency - [Annual, Monthly, Quarterly, ...]
    • Series - [List of Indicators ... ]
    • Reference Area - [List of countries and regions .. ]
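
The metadata steps above come down to building two URLs from the entry point. Here is a minimal sketch of that, not the provider's actual code: the helper names `dataflowUrl` and `dataStructureUrl` are hypothetical, and the URL shapes are simply copied from the WorldBank examples listed above.

```fsharp
// Hypothetical helpers for illustration; the URL shapes mirror the
// WorldBank examples above and are not taken from any existing library.
let entryPoint = "https://api.worldbank.org/v2/sdmx/rest/"

/// URL that lists all dataflows exposed by the entry point.
let dataflowUrl (entryPoint: string) =
    sprintf "%s/dataflow/all/all/latest/" (entryPoint.TrimEnd('/'))

/// URL for one data structure definition together with its referenced
/// children, which is where the dimension information comes from.
let dataStructureUrl (entryPoint: string) (agency: string) (structureId: string) (version: string) =
    sprintf "%s/datastructure/%s/%s/%s/?references=children"
        (entryPoint.TrimEnd('/')) agency structureId version

printfn "%s" (dataflowUrl entryPoint)
printfn "%s" (dataStructureUrl entryPoint "WB" "WDI" "1.0")
```

Fetching and parsing the returned XML is left out on purpose; that part depends on the SDMX version the source speaks.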

Data

Dimension information is used to create a query (key). We are looking for annual agricultural land data in Germany. To create such a key, we build a sequence with one identifier per dimension, separated by dots (ordering matters).

  • A - Annual
  • AG_LND_AGRI_K2 - Agricultural land (sq. km)
  • DEU - Germany

Data query (key): A.AG_LND_AGRI_K2.DEU

Finally, data is retrieved using the URL: https://api.worldbank.org/v2/sdmx/rest/data/WDI/A.AG_LND_AGRI_K2.DEU/
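
The key and data URL construction described above can be sketched in a few lines. This is only an illustration under the assumption that the URL pattern matches the WorldBank example; `buildKey` and `dataUrl` are hypothetical names, not part of FSharp.Data.

```fsharp
// Hypothetical helpers for illustration only.

/// Joins one code per dimension with dots. The order must follow the
/// dimension order of the data structure (Frequency, Series, Reference Area here).
let buildKey (codes: string list) = String.concat "." codes

/// Builds the data URL for a dataflow and key, following the WorldBank example.
let dataUrl (entryPoint: string) (dataflow: string) (key: string) =
    sprintf "%s/data/%s/%s/" (entryPoint.TrimEnd('/')) dataflow key

let key = buildKey [ "A"; "AG_LND_AGRI_K2"; "DEU" ]
let url = dataUrl "https://api.worldbank.org/v2/sdmx/rest/" "WDI" key

printfn "%s" key   // A.AG_LND_AGRI_K2.DEU
printfn "%s" url   // https://api.worldbank.org/v2/sdmx/rest/data/WDI/A.AG_LND_AGRI_K2.DEU/
```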

SDMX Provider

Querying the same data from WorldBank using SdmxProvider would look like the following:

type wb = SdmxProvider<"https://api.worldbank.org/v2/sdmx/rest/">
let data = wb.``World Development Indicators``.Annual.``Agricultural land (sq. km)``.Germany

Navigation using . (dots) should allow interaction on multiple levels. Initializing the TypeProvider will require initial configuration or static parameters, which are:

  • Protocol: Http or Https
  • EntryPoint: Rest API entry point URL
  • Credentials: In case the API is not publicly available

Foreseen issues

  • SDMX supports complex data, e.g. it is possible to choose multiple values from a single dimension (multiple countries or indicators). This will require some design decisions.
  • How should the SDMX ?queryparams that are used for additional filtering be exposed in the type provider?
  • Intermittent runtime errors
    • The provider needs a mechanism for retrying data fetches.
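
Two of the issues above can be sketched without committing to a provider design. The SDMX 2.1 REST key syntax joins several codes of one dimension with `+`, and an empty position leaves that dimension unconstrained. The snippet below is a rough sketch under that assumption: `buildKey` and `retry` are hypothetical names, `FRA` is just an illustrative second country code, and whether a given source accepts such keys would need to be checked per endpoint.

```fsharp
// Hypothetical sketch; none of these names exist in FSharp.Data.

/// Takes one list of codes per dimension. Codes within a dimension are
/// joined with "+", dimensions with ".". An empty list leaves the
/// dimension unconstrained.
let buildKey (dimensions: string list list) =
    dimensions
    |> List.map (String.concat "+")
    |> String.concat "."

/// Runs `fetch`, retrying on any exception while attempts remain.
/// The exception from the final attempt propagates to the caller.
let rec retry (maxAttempts: int) (fetch: unit -> 'a) : 'a =
    try
        fetch ()
    with _ when maxAttempts > 1 ->
        retry (maxAttempts - 1) fetch

let multiKey = buildKey [ [ "A" ]; [ "AG_LND_AGRI_K2" ]; [ "DEU"; "FRA" ] ]
printfn "%s" multiKey   // A.AG_LND_AGRI_K2.DEU+FRA

// Simulated flaky source: fails twice, then succeeds.
let calls = ref 0
let flaky () =
    calls.Value <- calls.Value + 1
    if calls.Value < 3 then failwith "transient error" else "ok"

printfn "%s" (retry 5 flaky)   // ok
```

A real implementation would probably add a delay between attempts and retry only on errors known to be transient; the sketch retries immediately on any exception to stay short.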

Additional features to be included:

  • Paging
  • Lazy Fetching
  • Async
  • other optimizations

References


Comments, ideas, and suggestions are welcome. Thanks.

demonno, Sep 09 '18 19:09

It would be nice to be able to replace the WorldBank provider, which is very specific, with something like this that would generalize to other data sources. I think an SDMX provider would fit nicely into FSharp.Data.

ovatsus, Oct 07 '18 14:10

Bumping this issue; this would make it much easier to create data science examples, since the amount of data provided has grown significantly since this was created. Any implementation tips would be appreciated.

ArmanAttaran, Jan 06 '21 08:01

A working prototype implementation is in the https://github.com/demonno/FSharp.Data fork. We'll try to finally create a pull request based on that work. There is support for SDMX protocol version 2.1. Some SDMX sources offer only the SDMX 2.0 protocol, and that part is not yet implemented. A description of how the proposed solution works is available here: https://digikogu.taltech.ee/en/Item/47d2c178-2681-4aa5-9e25-23868a21c29b

juhan, Jan 06 '21 09:01

@juhan no need to implement 2.0; SDMX 3.0 is being released this year as well. Most places will move to a more modern version shortly.

ArmanAttaran, Jan 06 '21 17:01