
Open standards for data schemas and serialization formats

Open · bsweger opened this issue 10 years ago • 45 comments

Opening this issue to track thoughts on open source data schemas/standards/formats.

General Questions

  • How are people in the world implementing open-source data standards/schemas and moving data around?
  • Do special considerations exist for (non-PII, non-confidential) federal spending data standards that distinguish this space from data-standard implementations in other industries?
  • How important is it to build validations/metadata into a schema versus having a separate validation process and metadata definitions?
  • Related: what are the pros and cons of building a data input/validation process tied to a particular schema syntax vs. de-coupling syntax from validations/metadata?

Questions for specific data standard/schema options:

  • How robust is its open-source ecosystem? Has the open source community developed tooling around it? If so, how actively are the tools updated/maintained, and how large is the contributor base? Will users incur licensing fees or get locked into a vendor?
  • How widely adopted is the data standard/schema? Is it limited to a particular domain?
  • How easy is it for data submitters/producers (both biz and tech) to use?
  • How easy is it for data consumers (both biz and tech) to use?
  • What resources exist for non-coders to understand the data standard/schema? Is that even relevant if good, robust metadata exists (see above)?
  • What's the process for updating and tracking changes to the schema/standard?

These are schemas/standards/formats we know about. What are others?

bsweger avatar Aug 27 '15 16:08 bsweger

Taking a first stab at sussing out what the various tools and libraries are attempting to solve. Will update as I learn more.

| Name | Defining Structure/Type | Structure is Extensible | Data Serialization | Data Validation | Human-Friendly |
| --- | --- | --- | --- | --- | --- |
| XML/XBRL | y | | y | y | |
| Protocol Buffers | y | ? | y | y | y |
| JSON | | | y | | |
| Avro | y | y? | y | y | sort of |
| CSV | | | | | y |
| Kwalify (YAML + JSON) | y | y | y | y | y |
| XML/NEIM | y | y | n? | y | y |
| OKFN/CSV + JSON | y | ? | n | n | y |

bsweger avatar Aug 31 '15 17:08 bsweger

During the past few sprints, we've been working on converting this pilot's data to XBRL, using the schemas posted here: http://fedspendingtransparency.github.io/data-exchange-standard/.

The learning curve is steep, and the open-source XBRL ecosystem is not especially robust.

The open-source XBRL GUI we're playing with: Arelle

The Python library used to create classes based on the schema files: generateDS. We had to install it manually (the PyPI version wasn't current, and pip install from Bitbucket didn't work)--it seems a bit janky.

Neither of these tools seems to have a large community or be actively updated. I'd be reluctant to rely on them.

Is XBRL an example of an open standard that's too complicated to use without commercial tools? What are tools/best practices used by others in the open source community for converting data to XML/XBRL?
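
For reference, a minimal sketch of the lowest-tech route we've tried--building an XML instance from a row of data with lxml. The element and field names here are hypothetical placeholders, not the actual DATA Act/XBRL schema:

```python
# Minimal sketch: build an XML instance from a row of award data using lxml.
# Element and field names are hypothetical placeholders, not the real schema.
from lxml import etree

row = {"award_id": "ABC-123", "obligation": "18000.00", "awardee": "Acme Corp"}

root = etree.Element("Award")
for field, value in row.items():
    etree.SubElement(root, field).text = value

print(etree.tostring(root, pretty_print=True, encoding="unicode"))
```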

bsweger avatar Aug 31 '15 17:08 bsweger

This is great, and I'm glad that someone is thinking of this! I have a few initial thoughts to add in here now. I apologize for how scattered these messages are, but this is a surprisingly broad topic. I almost wonder if there is a better format we should be using for discussing this topic in public. Anyhow, some points for now in no particular order:

  1. I think I would like to consider the utility of any data standard in terms of three distinct audiences:
    • biz/tech in the relevant domain (e.g., financial customers with an active interest in the data)
    • technical users who aren't necessarily domain experts (e.g., civic coders, information designers, data journalists)
    • nontechnical members of the general public who might read script-produced but human-readable extracts of the data

As a former data journalist, I have my own biases here, but it strikes me that heavily convoluted schemas tend to favor the first audience and disadvantage the second. And I feel they tend to disadvantage the third because they make it harder for news organizations and other civic groups (treasury.io, for instance) to make complicated information more understandable to the general public.

  2. A lack of internal data validation is not necessarily a dealbreaker for any given format. Many existing validation frameworks were essentially developed so that serialized data structures could be ingested into strongly-typed languages (e.g., XSD, Protocol Buffers), and that validation is either ignored by dynamic languages that consume the data or can be emulated with explicit checks. XBRL does provide a complicated mechanism for specifying relationships like formulas between XML elements in the schema definition, but this is also something that could be enforced explicitly in simple dynamic languages with more straightforward assertions, expressed in a style similar to unit testing (see the sketch after this list).
  3. I would like to consider the CSV format within this discussion as well. It admittedly does not include a machine-readable schema definition or native type representations, but CSV does have an advantage in that it is easily parsed and can even be opened up within Excel or other spreadsheet programs by non-technical users to verify data. This is not true of any of the other formats here.
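
A minimal sketch of the unit-test-style checks mentioned in point 2, with hypothetical field names:

```python
# Sketch of the "explicit checks" idea: emulate schema validation with plain
# assertions in a dynamic language. Field names are hypothetical.
def validate_award(record):
    assert isinstance(record.get("award_id"), str) and record["award_id"], \
        "award_id must be a non-empty string"
    assert isinstance(record.get("obligation"), (int, float)), \
        "obligation must be numeric"
    assert record["obligation"] >= 0, "obligation must be non-negative"

validate_award({"award_id": "ABC-123", "obligation": 18000.00})
```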

harrisj avatar Sep 08 '15 17:09 harrisj

One other thing that occurs to me literally after I submitted this last comment: even the most comprehensive level of validations (whether expressed as machine-readable schema relations or programmer-written post-loading tests) will often miss some caveats/issues that are about how the data is collected or how it should be interpreted by end users:

  1. For instance, I have seen data sets where, for legacy reasons, a special numeric code like 99999 is meant to be interpreted as null and not a number. Or a field is mostly numeric but is sometimes sent with a string like N/A or NULL instead. (See the sketch after this list.)
  2. The data set might not be comprehensive, missing contributions from certain agencies or states.
  3. The data might be for a different interval of time or unit than the user is expecting it to be (only 9 months instead of a year, cents instead of dollars)
  4. The data itself might be approximate or measure something different than users expect or have other restrictions that make it not suitable for certain applications. For instance, using a database of reports about acts of terrorism as a measure for actual acts of terrorism.
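
A minimal sketch of the kind of cleanup referenced in point 1--coercing documented sentinel values (all hypothetical here) before analysis. This is exactly the sort of knowledge that lives in documentation rather than in a schema:

```python
# Sketch: coerce documented "missing" codes (all hypothetical) before analysis.
SENTINELS = {"99999", "N/A", "NULL", ""}

def clean_amount(raw):
    """Return a float, or None if the value is a known sentinel for 'missing'."""
    text = str(raw).strip()
    if text in SENTINELS:
        return None
    return float(text)

assert clean_amount("99999") is None
assert clean_amount("18000.00") == 18000.0
```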

I'm sure there are other cases, but I guess the tl;dr point I want to make is that even the most automated validation mechanisms will not remove the necessity of creating documentation for this data and the requirement of users to read it. In some cases, automated validation might make things worse by making people assume that programming errors are caught automatically...

harrisj avatar Sep 08 '15 18:09 harrisj

It seems logical to me to consider the overall manageability of a schema in terms of what needs to be done to make it usable to the (inevitably different) groups of users who benefit directly or indirectly from the schema's use. As @harrisj points out, not all users will be "domain experts" in terms of process or technology. Also, whether a schema is "open source" or not may be secondary to whether or not the schema is known and understood widely enough to support (or require) the development of maintainable tools for its management and use. I've seen instances where the lack of tools to support schema adoption has significantly slowed standards adoption, and I hope that is not the case here.

ddmcd avatar Sep 08 '15 19:09 ddmcd

I don't see OpenSpending in that list. Their approach is basically:

  • publish data as CSV
  • accompany it with JSON in Budget Data Package format
  • feed it into an ecosystem of datasets from around the world, tools etc.

It's currently being overhauled (a victim of its own success, they're reaching the limits of the current architecture), so you could consider helping work towards the new version.

stevage avatar Sep 09 '15 01:09 stevage

@harrisj Thanks for weighing in. Yeah, this is a big topic to tackle in a GitHub issue 😁

Your audience breakdown is particularly helpful. And to your point that "heavily convoluted" schemas favor some groups of users over others, the reverse doesn't seem to hold: if a simpler schema + explicit validations can accommodate the second and third groups, the first group doesn't lose anything.

I added CSV to the table up here. You're right--it should be there.

bsweger avatar Sep 09 '15 01:09 bsweger

@ddmcd Good point about a lack of tools slowing a schema's adoption--thank you. Do you have a preferred way of expressing a data standard/schema?

bsweger avatar Sep 09 '15 01:09 bsweger

@stevage Thanks for the pointer to OpenSpending. The use of CSVs as the lowest common denominator for transmitting data, plus JSON for structure, is intriguing. The CSV + structure approach is similar to what we're using in this project's prototype, except that the mapping is done internally before data is transmitted, so there's no external insight into how it was done. Some great stuff here to think about.

bsweger avatar Sep 09 '15 15:09 bsweger

FWIW, the W3C has been working on CSV standards for the web for a while. I haven't dug too deeply into the metadata stuff, so I can't speak to its usefulness here. But I would like to register my support for CSV generally, because it's the most "backward-compatible" format (and still one of the most compact representations) out there.

shawnbot avatar Sep 09 '15 16:09 shawnbot

Extensible is a confusing column for me, because that could mean a variety of things. I'm assuming it refers to the ability of users to add their own additional columns beyond what is provided in the base data file. It seems like there are some caveats for what is there:

  • XML is indeed the extensible markup language, but I think adding new fields to an XBRL file requires the coordinated step of defining a custom XSD schema that extends the base schema and hosting it in a location that can be accessed by third-party tools that need it to validate your XML. For some users, this might also mean recompiling header files or other programming structures so they can deserialize your changed XML (and they might have type collisions if they also try to load non-extended XBRL with unextended types in the same namespace). In short, it's complicated.
  • JSON is indeed extensible in that it's possible to add other fields to an existing structure. There is no inherent schema in the JSON that would have to be modified (unless we decided to adopt a standard for that), but the challenge is that without careful naming, it's hard to distinguish which fields are in the original unextended structure and which were added as part of the extension.
  • Similarly, extending a CSV involves merely adding columns to the right of the existing table. This is compatible with existing loaders that refer to columns by position or name, since those simply ignore the new extensions (see the sketch after this list), but the same naming issue that exists with JSON records applies. In addition, it gets a bit ugly if we're trying to merge two distinct extensions in a new file. Which extension's columns come first? How do we handle name collisions?
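
As a sketch of the CSV point above: a loader that selects columns by name simply ignores extension columns appended to the right (column names are hypothetical):

```python
# Sketch: a by-name loader drops any columns appended as an extension.
# Column names are hypothetical.
import csv, io

base_fields = ["award_id", "obligation"]
extended_csv = "award_id,obligation,ext_program_office\nABC-123,18000.00,XYZ\n"

for row in csv.DictReader(io.StringIO(extended_csv)):
    record = {name: row[name] for name in base_fields}  # extension column ignored
    print(record)
```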

I guess I have one other question about these formats and their integer/numeric and date types. Do they use binary representations? Text with some sort of explicit type declaration? Or just plain text with no explicit typing?

harrisj avatar Sep 09 '15 17:09 harrisj

@bsweger I think that restricting yourself to only open-source tools for working with the data is a challenge. Open source is great where it fits, but falls flat where it doesn't--which is often in areas without a large/broad enough user base with an interest in building and maintaining the open-source tools. I don't think the community in this space is that large, and I think the community doesn't really care whether a tool is open source or proprietary so long as they have a tool for the job. I too have spent time looking for open-source tools related to XBRL and have found few/none. My next thought--still just a thought--was to see if I could co-opt an open-source, XML-ingesting ETL tool for the purpose.

One other thing that you may want to consider in your analysis is the availability of learning materials and opportunities in the ecosystem. XBRL, for example, is backed by a global organization with conferences and such, including an entire track devoted to academia, so there is research. I'm not extremely XBRL-fluent myself, having learned mostly from my interactions related to the DATA Act. It's essentially just an extended XML schema though--right? With a governance process to manage it.

Beyond the technical, something can be said for using a standard already in use by other similar entities (government or financial) for similar purposes, because all that use builds an ecosystem (open is nice, but getting the job done is better)--for example, automated compliance rules for regulators. Sure, the rules could be hand-coded over and over to check the data, but I'd think building the rules into the schema is more efficient than building them into code when it comes to re-using a rule elsewhere. I'm thinking rule re-use and easy broad adoption, not code re-use: rule re-use enables broad adoption and agreement on the rules through some sort of governance and compliance mechanism. A final thought: changing the model is often less impactful from an IT perspective than changing the code that runs the model (so long as the code truly uses the model and metadata vs. hard coding). Again, I'm still diving into the topic in my spare time and just sharing my thoughts. I wish I had more time to dig deeper! I look forward to the outcome!

HerschelC avatar Sep 09 '15 18:09 HerschelC

I don't know much about it yet, but kwalify applied to YAML or JSON seems to meet:

  • Defining Structure/Type: y
  • Structure is Extensible: y
  • Data Serialization: y
  • Data Validation: y
  • Human-Friendly: y
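
For anyone who wants to poke at it, a minimal sketch of how that validation might look in Python, assuming the pykwalify package (a Python port of Kwalify); the file names and fields are hypothetical:

```python
# Sketch only: validate a YAML document against a Kwalify-style schema using
# the pykwalify package (pip install pykwalify). File names are hypothetical.
#
# schema.yaml might contain:
#   type: map
#   mapping:
#     award_id:   {type: str, required: yes}
#     obligation: {type: float, required: yes}
from pykwalify.core import Core

core = Core(source_file="award.yaml", schema_files=["schema.yaml"])
core.validate(raise_exception=True)  # raises on schema violations
```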

catherinedevlin avatar Sep 10 '15 07:09 catherinedevlin

I think NIEM would be another option to consider. It's less technical and more process--but something already well-known in the government data exchange space. I'm again leaning towards the governance structure and process around the exchange protocol being more important than the technical details under the hood.

HerschelC avatar Sep 10 '15 13:09 HerschelC

At Open Knowledge we've been working on very simple open source methods for shipping around data under the heading of "Data Packages" for some years now:

  • http://data.okfn.org/ - main Frictionless Data / Data Packages site
  • http://data.okfn.org/doc/tabular-data-package - Tabular Data Packages designed for shipping around tabular data (CSV + metadata in JSON)
  • http://os-data.github.io/boost-paraguay/viz/treemap/ - a simple example of using Tabular Data Packages / Fiscal Data Packages to describe some fiscal data (Paraguay's BOOST data from the World Bank) and then use some tooling to create nice visualizations - raw data package in github at https://github.com/os-data/boost-paraguay
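
To make the shape of a Tabular Data Package concrete, here's an illustrative (not normative) descriptor pairing a CSV resource with a JSON Table Schema, written out from Python; the dataset and field names are made up:

```python
# Illustrative sketch of a datapackage.json descriptor pairing a CSV resource
# with a JSON Table Schema. Names and fields are hypothetical, not normative.
import json

descriptor = {
    "name": "awards-example",
    "resources": [{
        "path": "awards.csv",
        "schema": {
            "fields": [
                {"name": "award_id", "type": "string"},
                {"name": "obligation", "type": "number"},
                {"name": "action_date", "type": "date"},
            ]
        }
    }]
}

with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)
```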

rufuspollock avatar Sep 10 '15 14:09 rufuspollock

@catherinedevlin @HerschelC @rgrp The table above has been updated with your suggestions--thank you!

Let me know if I got anything wrong, and apologies for the somewhat awkward format of jamming all of this into a GitHub thread. The input is much appreciated!

bsweger avatar Sep 10 '15 15:09 bsweger

You might also take a look at Microformats: http://microformats.org/.

ryanbharvey avatar Sep 10 '15 15:09 ryanbharvey

Also, RDF (http://www.w3.org/RDF/) and RDFa (http://rdfa.info/).

ryanbharvey avatar Sep 10 '15 15:09 ryanbharvey

You might want to look at the Open Geospatial Consortium (OGC), http://www.opengeospatial.org/, a standards body specializing in open standards for geospatial data. A list of the current OGC standards can be found at http://www.opengeospatial.org/standards/is. A quick read through the reference model, http://rap.opengeospatial.org/orm.php, might be useful.

My personal opinion on open standards is that it is not an open standard unless there is an open source reference implementation of the standard.

OGC has several standards, but a couple that might be useful to look at would be the GeoPackage data format (http://www.geopackage.org/spec/), the Web Map Service (WMS), the Catalogue Service for the Web (CSW), and the Sensor Observation Service (SOS).

newcomb-d avatar Sep 10 '15 16:09 newcomb-d

Having dealt with managing/converting hundreds of data formats and services over the last few years, I see the notable advantages @harrisj and @stevage mentioned. I think the thread is hard to digest because open-source tools are not being distinguished from open data formats/services/API standards. Can someone clarify/delineate the discussion, since these are different creatures?

Focusing on the data itself (i.e., standards/specs/formats): I would highly advise against using an XML-based standard and would caution against using CSV as a data exchange format:

Avoid eXtensible Markup Language (XML)

<Parent_Variable type="DataType">
  <Child_Variable type="DataType">Value</Child_Variable>
</Parent_Variable>

XML is really NOT that extensible, because it derives data relationships from nesting data in a hierarchy (i.e., nested elements). So adding new fields or remapping the hierarchy isn't impossible--but it becomes very impractical for these reasons:

  • Rendering/parsing XML requires that both parties (data producer and consumer) conform to the same schema. Any remapping has to be done on both sides of the fence, which means reaching new consensus on changes and updating the actual systems that exchange the data.
  • It requires data conformity: the data types for each field (i.e., text string, boolean, float, integer) must maintain their types.
  • It's mind-numbing to read/interpret as a human.

In other words, there is too much rigidity in the structure/data types. Rigidity is nice if things never change--but they always will.

Avoid Delimited formats (CSV) for Data Exchange

Data formats where data is stored as text with a special character (comma, semicolon, tab, pipe, etc.) delimiting each value, like CSV, are 'file' formats convenient for download but ARE NOT useful/practical for data services/exchanges/APIs, because the whole/contiguous file needs to be downloaded before it can be opened (i.e., you cannot stream portions or individual data entries separate from the whole dataset). There can also be issues maintaining the data structure (i.e., some rows have more columns or don't match the correct column heading) if values in a field contain the same characters that are supposed to delimit those values.
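
A quick sketch of the delimiter problem mentioned above: naive splitting on the delimiter breaks as soon as a value contains it, and correct handling depends on quoting:

```python
# Sketch of the delimiter problem: naive splitting breaks when a value
# contains the delimiter; proper CSV quoting is what keeps columns aligned.
import csv, io

buf = io.StringIO()
csv.writer(buf).writerow(["ABC-123", "Acme, Inc.", "18000.00"])
line = buf.getvalue().strip()       # ABC-123,"Acme, Inc.",18000.00

print(line.split(","))              # four pieces -- columns no longer line up
print(next(csv.reader([line])))     # three columns, quoting handled correctly
```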

Use JSON as the Data Exchange format

Because everyone already does and because it doesn't have the issues above ;)

JJediny avatar Sep 10 '15 16:09 JJediny

While a well documented binary format may not be human readable, it sure saves on bandwidth and storage!

newcomb-d avatar Sep 10 '15 16:09 newcomb-d

@JJediny asked...

I think the thread is hard to digest because Open Source tools are not being distinguished from Open Data Formats/Services/API standards? Can someone clarify/delineate the discussion? as these things are different creatures...

You're right--there's a lot happening in this thread. A focus on the open data formats/services/standards rather than tools is the intent. Thanks for your great points!

bsweger avatar Sep 10 '15 17:09 bsweger

Proceedings from the ICA Commission on Geoinformation Infrastructures and Standards SDI-Open 2015 meeting are available for download at http://sdistandards.icaci.org/.

newcomb-d avatar Sep 10 '15 22:09 newcomb-d

I would argue that NIEM (not NEIM) is Human-Friendly (and should be "y") given a rich set of requirements around developing conformant information exchanges (i.e. IEPDs) by focusing on required technical schemas AND business documentation artifacts for reuse. Reuse is only promoted via a human-friendly understanding of the exchange.

The NIEM model itself has open-source tools that make searching model content easy and offer type and element definitions to aid in understanding content meaning. Lastly, a UML profile for NIEM exists to aid business users with understanding content relationships.

Data Serialization should be "y" as XML instances are created from NIEM-conformant XML schemas with many implementation examples across industry, federal, state, local and international governments.

jstekervetz avatar Sep 10 '15 23:09 jstekervetz

I suggest renaming this issue to: Open standards for data schemas and serialization formats

  • Let's use the Bruce Perens definition of open standard which is widely used and even included in some government policies.
  • Let's define a vocabulary as a controlled set of terms with agreed-upon meaning. Vocabularies can be independent of schemas and serialization formats--and they often are--but they depend on an agreed-upon schema and serialization format to enable machine-to-machine interoperability.
  • Let's define a data schema as a way of representing terms in the vocabulary using specific data types and a way of representing the relationships between the terms. Schemas can be independent of serialization formats, but are usually tied closely to one.
  • Let's define a serialization format as a standardized machine readable syntax and file format for encoding data. This is often done using a set data schema, but not always.

When talking about open data standards, it's usually assumed that the standard specifies not only the vocabulary but also the schema and the serialization format, even if multiple serialization formats are supported for a given schema. If you look at the schemas on schema.org you can see some good examples of this breakdown. These schemas are built from a broad vocabulary of terms (often reusing established vocabularies like Dublin Core), which are then assembled into different schemas. Look at the Government Service schema as an example; at the bottom of the page you'll see examples of different serialization formats for encoding the schema, including JSON and HTML microdata. NIEM tries to take a similar approach to Schema.org for defining a vocabulary and schemas for use in government that are independent of serialization format. Originally the main serialization format for NIEM schemas was XML, but they can also be represented in JSON and other formats. For what it's worth, I think NIEM and the DATA Act should aspire to the simplicity of the schema documentation on schema.org. I don't think it's acceptable to publish the complete schema documentation only as Word documents and XSD files, and it's almost impossible to find the actual terms and schemas on the NIEM website.

The openness question with data standards is somewhat related to open source software, but it's helpful to keep them separate. It is true that many standards efforts require an open source reference implementation in order to consider it an open standard, but it's also possible to have an open standard (using the Bruce Perens definition) without there being usable (or widely used) open source tools. Proprietary data standards on the other hand often prevent the existence of open source tools to serve them in the first place (without reverse engineering or license fees). In government, we're typically required to use open standards (see A-119), so that shouldn't really be a question. I think the question you're asking is what open standards have robust ecosystems of developers and tools that support them, including open source options.

From what I can tell, it looks like you've already established most of the vocabulary as seen on the data elements page but as far as defining a schema and serialization to enable machine readability and machine interoperability of the vocabulary, what we currently have is an example of how this could be done using XBRL as seen on the data exchange page.

I haven't spent much time with XBRL, but my understanding is that it is a data standard that encompasses vocabulary, schema (XSD), and serialization format (XML), and the example we have now extends XBRL with the terms that have been defined for the DATA Act vocabulary. It appears that there's a lot of activity around XBRL and ongoing updates to the standard (which date back to 2003), but it's not clear to me whether this legacy is helpful or not. It seems like XBRL is viewed as complex, cumbersome, lacking a robust ecosystem of tools and developers, and perhaps not widely used by relevant audiences. That said, usage of XBRL in other areas may warrant XBRL-based interoperability with DATA Act data. It doesn't necessarily have to be one or the other, but it's clear that there are some issues with an XBRL-only approach.

The opposite extreme in terms of simplicity and a robust ecosystem of tools and developers is the CSV. One of the things that makes CSVs such a great format is that they're incredibly easy to use for both developers and non-developers, and they probably have the broadest range of tooling available. The downside to CSVs is that they don't provide a schema, which is needed not only for validation and ensuring interoperability but also for helping developers understand the data. Fortunately, the recent aforementioned efforts like Tabular Data Package (hat tip @rgrp) look like they're doing a great job of addressing this by combining CSVs with a schema defined in JSON using JSON Table Schema (which builds on JSON Schema). @shawnbot also mentioned the W3C's CSV work; the Metadata Vocabulary for Tabular Data spec is actually based on JSON Table Schema and shares authorship, but it seems that JSON Table Schema is more actively used and developed at this point.

The Tabular Data Package seems like a solid candidate for the canonical data schema and serialization, but once you have that it's fairly trivial to generate derivatives in other serialization formats, including JSON or XML. As mentioned earlier, the OpenSpending Budget Data Package Standard uses this approach, but so does the Open Contracting Data Standard. Remember the Schema.org Government Services schema I mentioned earlier? The OpenReferral project is also building on this as a standard with the Human Services Data Specification, which uses Tabular Data Package as well.

While Tabular Data Package is still fairly new, it does already seem to come with good tooling (e.g. OKFN's Good Tables validation service/code) and the fact that it builds on existing well established ecosystems for CSVs and JSON Schema is also quite encouraging.

I do think it's critical to provide a machine readable schema for validation that's code-independent if you want to foster a robust ecosystem of tools that support machine to machine interoperability. The Project Open Data schema is defined using JSON Schema which makes it much easier for multiple codebases to stay in sync whenever changes to the schema are made. This also makes it possible for others to implement validation so we don't all have to depend on one central validation service (though it is important to provide a public one - as Data.gov does). Several agencies have used the canonical JSON Schema for POD Schema within their own codebases and even within Data.gov we use the same schema file with several different code bases in multiple programming languages.
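
As an illustration of what code-independent validation looks like, here's a sketch using the jsonschema package; the schema below is a toy example, not the POD schema:

```python
# Sketch: code-independent validation against a JSON Schema using the
# jsonschema package. The schema here is a toy example, not the POD schema.
from jsonschema import validate

schema = {
    "type": "object",
    "required": ["award_id", "obligation"],
    "properties": {
        "award_id": {"type": "string"},
        "obligation": {"type": "number", "minimum": 0},
    },
}

validate(instance={"award_id": "ABC-123", "obligation": 18000.0}, schema=schema)
```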

Once you have the basic schema and serialization in place (e.g. Tabular Data Package), you probably want to think of more sophisticated ways to make the data available such as a RESTful API that returns queries in JSON, but I think it's important to start with raw bulk data interoperability before we start thinking about standard ways of exposing that as an API. Fortunately, some of the components may be reusable (e.g. you could reuse the JSON Schema definitions from JSON Table Schema for your API standard definition in Swagger).

One other important consideration that I haven't seen discussed much is file size and transmission rate. Protocol Buffers are probably more valuable when bandwidth reduction, realtime transmission, and parsing efficiency are important, but the serialization format is relevant for large files even when you're not dealing with a network. Stream processors for both JSON and XML are widely available, but they're always going to be more complicated than streaming a CSV, which can just be read line by line. Efforts like JSON Lines are trying to make this easier for JSON, but in most cases (if you're not using a special stream processor) you have to read an entire JSON or XML payload into memory before you can parse it, which can be much more limiting than parsing a CSV line by line.
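
A sketch of the streaming difference: both of the loops below process one record at a time without holding the whole payload in memory (file names are hypothetical):

```python
# Sketch: CSV and JSON Lines can both be processed one record at a time,
# without loading the whole payload into memory. File names are hypothetical.
import csv, json

def handle(record):
    pass  # stand-in for whatever per-record work you need

with open("awards.csv", newline="") as f:
    for row in csv.DictReader(f):       # streams one row at a time
        handle(row)

with open("awards.jsonl") as f:
    for line in f:                      # JSON Lines: one object per line
        handle(json.loads(line))
```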

Normalization versus de-normalization is also worth considering. JSON and XML are often used to represent de-normalized relational data whereas a Tabular Data Package would probably be normalized across multiple CSVs just like multiple tables in a relational database. As a bulk data format, my guess is that most people are going to load this data into a spreadsheet or database anyway, so I would suspect normalization is what you want. Plus, normalization reduces file size. Maybe once you're at the API stage de-normalization will be more appropriate.
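
And a sketch of what consuming a normalized package tends to look like--a join on the consumer side, with hypothetical table and column names:

```python
# Sketch: consuming a normalized package usually means a join on the consumer
# side. Table and column names are hypothetical.
import pandas as pd

awards = pd.read_csv("awards.csv")          # award_id, recipient_id, obligation
recipients = pd.read_csv("recipients.csv")  # recipient_id, recipient_name

denormalized = awards.merge(recipients, on="recipient_id", how="left")
print(denormalized.head())
```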

philipashlock avatar Sep 11 '15 00:09 philipashlock

I vote YES for an API that allows [1] a parameter to specify a response in either JSON or XML, and [2] a JSON response with support for a callback parameter, which should allow robust responses depending on the requesting application/framework.

47ronin avatar Sep 11 '15 01:09 47ronin

Financial data gets big quite quickly (millions/billions of transactions, time points, people, ...), so I'm pretty confused why only one format relevant to nation-scale data is considered.

lmeyerov avatar Sep 11 '15 03:09 lmeyerov

(Most of those formats seem more relevant for messaging, and poor for sharing anything sizeable or allowing compute over it.)

lmeyerov avatar Sep 11 '15 03:09 lmeyerov

+1 to what @philipashlock said.

It's also pretty dependent on the data and the audience(s). Here's my general, straight-to-the-point thoughts on opening up data: https://github.com/zzolo/so-you-want-to-open-some-data

zzolo avatar Sep 11 '15 17:09 zzolo

From the Python and Hadoop side of things, I can attest that CSV and JSON alone are problematic from a tooling and fidelity-of-analysis point of view as canonical formats. JSON is acceptable as long as an official schema is published so that the JSON data can be transformed into a binary format like Avro or Parquet for analysis. For larger datasets, Parquet is attractive because it can be analyzed efficiently with Spark, Hive, Presto, Drill, or Impala without need for ETL into columnar format.

Making data exportable in Avro has become an increasingly popular choice — Google BigQuery utilizes it as one of their binary export formats.

So, I would recommend making JSON and a binary-format with a built-in schema (e.g. Avro) both available to data consumers.
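
A sketch of the Avro side in Python, assuming the fastavro package; the schema and records here are hypothetical:

```python
# Sketch: write records to Avro with an embedded schema using the fastavro
# package (pip install fastavro). Schema and records are hypothetical.
from fastavro import parse_schema, writer

schema = parse_schema({
    "name": "Award",
    "type": "record",
    "fields": [
        {"name": "award_id", "type": "string"},
        {"name": "obligation", "type": "double"},
    ],
})

records = [{"award_id": "ABC-123", "obligation": 18000.0}]

with open("awards.avro", "wb") as out:
    writer(out, schema, records)   # the schema travels with the data
```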

wesm avatar Sep 12 '15 19:09 wesm