opendiversitydata

figure out how to make this stuff machine readable

Open hypatia opened this issue 11 years ago • 7 comments

here's the CSV/XML format: http://www.eeoc.gov/employers/eeo1survey/eeo1_cvs_specifications.cfm

ASCII/text format: http://www.eeoc.gov/employers/eeo1survey/ee1_datafile_2013.cfm

hypatia avatar Jun 26 '14 17:06 hypatia

Happy to help out on this, but is YAML the best format? D3, for example, provides data loading functions for CSV/TSV, XML, and JSON blobs natively, but not YAML.

jebeck avatar Jun 27 '14 18:06 jebeck

@jebeck I have no strong feelings about this - whatever you think would be best! Could you have a glance at the existing CSV/XML stuff I linked to and see if we could just work with that directly?

hypatia avatar Jun 27 '14 18:06 hypatia

I glanced at the CSV spec, and it looks pretty terrible (dissimilar kinds of information are distributed across rows instead of columns, which is generally not friendly data design). The XML spec may be better, but I also wonder how often companies choose to submit in this form. In any case, happy to take on the task of writing some tools to translate between the official specs and a simplified format (I'd argue for JSON). We should chat about tools: we might be able to keep it all client-side and use JavaScript, or we could use Python.
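To make the translation idea concrete, here is a minimal Python sketch of converting a row-oriented layout into nested JSON and back. The column names (`company`, `year`, `job_category`, `group`, `count`), the sample values, and the function names are all made up for illustration; this is not the official EEO-1 CSV layout, which would need its own mapping.

```python
import csv
import io
import json

# Hypothetical "long" layout: one row per (job category, group) pair.
# NOT the official EEO-1 layout; columns and values are placeholders.
SAMPLE_CSV = """company,year,job_category,group,count
ExampleCo,2014,Professionals,female,10
ExampleCo,2014,Professionals,male,30
ExampleCo,2014,Technicians,female,5
ExampleCo,2014,Technicians,male,5
"""


def rows_to_reports(csv_text):
    """Collapse long-format rows into one nested dict per (company, year)."""
    reports = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["company"], int(row["year"]))
        report = reports.setdefault(
            key, {"company": key[0], "year": key[1], "job_categories": {}}
        )
        category = report["job_categories"].setdefault(row["job_category"], {})
        category[row["group"]] = int(row["count"])
    return list(reports.values())


def report_to_rows(report):
    """Flatten one nested report back into long-format rows."""
    return [
        {
            "company": report["company"],
            "year": report["year"],
            "job_category": category,
            "group": group,
            "count": count,
        }
        for category, groups in report["job_categories"].items()
        for group, count in groups.items()
    ]


if __name__ == "__main__":
    print(json.dumps(rows_to_reports(SAMPLE_CSV), indent=2))
```

Keeping the nested JSON as the canonical form and deriving the flat rows from it means only one representation has to be edited by hand.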

jebeck avatar Jun 27 '14 23:06 jebeck

Do we have access to the EEO-1 CSV or XML files? I think setting up automated tooling to transform these files may be quite a bit of effort for parsing a small amount of data that is updated and added to at a relatively slow pace.

I don't mean to suggest that we shouldn't do this, but I would like to propose a few alternatives that I think may get us human- and machine-readable data faster. I think the following strategies may get us a win (admittedly not a particularly flexible one) in a short amount of time:

  • Enter data by hand (I think this would take less time than making something automated. Issue #22 already has most of the available data, though I think we need to add the leadership demographic breakdown.)
  • Use a PDF parsing tool like Tabula (thanks @ameliagreenhall for pointing this out)

Also, +1 on aiming for JSON as the data format we keep in this repo. It is machine readable, and human readable & editable.

hougs avatar Jun 29 '14 16:06 hougs

I saw on the Double Union mailing list that another goal is to advocate for a standard data format for releasing diversity data. This seems related to, but not necessarily dependent on, making the currently released data machine readable. Perhaps we could make another ticket for it?

hougs avatar Jun 29 '14 16:06 hougs

AFAIK, @jhlch, we don't have access to the CSV and/or XML data. I think each corporation gets to decide how they want to submit the data (see the links @hypatia pasted opening this issue), and I doubt we're going to have very many, if any, of them releasing the data in these formats.

Given the very small size of these datasets (at least compared to some of the data I'm used to working with...), I think transcription won't be a completely heinous task, unless we start getting 1000s of companies to release data(!!!) (Take note that many of the "PDFs" submitted so far are actually images of a PDF, so something like Tabula isn't going to help much.) Another possibility I'd like to try is setting up a client-side GUI app for transcription; we should be able to leverage the download attribute in browsers that support it to let a transcriber download the results of the form and send it in. Does that sound like a good idea to anyone else or just me? ;)

All in all, my proposal is the following path (and yeah, these should be split out into separate tickets if there's consensus):

  • [ ] write a simple JSON Schema to document the standard format (this will mean we can easily validate crowd-sourced transcriptions)
  • [ ] make a client-side GUI for transcription (basically just a big form, probably laid out in the same way that the PDFs are laid out, so transcription is dead easy)
  • [ ] if desired, write some code to translate between data formats, so that if we have a standard JSON we can also allow downloading the CSV or XML gov't spec; I don't think this is necessary, it'd just be cool :)
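As a rough illustration of the first item, here is a Python sketch of what a schema for one transcribed report might look like, plus a tiny checker. The field names (`company`, `year`, `job_categories`) are assumptions, not an agreed format, and the `validate` helper is hypothetical: it handles only the handful of JSON Schema keywords used here. A real setup would more likely rely on an existing JSON Schema validator library.

```python
# Hypothetical schema for one transcribed report; field names are assumptions.
REPORT_SCHEMA = {
    "type": "object",
    "required": ["company", "year", "job_categories"],
    "properties": {
        "company": {"type": "string"},
        "year": {"type": "integer"},
        "job_categories": {
            "type": "object",
            # Each job category maps group names to non-negative head counts.
            "additionalProperties": {
                "type": "object",
                "additionalProperties": {"type": "integer", "minimum": 0},
            },
        },
    },
}

_TYPES = {"object": dict, "string": str, "integer": int}


def validate(value, schema, path="$"):
    """Return a list of error strings.

    Supports only: type, required, properties, additionalProperties, minimum.
    """
    expected = _TYPES[schema["type"]]
    # bool is a subclass of int in Python, so reject it explicitly for integers.
    if not isinstance(value, expected) or (
        expected is int and isinstance(value, bool)
    ):
        return [f"{path}: expected {schema['type']}"]

    errors = []
    if "minimum" in schema and value < schema["minimum"]:
        errors.append(f"{path}: must be >= {schema['minimum']}")
    if expected is dict:
        for name in schema.get("required", []):
            if name not in value:
                errors.append(f"{path}.{name}: missing required field")
        props = schema.get("properties", {})
        extra = schema.get("additionalProperties")
        for name, item in value.items():
            # Declared properties use their own subschema; others fall back
            # to additionalProperties when it is given as a schema.
            sub = props.get(name, extra)
            if isinstance(sub, dict):
                errors.extend(validate(item, sub, f"{path}.{name}"))
    return errors


if __name__ == "__main__":
    submission = {
        "company": "ExampleCo",
        "job_categories": {"Professionals": {"female": -1}},
    }
    for problem in validate(submission, REPORT_SCHEMA):
        print(problem)
```

Returning a list of problems instead of raising on the first one lets a transcriber fix everything in a single pass.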

I've got some vacation coming up this week, and I've got some other projects to work on as well, but I could definitely do the JSON Schema proposal, maybe get a start on a simple transcription form.

jebeck avatar Jun 29 '14 20:06 jebeck

I remembered that we have a Gmail account for Open Diversity Data. I bet we could make a Google Form and get a spreadsheet auto-populated in Google Docs. This may be an option to consider as a client-side GUI for crowdsourcing the parsing of the PDF data. Just a thought.

hougs avatar Jun 30 '14 06:06 hougs