Reading `.dta` with value labels
As you know, Stata stores a value-labeled variable as a vector of integers or doubles (the codes are not necessarily a contiguous sequence starting at 1) together with a mapping from codes to labels, essentially a `Dict` going from `Int` to `String`.
Accessing the string values, which we generally care about most, is hard with ReadStat. You have to
- Use ReadStat, not StatFiles, to access the internal fields of the Stata file
- Construct the `DataFrame` from the data and header fields
- Use the `value_label_dict` field to perform the replacement
- Use `get` on the DataValue elements of the array
This is not the most user-friendly workflow.
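The replacement step in the list above boils down to a dictionary lookup per element. Here is a minimal sketch in plain Julia, with no ReadStat calls: `codes`, `labels`, and `apply_labels` are hypothetical names standing in for one column and its label mapping, and `missing` is used here in place of DataValue elements, so treat it as an illustration of the idea rather than the package's API.

```julia
# Hypothetical stand-ins for what a .dta reader hands back:
# the raw codes of one column and that column's value-label mapping.
codes  = Union{Int,Missing}[1, 5, 9, missing, 5]
labels = Dict(1 => "urban", 5 => "rural")   # note: code 9 has no label

# Replace each code with its label. Codes without a label fall back to
# the code printed as a string, and missing values stay missing.
function apply_labels(codes, labels)
    return [ismissing(c) ? missing : get(labels, c, string(c)) for c in codes]
end

strings = apply_labels(codes, labels)
```

Note that the result only holds strings, so the original integer codes are gone unless you keep `codes` around separately.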
There isn't a great solution for this in Julia, as we don't have a CategoricalArray equivalent where the base dict maps arbitrary types to strings. So converting to a categorical array will drop the underlying integers, which are useful to keep for interoperability.
haven in R recently changed how this is handled, using the `<dbl+lbl>` vector type, though working with it is a bit of a pain, see here.
I can email a dataset with an MWE to anyone who wants more information.
I would like to work on this, as I have to deal with .dta files quite regularly and I know the pain of handling Stata labels (in R or in general). I have also read the issues on adding metadata to dataframes and the discussion regarding metadata in DataAPI. As I believe I come from a similar context (lots of household survey data), I agree with a lot of the points @pdeffebach made there, especially about persistent metadata (like in Stata) being super useful. However, as there does not seem to be a great solution on the horizon, what would be the general idea for implementing a solution that allows for a better workflow with .dta files?
Is the idea to create a global dict which allows for swapping integers with string labels through some mapping based on column name? Should I look into Metadata.jl as a possible dependency for that? I have not worked with Metadata.jl before, but as far as I understand it seems to use the approach of a global dict.
I might need a lot of guidance, as this is my first open-source contribution; sorry in advance.
I think a custom array type would handle this pretty easily, something based on CategoricalArrays.jl. But that might be a big task for someone doing their first open-source contribution.
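To make the custom array idea concrete, here is a rough sketch of what such a type could look like, using only Base Julia. The names `LabeledVector`, `label`, and `as_strings` are hypothetical and not taken from CategoricalArrays.jl or any other package, and the sketch ignores missing values entirely.

```julia
# Hypothetical type: keeps the raw Stata codes and carries the labels along.
struct LabeledVector{T<:Real} <: AbstractVector{T}
    codes::Vector{T}          # raw values, kept untouched
    labels::Dict{T,String}    # value => label; may cover only some values
end

# Minimal AbstractVector interface: the array behaves like its codes.
Base.size(v::LabeledVector) = size(v.codes)
Base.getindex(v::LabeledVector, i::Int) = v.codes[i]
Base.IndexStyle(::Type{<:LabeledVector}) = IndexLinear()

# Label lookup; values without a label fall back to their printed form.
label(v::LabeledVector, i::Int) = get(v.labels, v.codes[i], string(v.codes[i]))
as_strings(v::LabeledVector) = [label(v, i) for i in eachindex(v)]

v = LabeledVector([1, 5, 9, 5], Dict(1 => "urban", 5 => "rural"))
total = sum(v)           # arithmetic still sees the integers
strs  = as_strings(v)    # string view of the same data
```

The point of the design is that the integers remain the element type, so anything that works on an `AbstractVector` of numbers keeps working, while the labels travel with the column instead of living in a separate global dict. A real implementation would also need to handle missing values, `setindex!`, and copying.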