pyreadstat icon indicating copy to clipboard operation
pyreadstat copied to clipboard

Xport files NULL original variable type and Informats

Open DanteDT opened this issue 9 months ago • 17 comments

UPDATED based on follow-ups: The README statement should state:

  • "Probably you want to read a file produced in the original application and use meta.original_variable_types to get the formats."
  • There is one issue with handling of NULL formats, noted in discussion below

To Reproduce

Create XPT files from SAS 9.4

/*** Create V5 XPT ***/
%LOC2XPT(FILESPEC='../output/stks_v5.xpt' , LIBREF=SASHELP, MEMLIST=STOCKS, FORMAT=V5)

/*** Create V8 XPT ***/
%LOC2XPT(FILESPEC='../output/stocks_v8.xpt' , LIBREF=SASHELP, MEMLIST=STOCKS, FORMAT=V8)

Check meta read in by read_xport()

import pyreadstat as prs
import sys

print(f"Python version: {sys.version}")
print(f"pyreadstat version: {prs.__version__}")

df_v5, meta_v5 = prs.read_xport("stks_v5.xpt")
df_v8, meta_v8 = prs.read_xport("stocks_v8.xpt")

try:
    meta_v5.original_variable_types
    meta_v8.original_variable_types
except AttributeError as err:
    print(f"failure message: {err}")

prs.write_xport(df_v5,
                "prs_v5.xpt",
                table_name='PRS_STK5',
                variable_format=meta_v5.original_variable_types,
                column_labels=meta_v5.column_labels)
prs.write_xport(df_v8,
                "prs_v8.xpt",
                table_name='PRS_STK8',
                variable_format=meta_v8.original_variable_types,
                column_labels=meta_v8.column_labels)

Code works fine. One note about implicit formats in discussion, below

Python version: 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)]
pyreadstat version: 1.2.8

File example Above should suffice to reproduce?

Expected behavior README is up to date, and approach works.

Setup Information: See code output above

DanteDT avatar May 08 '25 16:05 DanteDT

in case it helps, also updated to STOCKS instead of CLASS, as noted below

sas94_loc2xpt_stocks_v5_v8.zip

DanteDT avatar May 08 '25 17:05 DanteDT

Thanks for the report. I think it is an error in the Readme, it should be meta.original_variable_types. Notice however that I think (as far as I remember from having seeing this before, but also I have to say it was some time ago), that in SAS the variable format must be set manually for it to be visible. In your files all of them are set to NULL right now. You can for comparison see sas.xpt5 and sas.xpt8 in test_data/basic in this repository, where the formats are set.

ofajardo avatar May 12 '25 08:05 ofajardo

Indeed - My choice of test table was bad. I've updated snippets above to the SASHELP.STOCKS table (with explicit formats). meta.original_variable_types works!

One note about handling of NULL or implicit formats, mentioned above. Handling could be better, since $NULL. is not a native format:

  • SAS 9.4 %LOC2XPT produced the first table (attached above)
  • pyreadstat produced the second via code above (attached here)

prs_write_xport_stocks_v5_v8.zip

Image

DanteDT avatar May 12 '25 21:05 DanteDT

OK, regarding the string 'NULL', I have changed it to be a python None, I attach here the files produced with this variation, would you be so kind to check in the SAS viewer if now the formats look as expected?

null2none.zip

ofajardo avatar May 13 '25 10:05 ofajardo

Thank you! Looks good. The INformat outcomes (to match FORMATs) seem fair enough to me. I highlight simply for transparency. There could be scenarios when such and "inferred" INformat is not valid.

Image

DanteDT avatar May 13 '25 17:05 DanteDT

Thanks for checking!

I think the wrong formats probably come from Readstat, so it would need to be corrected there.

ofajardo avatar May 13 '25 18:05 ofajardo

Just to understand, and to report correctly to readstat:

  • Formats are from readstat
  • Informats are somehow separately and specifically from readstat?

Informats seem particular to SAS/XPORT, so perhaps the root issue is that readstat does not support SAS informats (which are therefore simply inferred from the broadly relevant display formats).

I checked readstat and pyreadstat docs and find little to no mention specifically of INformats.

DanteDT avatar May 13 '25 20:05 DanteDT

Ok sorry, let me elaborate on my answer:

Readstat has only one function to get the variable format when reading and one function to set the format when writing. So I am limited to use those functions to read and write the formats, since Readstat does not distinguish between informats and outformats, I cannot recover both or distinguish between both.

What I think is happening here is that Readstat is reading the Format (and not the Informat). So when you read the original file produced by SAS with pyreadstat you get the formats (not the informats), and then you pass them to the writing function, so now those Formats become the new Informat in the file produced by pyreadstat. I am assuming here Informat is the format of the variable in the input file and Format is the format is what SAS sets finally for the file ... In your original files the Informat looks like sas7bdat ones, while in the files written by pyreadstat the Informats look exactly the same as the Format in the original file.

To me it looks correct and makes perfect sense the way it is implemented, but if you would need to recover also the original Informats (to pass them when writing I guess), that would need somebody implementing reading and writing those in Readstat, because as mentioned before right now there is only one function to read and write formats and (apparently) that one is returning and setting the Format and not the Informats.

ofajardo avatar May 14 '25 07:05 ofajardo

Digging a bit deeper I have found a couple of interesting things:

  • There is a mention to both informats and formats in the Readstat file to read xport, so I assume it is reading both, however as mentioned before I do not see a way to recover the informats/formats separately. I do not see a similar mention in the writing file.
  • There is this discussion on R Haven, where they mention that when writing both informats and formats are set, that would explain what we are seeing here. @gorcha did a PR to Readstat around this to make some improvements, so maybe gorcha if kind enough could explain more.
  • Some years ago @reikoch did a PR to Readstat to implement reading or writing XPT8. I vaguely remember to have discussed him something about formats and informats, maybe he also knows something about the topic.

ofajardo avatar May 14 '25 08:05 ofajardo

Hi guys - good to hear from you again! I did concentrate on labels, in particular dataset labels. Informats are such a SAS-specific thing that I think they "fall down the cracks", maybe they are read correctly, but as far as I see nothing is done with this info. ;-)

reikoch avatar May 14 '25 15:05 reikoch

That's a helpful reference in Readstat file to read xport!

I redirected to readstat project, with an improvement request:

DanteDT avatar May 16 '25 16:05 DanteDT

the 'NULL' to None is now available in version 1.2.9

ofajardo avatar May 17 '25 10:05 ofajardo

evanmiller added readstat c functions

const char *readstat_variable_get_informat(const readstat_variable_t *variable);
void readstat_variable_set_informat(readstat_variable_t *variable, const char *informat);

I struggle to test that directly in my env. I think I could test that through updated pyreadstat package.

DanteDT avatar May 23 '25 19:05 DanteDT

Sure, I am a bit busy right now, but will do as soon as I can

ofajardo avatar May 25 '25 07:05 ofajardo

OK, I did the changes in the branch informat_dev, seems to work. You will be able to read the informats from metadata.original_variable_informats and write passing the argument variable_informat.

Notice that to write in version 5, you need to pass file_format_version=5.

import pyreadstat as prs
import sys

print(f"Python version: {sys.version}")
print(f"pyreadstat version: {prs.__version__}")
print(f"pyreadstat file: {prs.__file__}")

df_v5, meta_v5 = prs.read_xport("stks_v5.xpt")
df_v8, meta_v8 = prs.read_xport("stocks_v8.xpt")

try:
    print("ori5", meta_v5.original_variable_types)
    print("ori8", meta_v8.original_variable_types)
    print("ori5", meta_v5.original_variable_informats)
    print("ori8", meta_v8.original_variable_informats)
except AttributeError as err:
    print(f"failure message: {err}")


prs.write_xport(df_v5,
                "prs_v5.xpt",
                file_format_version=5,
                table_name='PRS_STK5',
                variable_format=meta_v5.original_variable_types,
                variable_informat=meta_v5.original_variable_informats,
                column_labels=meta_v5.column_labels)
prs.write_xport(df_v8,
                "prs_v8.xpt",
                table_name='PRS_STK8',
                variable_format=meta_v8.original_variable_types,
                variable_informat=meta_v5.original_variable_informats,
                column_labels=meta_v8.column_labels)

df_v52, meta_v52 = prs.read_xport("prs_v5.xpt")
df_v82, meta_v82 = prs.read_xport("prs_v8.xpt")
print(meta_v5.original_variable_types)
print(meta_v8.original_variable_types)
print("INFORMATS")
print(meta_v5.original_variable_informats)
print(meta_v8.original_variable_informats)

ofajardo avatar May 26 '25 14:05 ofajardo

Thanks, once again. I'll check the informat_dev branch.

DanteDT avatar May 26 '25 18:05 DanteDT

Looks great to me. Thank you!

Python version: 3.9.12 (main, Oct  5 2023, 23:34:09) 
[GCC 8.5.0 20210514 (Red Hat 8.5.0-18)]
pyreadstat version: 1.3.0

ori5 FMT {'STOCK': None, 'DATE': 'DATE', 'OPEN': 'DOLLAR8.2', 'HIGH': 'DOLLAR8.2', 'LOW': 'DOLLAR8.2', 'CLOSE': 'DOLLAR8.2', 'VOLUME': 'COMMA12', 'ADJCLOSE': 'DOLLAR8.2'}
ori5 INF {'STOCK': None, 'DATE': 'DATE', 'OPEN': 'BEST32', 'HIGH': 'BEST32', 'LOW': 'BEST32', 'CLOSE': 'BEST32', 'VOLUME': 'BEST32', 'ADJCLOSE': 'BEST32'}
ori8 FMT {'Stock': None, 'Date': 'DATE', 'Open': 'DOLLAR8.2', 'High': 'DOLLAR8.2', 'Low': 'DOLLAR8.2', 'Close': 'DOLLAR8.2', 'Volume': 'COMMA12', 'AdjClose': 'DOLLAR8.2'}
ori8 INF {'Stock': None, 'Date': 'DATE', 'Open': 'BEST32', 'High': 'BEST32', 'Low': 'BEST32', 'Close': 'BEST32', 'Volume': 'BEST32', 'AdjClose': 'BEST32'}

FORMATS (types)
prs5 FMT {'STOCK': None, 'DATE': 'DATE', 'OPEN': 'DOLLAR8.2', 'HIGH': 'DOLLAR8.2', 'LOW': 'DOLLAR8.2', 'CLOSE': 'DOLLAR8.2', 'VOLUME': 'COMMA12', 'ADJCLOSE': 'DOLLAR8.2'}
prs8 FMT {'Stock': None, 'Date': 'DATE', 'Open': 'DOLLAR8.2', 'High': 'DOLLAR8.2', 'Low': 'DOLLAR8.2', 'Close': 'DOLLAR8.2', 'Volume': 'COMMA12', 'AdjClose': 'DOLLAR8.2'}
V5 FORMATS (type) roundtrip OK? True
V8 FORMATS (type) roundtrip OK? True

INFORMATS
prs5 INF {'STOCK': None, 'DATE': 'DATE', 'OPEN': 'BEST32', 'HIGH': 'BEST32', 'LOW': 'BEST32', 'CLOSE': 'BEST32', 'VOLUME': 'BEST32', 'ADJCLOSE': 'BEST32'}
prs8 INF {'Stock': None, 'Date': 'DATE', 'Open': 'BEST32', 'High': 'BEST32', 'Low': 'BEST32', 'Close': 'BEST32', 'Volume': 'BEST32', 'AdjClose': 'BEST32'}
V5 INFORMATS roundtrip OK? True
V8 INFORMATS roundtrip OK? True

z_pyreadstat_stock.py.txt

Image

DanteDT avatar May 31 '25 19:05 DanteDT