Xport files NULL original variable type and Informats
UPDATED based on follow-ups: The README statement should state:
- "Probably you want to read a file produced in the original application and use meta.original_variable_types to get the formats."
- There is one issue with handling of NULL formats, noted in discussion below
To Reproduce
Create XPT files from SAS 9.4
/*** Create V5 XPT ***/
%LOC2XPT(FILESPEC='../output/stks_v5.xpt' , LIBREF=SASHELP, MEMLIST=STOCKS, FORMAT=V5)
/*** Create V8 XPT ***/
%LOC2XPT(FILESPEC='../output/stocks_v8.xpt' , LIBREF=SASHELP, MEMLIST=STOCKS, FORMAT=V8)
Check meta read in by read_xport()
import pyreadstat as prs
import sys
print(f"Python version: {sys.version}")
print(f"pyreadstat version: {prs.__version__}")
df_v5, meta_v5 = prs.read_xport("stks_v5.xpt")
df_v8, meta_v8 = prs.read_xport("stocks_v8.xpt")
try:
meta_v5.original_variable_types
meta_v8.original_variable_types
except AttributeError as err:
print(f"failure message: {err}")
prs.write_xport(df_v5,
"prs_v5.xpt",
table_name='PRS_STK5',
variable_format=meta_v5.original_variable_types,
column_labels=meta_v5.column_labels)
prs.write_xport(df_v8,
"prs_v8.xpt",
table_name='PRS_STK8',
variable_format=meta_v8.original_variable_types,
column_labels=meta_v8.column_labels)
Code works fine. One note about implicit formats in discussion, below
Python version: 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)]
pyreadstat version: 1.2.8
File example Above should suffice to reproduce?
Expected behavior README is up to date, and approach works.
Setup Information: See code output above
in case it helps, also updated to STOCKS instead of CLASS, as noted below
Thanks for the report. I think it is an error in the Readme, it should be meta.original_variable_types. Notice however that I think (as far as I remember from having seeing this before, but also I have to say it was some time ago), that in SAS the variable format must be set manually for it to be visible. In your files all of them are set to NULL right now. You can for comparison see sas.xpt5 and sas.xpt8 in test_data/basic in this repository, where the formats are set.
Indeed - My choice of test table was bad. I've updated snippets above to the SASHELP.STOCKS table (with explicit formats). meta.original_variable_types works!
One note about handling of NULL or implicit formats, mentioned above. Handling could be better, since $NULL. is not a native format:
- SAS 9.4 %LOC2XPT produced the first table (attached above)
- pyreadstat produced the second via code above (attached here)
prs_write_xport_stocks_v5_v8.zip
OK, regarding the string 'NULL', I have changed it to be a python None, I attach here the files produced with this variation, would you be so kind to check in the SAS viewer if now the formats look as expected?
Thank you! Looks good. The INformat outcomes (to match FORMATs) seem fair enough to me. I highlight simply for transparency. There could be scenarios when such and "inferred" INformat is not valid.
Thanks for checking!
I think the wrong formats probably come from Readstat, so it would need to be corrected there.
Just to understand, and to report correctly to readstat:
- Formats are from readstat
- Informats are somehow separately and specifically from readstat?
Informats seem particular to SAS/XPORT, so perhaps the root issue is that readstat does not support SAS informats (which are therefore simply inferred from the broadly relevant display formats).
I checked readstat and pyreadstat docs and find little to no mention specifically of INformats.
Ok sorry, let me elaborate on my answer:
Readstat has only one function to get the variable format when reading and one function to set the format when writing. So I am limited to use those functions to read and write the formats, since Readstat does not distinguish between informats and outformats, I cannot recover both or distinguish between both.
What I think is happening here is that Readstat is reading the Format (and not the Informat). So when you read the original file produced by SAS with pyreadstat you get the formats (not the informats), and then you pass them to the writing function, so now those Formats become the new Informat in the file produced by pyreadstat. I am assuming here Informat is the format of the variable in the input file and Format is the format is what SAS sets finally for the file ... In your original files the Informat looks like sas7bdat ones, while in the files written by pyreadstat the Informats look exactly the same as the Format in the original file.
To me it looks correct and makes perfect sense the way it is implemented, but if you would need to recover also the original Informats (to pass them when writing I guess), that would need somebody implementing reading and writing those in Readstat, because as mentioned before right now there is only one function to read and write formats and (apparently) that one is returning and setting the Format and not the Informats.
Digging a bit deeper I have found a couple of interesting things:
- There is a mention to both informats and formats in the Readstat file to read xport, so I assume it is reading both, however as mentioned before I do not see a way to recover the informats/formats separately. I do not see a similar mention in the writing file.
- There is this discussion on R Haven, where they mention that when writing both informats and formats are set, that would explain what we are seeing here. @gorcha did a PR to Readstat around this to make some improvements, so maybe gorcha if kind enough could explain more.
- Some years ago @reikoch did a PR to Readstat to implement reading or writing XPT8. I vaguely remember to have discussed him something about formats and informats, maybe he also knows something about the topic.
Hi guys - good to hear from you again! I did concentrate on labels, in particular dataset labels. Informats are such a SAS-specific thing that I think they "fall down the cracks", maybe they are read correctly, but as far as I see nothing is done with this info. ;-)
That's a helpful reference in Readstat file to read xport!
I redirected to readstat project, with an improvement request:
the 'NULL' to None is now available in version 1.2.9
evanmiller added readstat c functions
const char *readstat_variable_get_informat(const readstat_variable_t *variable);
void readstat_variable_set_informat(readstat_variable_t *variable, const char *informat);
I struggle to test that directly in my env. I think I could test that through updated pyreadstat package.
Sure, I am a bit busy right now, but will do as soon as I can
OK, I did the changes in the branch informat_dev, seems to work. You will be able to read the informats from metadata.original_variable_informats and write passing the argument variable_informat.
Notice that to write in version 5, you need to pass file_format_version=5.
import pyreadstat as prs
import sys
print(f"Python version: {sys.version}")
print(f"pyreadstat version: {prs.__version__}")
print(f"pyreadstat file: {prs.__file__}")
df_v5, meta_v5 = prs.read_xport("stks_v5.xpt")
df_v8, meta_v8 = prs.read_xport("stocks_v8.xpt")
try:
print("ori5", meta_v5.original_variable_types)
print("ori8", meta_v8.original_variable_types)
print("ori5", meta_v5.original_variable_informats)
print("ori8", meta_v8.original_variable_informats)
except AttributeError as err:
print(f"failure message: {err}")
prs.write_xport(df_v5,
"prs_v5.xpt",
file_format_version=5,
table_name='PRS_STK5',
variable_format=meta_v5.original_variable_types,
variable_informat=meta_v5.original_variable_informats,
column_labels=meta_v5.column_labels)
prs.write_xport(df_v8,
"prs_v8.xpt",
table_name='PRS_STK8',
variable_format=meta_v8.original_variable_types,
variable_informat=meta_v5.original_variable_informats,
column_labels=meta_v8.column_labels)
df_v52, meta_v52 = prs.read_xport("prs_v5.xpt")
df_v82, meta_v82 = prs.read_xport("prs_v8.xpt")
print(meta_v5.original_variable_types)
print(meta_v8.original_variable_types)
print("INFORMATS")
print(meta_v5.original_variable_informats)
print(meta_v8.original_variable_informats)
Thanks, once again. I'll check the informat_dev branch.
Looks great to me. Thank you!
Python version: 3.9.12 (main, Oct 5 2023, 23:34:09)
[GCC 8.5.0 20210514 (Red Hat 8.5.0-18)]
pyreadstat version: 1.3.0
ori5 FMT {'STOCK': None, 'DATE': 'DATE', 'OPEN': 'DOLLAR8.2', 'HIGH': 'DOLLAR8.2', 'LOW': 'DOLLAR8.2', 'CLOSE': 'DOLLAR8.2', 'VOLUME': 'COMMA12', 'ADJCLOSE': 'DOLLAR8.2'}
ori5 INF {'STOCK': None, 'DATE': 'DATE', 'OPEN': 'BEST32', 'HIGH': 'BEST32', 'LOW': 'BEST32', 'CLOSE': 'BEST32', 'VOLUME': 'BEST32', 'ADJCLOSE': 'BEST32'}
ori8 FMT {'Stock': None, 'Date': 'DATE', 'Open': 'DOLLAR8.2', 'High': 'DOLLAR8.2', 'Low': 'DOLLAR8.2', 'Close': 'DOLLAR8.2', 'Volume': 'COMMA12', 'AdjClose': 'DOLLAR8.2'}
ori8 INF {'Stock': None, 'Date': 'DATE', 'Open': 'BEST32', 'High': 'BEST32', 'Low': 'BEST32', 'Close': 'BEST32', 'Volume': 'BEST32', 'AdjClose': 'BEST32'}
FORMATS (types)
prs5 FMT {'STOCK': None, 'DATE': 'DATE', 'OPEN': 'DOLLAR8.2', 'HIGH': 'DOLLAR8.2', 'LOW': 'DOLLAR8.2', 'CLOSE': 'DOLLAR8.2', 'VOLUME': 'COMMA12', 'ADJCLOSE': 'DOLLAR8.2'}
prs8 FMT {'Stock': None, 'Date': 'DATE', 'Open': 'DOLLAR8.2', 'High': 'DOLLAR8.2', 'Low': 'DOLLAR8.2', 'Close': 'DOLLAR8.2', 'Volume': 'COMMA12', 'AdjClose': 'DOLLAR8.2'}
V5 FORMATS (type) roundtrip OK? True
V8 FORMATS (type) roundtrip OK? True
INFORMATS
prs5 INF {'STOCK': None, 'DATE': 'DATE', 'OPEN': 'BEST32', 'HIGH': 'BEST32', 'LOW': 'BEST32', 'CLOSE': 'BEST32', 'VOLUME': 'BEST32', 'ADJCLOSE': 'BEST32'}
prs8 INF {'Stock': None, 'Date': 'DATE', 'Open': 'BEST32', 'High': 'BEST32', 'Low': 'BEST32', 'Close': 'BEST32', 'Volume': 'BEST32', 'AdjClose': 'BEST32'}
V5 INFORMATS roundtrip OK? True
V8 INFORMATS roundtrip OK? True