pyreadstat Merge SPSS files, both data frame and metadata

Dataframes can be easily merged using pd.merge(); on the other hand, merge metadata is a pain. It would be great if it would be possible to have a pyreadstat.merge_sav() method with the same parameters of pd.merge plus metadata for left and right.

May 21 '21 17:05 pepcmarques

I have written some code that does this and I am happy to share this. However, am I right in thinking that left or right join's would surely only keep the left or right metadata?

Aug 31 '22 16:08 MDS-JAnthony

Hi @MDS-JAnthony, thanks for the update. Yes, it is my understanding cause after the left or right merge you will get only the left or right data.

Aug 31 '22 16:08 pepcmarques

I guess you are talking about the key column? That is one possibility. The other is try to keep everything anyway. What in the case of an inner join or outer join and the metadata in both dataframes differ? How do you reconciliate? For the non key columns, if they are repeated, you have to rename them as pandas will do when joining. I guess there are a lot of details to take into account for this task and decisions to make which may require different solutions on each case, and for that reason I am not much in favor of including this in pyreadstat, but of course you can share your solution here for others to use and modify it accordings to their needs.

Aug 31 '22 17:08 ofajardo

Hi @ofajardo, thank you for your comments. You are totally right, there are a lot of details. However including some extra parameters, something like main_meta='left' (or right). What if for the non-key columns, if they are repeated, keep them for another version?

Aug 31 '22 18:08 pepcmarques

I think for no key columns pandas rename them as column_x, column_y? If so, the same has to happen here. You would need to check all possible things AND write tests for every possible corner case.

Orherwise you could just share your code snippet for the people to use it as is. Then they can modify it if they need and no need to maintain and deal with user issues in the future.

Aug 31 '22 18:08 ofajardo

I will be tidying up my code for this today/tomorrow, so should be able to share an initial version of the code.

Sep 01 '22 08:09 MDS-JAnthony

Here is my initial code.

Hopefully people can modify this to their needs.

Any quetions just ask.

class MergeSPSSData():
    """
    Merge meta data class
    This class assumes that if a column name and column label are the same, then they are the
            same question
    For questions that have the same name but different label, these shall be renamed to _x for the
        left_df, left_meta and _y for the right_df, right_meta
    """

    def __init__(self, left_df, right_df, left_meta, right_meta, how):
        """
        :param left_df (pd.DataFrame)
        :param right_df (pd.DataFrame)
        :param left_meta (metadata container)
        :param right_meta (metadata container)
        :param how (string) Must be one of 'left', 'right', 'inner', 'outer'.
        """
        if left_df is None:
            raise ValueError('left_df cannot be empty')
        if right_df is None:
            raise ValueError('right_df cannot be empty')
        if left_meta is None:
            raise ValueError('left_meta cannot be empty')
        if right_meta is None:
            raise ValueError('right_meta cannot be empty')
        if how not in ['left', 'right', 'inner', 'outer']:
            raise ValueError(f'how must be one of left, right, inner or outer, {how}')
        self.left_meta = left_meta
        self.right_meta = right_meta
        self.left_df = left_df
        self.right_df = right_df
        self.how = how
        # list to hold questions that have the same column name and label
        self.same_cols = []
        # list to hold questions that have the same column name but different label
        self.different_cols = {}
        self.left_df_rename_col_dict = {}
        self.right_df_rename_col_dict = {}
        # determine which columns are the same/different
        self._check_cols()
        # rename columns that share same name but different labels
        self._rename_variables()
        # calculate length of columns
        self.len_col_names_left_meta = len(left_meta.column_names)
        self.len_col_names_right_meta = len(right_meta.column_names)
        self.len_col_labels_left_meta = len(left_meta.column_labels)
        self.len_col_labels_right_meta = len(right_meta.column_labels)

    def _check_dicts(self, dict_1, dict_2):
        """
        Method to check if two dicts can be merged and that no key values pairs don't match
        :param dict_1 (dict)
        :param dict_2 (dict)
        """
        for key in dict_1:
            # TODO: the value in the dict could be a type other than string
            if (key in dict_2) and (dict_1[key] != dict_2[key]):
                logging.info(f'{key}: {dict_1[key]}, overlaps with {dict_2[key]}')
                raise RuntimeError(f'{dict_1}, {dict_2} have an overlapping key value pair, {key}')

    def _check_cols(self):
        """
        Method to determine which columns are the same and which are different
        This class assumes that if a column name and column label are the same, then they are the
            same question
        """
        tmp_keys_in_both = []
        for key in self.left_meta.column_names_to_labels:
            if key in self.right_meta.column_names_to_labels:
                tmp_keys_in_both.append(key)
                if self.left_meta.column_names_to_labels[key] == self.right_meta.column_names_to_labels[key]:
                    self.same_cols.append(key)
                elif self.left_meta.column_names_to_labels[key] != self.right_meta.column_names_to_labels[key]:
                    self.different_cols[key] = 'l'
                elif self.right_meta.column_names_to_labels[key] != self.left_meta.column_names_to_labels[key]:
                    self.different_cols[key] = 'r'

    def _rename_variables(self):
        """
        Method to rename variables in both dataframes and meta data
        """
        for key, value in self.different_cols.items():
            self.left_df.rename(
                columns={key: key + '_x'},
                inplace=True)
            self.left_df.rename(
                columns={key: key + '_y'},
                inplace=True)
            # metadata
            self._update_name_metadata(
                col=key,
                l_or_r=value)

    def _update_name_metadata(self, col, l_or_r):
        """
        Method to update col to col_x or col_y
        :param col (string) name of variable to rename in metadata
        :param l_or_r (string) must be one of l or r for left or right.
        """
        if l_or_r not in ['l', 'r']:
            raise ValueError(f'l_or_r must be one of l or r, {l_or_r}')
        if l_or_r == 'l':
            tmp_str = '_x'
        elif l_or_r =='r':
            tmp_str = '_y'
        # column name
        self.left_meta.column_names.remove(col)
        self.left_meta.column_names.append(col + tmp_str)
        # column_names_to_labels
        tmp_value = self.left_meta.column_names_to_labels[col]
        self.left_meta.column_names_to_labels.pop(col)
        self.left_meta.column_names_to_labels[col + tmp_str] = tmp_value
        # variable_value_labels
        if col in self.left_meta.variable_value_labels:
            tmp_dict = self.left_meta.variable_value_labels[col]
            self.left_meta.variable_value_labels.pop(col)
            self.left_meta.variable_value_labels[col + tmp_str] = tmp_dict
        # variable_to_label
        if col in self.left_meta.variable_to_label:
            tmp_value = self.left_meta.variable_to_label[col]
            self.left_meta.variable_to_label.pop(col)
            self.left_meta.variable_to_label[col + tmp_str] = tmp_value
        # readstat_variable_types
        tmp_value = self.left_meta.readstat_variable_types[col]
        self.left_meta.readstat_variable_types.pop(col)
        self.left_meta.readstat_variable_types[col + tmp_str] = tmp_value
        # variable_alignment
        tmp_value = self.left_meta.variable_alignment[col]
        self.left_meta.variable_alignment.pop(col)
        self.left_meta.variable_alignment[col + tmp_str] = tmp_value
        # variable_storage_width
        tmp_value = self.left_meta.variable_storage_width[col]
        self.left_meta.variable_storage_width.pop(col)
        self.left_meta.variable_storage_width[col + tmp_str] = tmp_value
        # variable_display_width
        tmp_value = self.left_meta.variable_display_width[col]
        self.left_meta.variable_display_width.pop(col)
        self.left_meta.variable_display_width[col + tmp_str] = tmp_value
        # variable_measure
        tmp_value = self.left_meta.variable_measure[col]
        self.left_meta.variable_measure.pop(col)
        self.left_meta.variable_measure[col + tmp_str] = tmp_value

    def merge_metadata(self):
        """
        Method to merge the meta data for SPSS files
        :return metadata
        """
        # The following variables need to be updated in the meta data container
        # List of what has to be updated in meta data container, there other variables but not
        #   affected by adding a new variable
        # column_names : a list with the names of the columns.
        # column_labels : a list with the column labels, if any.
        # column_names_to_labels : a dictionary with column_names as keys and column_labels as
        #   values
        # number_columns : an int with the number of columns
        # variable_value_labels : a dict with keys being variable names, and values being a dict
        #   with values as keys and labels as values. It may be empty if the dataset did not
        #   contain such labels. For sas7bdat files it will be empty unless a sas7bcat was given.
        #   It is a combination of value_labels and variable_to_label.
        # value_labels : a dict with label name as key and a dict as value, with values as keys and
        #   labels as values. In the case of parsing a sas7bcat file this is where the formats are.
        # variable_to_label : A dict with variable name as key and label name as value. Label names
        #   are those described in value_labels. Sas7bdat files may have this member populated and
        #   its information can be used to match the information in the value_labels coming from the
        #   sas7bcat file.
        # original_variable_types : a dict of variable name to variable format in the original file.
        #   For debugging purposes.
        # readstat_variable_types : a dict of variable name to variable type in the original file as
        #   extracted by Readstat.i For debugging purposes. In SAS and SPSS variables will be either
        #   double (numeric in the original app) or string (character). Stata has in addition int8,
        #   int32 and float types.
        # variable_alignment: a dict with keys being variable names and values being the display
        #   alignment: left, center, right or unknown
        # variable_storage_width: a dict with keys being variable names and values being the
        #   storage width
        # variable_display_width: a dict with keys being variable names and values being the
        #   display width
        # variable_measure: a dict with keys being variable names and values being the measure:
        #   nominal, ordinal, scale or unknown
        # ------------------------------------------------------------------------------------------
        # Generate empty metadata
        metadata = None
        # ------------------------------------------------------------------------------------------
        # how = 'outer' keep all meta data from both sources
        if self.how == 'outer':
            # set initial metadata as left_meta
            metadata = self.left_meta
            # update columns_names (list)
            metadata.column_names = self.left_meta.column_names.copy() + \
                list(set(self.right_meta.column_names.copy()) - set(self.left_meta.column_names.copy()))
            if len(metadata.column_names) < max(self.len_col_names_left_meta, self.len_col_labels_right_meta):
                raise RuntimeError(
                    'Merge meta data column names has resulted in a new list shorter than expected')
            # update column_labels (list)
            metadata.column_labels = self.left_meta.column_labels + \
                list(set(self.right_meta.column_labels) - set(self.left_meta.column_labels))
            if len(metadata.column_labels) < max(self.len_col_labels_left_meta, self.len_col_labels_right_meta):
                raise RuntimeError(
                    'Merge meta data column labels has resulted in a new list shorter than expected')
            # update column_names_to_labels (dict)
            for key in self.right_meta.column_names_to_labels:
                if key not in metadata.column_names_to_labels:
                    metadata.column_names_to_labels[key] = self.right_meta.column_names_to_labels[key]
            # update number_columns
            metadata.number_columns = len(metadata.column_names)
            # variable_value_labels (dicts)
            for key in self.right_meta.variable_value_labels:
                if key not in metadata.variable_value_labels:
                    metadata.variable_value_labels[key] = self.right_meta.variable_value_labels[key]
                # update dicts if they are not the same
                if metadata.variable_value_labels[key] != self.right_meta.variable_value_labels[key]:
                    # check that you can merge dicts
                    self._check_dicts(
                        metadata.variable_value_labels[key],
                        self.right_meta.variable_value_labels[key])
                    # No runtime error raised, so can merge dicts safely
                    metadata.variable_value_labels[key].update(self.right_meta.variable_value_labels[key])
            # update original_variable_types
            # TODO: original_variable_types
            # update readstat_variable_types
            for key in self.right_meta.readstat_variable_types:
                if key not in metadata.readstat_variable_types:
                    metadata.readstat_variable_types[key] = self.right_meta.readstat_variable_types[key]
            # update variable_alignment
            for key in self.right_meta.variable_alignment:
                if key not in metadata.variable_alignment:
                    metadata.variable_alignment[key] = self.right_meta.variable_alignment[key]
            # update variable_storage_width
            for key in self.right_meta.variable_storage_width:
                if key not in metadata.variable_storage_width:
                    metadata.variable_storage_width[key] = self.right_meta.variable_storage_width[key]
            # update variable_display_width
            for key in self.right_meta.variable_display_width:
                if key not in metadata.variable_display_width:
                    metadata.variable_display_width[key] = self.right_meta.variable_display_width[key]
            # update variable_measure
            for key in self.right_meta.variable_measure:
                if key not in metadata.variable_measure:
                    metadata.variable_measure[key] = self.right_meta.variable_measure[key]
        elif self.how == 'inner':
            # set initial metadata as left_meta
            metadata = self.left_meta
            # update columns_names (list)
            metadata.column_names = list(set(self.right_meta.column_names) & set(self.left_meta.column_names))
            if len(metadata.column_names) > min(len(self.right_meta.column_names), len(self.left_meta.column_names)):
                raise RuntimeError(
                    'Merge meta data column names has resulted in a new list longer than expected')
            # update column_labels (list)
            metadata.column_labels = list(set(self.right_meta.column_labels) & set(self.left_meta.column_labels))
            if len(metadata.column_labels) > min(len(self.right_meta.column_labels), len(self.left_meta.column_labels)):
                raise RuntimeError(
                    'Merge meta data column labels has resulted in a new list longer than expected')
            # update column_names_to_labels (dict)
            for key in self.right_meta.column_names_to_labels:
                if key not in metadata.column_names_to_labels:
                    metadata.column_names_to_labels.pop(key)
            # update number_columns
            metadata.number_columns = len(metadata.column_names)
            # variable_value_labels (dicts)
            for key in self.right_meta.variable_value_labels:
                if key not in metadata.variable_value_labels:
                    metadata.variable_value_labels.pop(key)
                # update dicts if they are not the same
                if metadata.variable_value_labels[key] != self.right_meta.variable_value_labels[key]:
                    # check that you can merge dicts
                    self._check_dicts(
                        metadata.variable_value_labels[key],
                        self.right_meta.variable_value_labels[key])
                    # No runtime error raised, so can merge dicts safely
                    metadata.variable_value_labels[key].update(self.right_meta.variable_value_labels[key])
            # update original_variable_types
            # TODO: original_variable_types
            # update readstat_variable_types
            for key in self.right_meta.readstat_variable_types:
                if key not in metadata.readstat_variable_types:
                    metadata.readstat_variable_types.pop(key)
            # update variable_alignment
            for key in self.right_meta.variable_alignment:
                if key not in metadata.variable_alignment:
                    metadata.variable_alignment.pop(key)
            # update variable_storage_width
            for key in self.right_meta.variable_storage_width:
                if key not in metadata.variable_storage_width:
                    metadata.variable_storage_width.pop(key)
            # update variable_display_width
            for key in self.right_meta.variable_display_width:
                if key not in metadata.variable_display_width:
                    metadata.variable_display_width.pop(key)
            # update variable_measure
            for key in self.right_meta.variable_measure:
                if key not in metadata.variable_measure:
                    metadata.variable_measure.pop(key)
        elif self.how == 'left':
            # set initial metadata as left_meta
            metadata = self.left_meta
        elif self.how == 'right':
            # set initial metadata as left_meta
            metadata = self.right_meta
        return metadata

    def merge_spss_data(self):
        """
        Method to merge the dataframes
        """
        # use pd.merge
        survey_df = pd.merge(
            self.left_df,
            self.right_df,
            how=self.how,
            on=self.same_cols)
        if (self.how == 'outer') and (len(self.left_df) + len(self.right_df)) != len(survey_df):
            raise RuntimeError('The new dataframe is not the length of the input dataframes added together')
        elif (self.how == 'left') and len(self.left_df) != len(survey_df):
            raise RuntimeError('The new dataframe is not the length of the input dataframes added together')
        elif (self.how == 'right') and len(self.left_df) != len(survey_df):
            raise RuntimeError('The new dataframe is not the length of the input dataframes added together')
        return survey_df

Sep 05 '22 13:09 MDS-JAnthony

I did this:

import pandas as pd
import pyreadstat


def observe_metadata(metadata):
    print("==================================== ENTER: OBSERVE METADATA ====================================")
    # Metadata nesnesinin içeriğini daha ayrıntılı görmek için pprint kullanın.
    # pprint.pprint(metadata.__dict__)
    print("Sütun (Değişken) adları:")
    print(metadata.column_names)
    # print("Sütun (Değişken) adları ve etiketleri:")
    # print(metadata.column_names_to_labels)
    print("Satır sayısı: ", metadata.number_rows)
    print("Sütun sayısı:", metadata.number_columns)
    print("Dosya kodlama biçimi: ", metadata.file_encoding)
    print("==================================== OUT: OBSERVE METADATA ====================================")


# PISA 2022 veri seti dosyasının yolu:
file_path_stu = 'CY08MSP_STU_QQQ_Türkiye_and_filtered_variables.sav'
file_path_sch = 'CY08MSP_SCH_QQQ_Türkiye_and_filtered_variables.sav'

# Filtrelenmiş veriyi .sav dosyası olarak kaydetmek için yol belirt.
output_file_path = 'CY08MSP_STU_and_SCH_QQQ_Türkiye_and_filtered_variables_merged_4.sav'

on_column = 'CNTSCHID'

# PISA 2022 verisininin metadatasını oku.
df_stu, meta_stu = pyreadstat.read_sav(
    file_path_stu, metadataonly=True, encoding="UTF-8",)
df_sch, meta_sch = pyreadstat.read_sav(
    file_path_sch, metadataonly=True, encoding="UTF-8",)

print("ÖĞRENCİ ORİJİNAL VERİ:")
observe_metadata(meta_stu)
print("OKUL ORİJİNAL VERİ:")
observe_metadata(meta_sch)

common_columns = []
# Find common columns based on both column names and labels
for column_name, label in meta_stu.column_names_to_labels.items():
    if column_name in meta_sch.column_names_to_labels.keys() and meta_sch.column_names_to_labels[column_name] == label:
        common_columns.append(column_name)

print("Ortak sütunlar: ", common_columns)
print("Ortak sütun sayısı: ", len(common_columns))
common_columns.remove(on_column)
print("Ortak sütunlar: ", common_columns)
print("Ortak sütun sayısı: ", len(common_columns))

selected_columns = list(meta_sch.column_names_to_labels.keys())

for common_column in common_columns:
    selected_columns.remove(common_column)

# Load the two datasets
student_data, student_meta = pyreadstat.read_sav(
    file_path_stu, encoding="UTF-8")
school_data, school_meta = pyreadstat.read_sav(
    file_path_sch, usecols=selected_columns, encoding="UTF-8")


# Merge the datasets
merged_data = student_data.merge(school_data, on=on_column, how="outer")

# Set metadata for merged_data
merged_meta = student_meta

for variable in school_meta.column_names_to_labels.keys():
    merged_meta.column_names_to_labels[variable] = school_meta.column_names_to_labels[variable]

for variable in school_meta.variable_value_labels.keys():
    merged_meta.variable_value_labels[variable] = school_meta.variable_value_labels[variable]

print(merged_meta.column_names_to_labels)

# Save the merged dataset with metadata
pyreadstat.write_sav(merged_data, output_file_path, column_labels=merged_meta.column_names_to_labels,
                     variable_value_labels=merged_meta.variable_value_labels)

df_output, meta_output = pyreadstat.read_sav(
    output_file_path, metadataonly=True, encoding="UTF-8",)


print("==================================== ENTER: CHECK THE PROCESS ====================================")
calculated_col_count = meta_stu.number_columns + \
    meta_sch.number_columns - len(common_columns) - 1
print(
    f"Hesaplanan sütun sayısı: stu_col_count + stu_col_count - common_col_count - 1 = {meta_stu.number_columns} + {meta_sch.number_columns} - {len(common_columns)} - 1 = {calculated_col_count}")
print("Çıktı verisindeki sütun sayısı: ", meta_output.number_columns)
print(
    f"{'Değişken sayısı kontrolü: İşlem başarılı. ' + str(calculated_col_count) + ' == ' + str(meta_output.number_columns) if calculated_col_count == meta_output.number_columns else 'Değişken sayısı kontrolü: İşlem başarısız. ' + str(calculated_col_count) + ' != ' + str(meta_output.number_columns)}")
print(f"Filtrelenmiş veri {output_file_path} dosyasına kaydedildi.")
print("==================================== OUT: CHECK THE PROCESS ====================================")


print("ÇIKTI VERİSİ:")
observe_metadata(meta_output)

Result:

ÖĞRENCİ ORİJİNAL VERİ:
==================================== ENTER: OBSERVE METADATA ====================================
Sütun (Değişken) adları:
['CNT', 'CNTRYID', 'CNTSCHID', 'ESCS', 'MEAN_PVMATH']
Satır sayısı:  7250
Sütun sayısı: 5
Dosya kodlama biçimi:  UTF-8
==================================== OUT: OBSERVE METADATA ====================================
OKUL ORİJİNAL VERİ:
==================================== ENTER: OBSERVE METADATA ====================================
Sütun (Değişken) adları:
['CNT', 'CNTRYID', 'CNTSCHID', 'MEAN_ESCS']
Satır sayısı:  196
Sütun sayısı: 4
Dosya kodlama biçimi:  UTF-8
==================================== OUT: OBSERVE METADATA ====================================
Ortak sütunlar:  ['CNT', 'CNTRYID', 'CNTSCHID']
Ortak sütun sayısı:  3
Ortak sütunlar:  ['CNT', 'CNTRYID']
Ortak sütun sayısı:  2
{'CNT': 'Country code 3-character', 'CNTRYID': 'Country Identifier', 'CNTSCHID': 'Intl. School ID', 'ESCS': 'Index of economic, social and cultural status', 'MEAN_PVMATH': 'Mean of the\xa0Plausible Values in Mathematics', 'MEAN_ESCS': 'Mean index of economic, social and cultural status (ESCS) of schools by Intl. School ID (CNTSCHID)'}
==================================== ENTER: CHECK THE PROCESS ====================================
Hesaplanan sütun sayısı: stu_col_count + stu_col_count - common_col_count - 1 = 5 + 4 - 2 - 1 = 6
Çıktı verisindeki sütun sayısı:  6
Değişken sayısı kontrolü: İşlem başarılı. 6 == 6
Filtrelenmiş veri CY08MSP_STU_and_SCH_QQQ_Türkiye_and_filtered_variables_merged_4.sav dosyasına kaydedildi.
==================================== OUT: CHECK THE PROCESS ====================================
ÇIKTI VERİSİ:
==================================== ENTER: OBSERVE METADATA ====================================
Sütun (Değişken) adları:
['CNT', 'CNTRYID', 'CNTSCHID', 'ESCS', 'MEAN_PVMATH', 'MEAN_ESCS']
Satır sayısı:  7250
Sütun sayısı: 6
Dosya kodlama biçimi:  UTF-8
==================================== OUT: OBSERVE METADATA ====================================

Dec 23 '23 21:12 ertugruluyar-github