SAS datasets contain not only data but also a good deal of metadata. The most commonly used pieces of this metadata are:
- Variable Name
- Variable Label
- Variable Type (num/char)
- Format
- Variable Position
- Length
In SAS, programmers typically use PROC CONTENTS to extract this dataset-level and variable-level metadata. The call looks something like this:
proc contents data = mycas.cars;
run;
If you want to read this information using Python, you have a couple of options, but here we will focus on an amazing Python library called pyreadstat. First, a quick introduction to pyreadstat.
About pyreadstat
A Python package to read SAS, SPSS and Stata files into pandas data frames. It is a wrapper around the C library ReadStat.
Let's follow a few steps to see how pyreadstat can be used to read SAS dataset metadata so that the output looks similar to the SAS PROC CONTENTS output.
Step 1.
Install pyreadstat (and pandas) on your computer, if you haven't already.
pip install pyreadstat pandas
Step 2.
Start a new program, metadata.py, and import the libraries into it.
import pyreadstat as prs
import pandas as pd
Step 3.
Call the read_sas7bdat() function from pyreadstat (imported here as prs). It reads the SAS dataset and returns two things: the data as a pandas data frame, and a metadata object.
ae001, ae001_meta = prs.read_sas7bdat("c:/ae001.sas7bdat")
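If only the metadata is needed, the rows do not have to be loaded at all. The sketch below is one possible way to do that, using the `metadataonly` parameter of `read_sas7bdat()`. The helper names `read_sas_metadata` and `dataset_summary` are made up for this illustration, and the four attributes collected are just a sample of the dataset-level information that PROC CONTENTS prints in its header.

```python
def read_sas_metadata(path):
    """Return pyreadstat's metadata object for a .sas7bdat file, skipping the rows."""
    # Imported inside the function so this sketch loads even before
    # pyreadstat has been installed.
    import pyreadstat as prs

    # metadataonly=True asks pyreadstat not to load the data values;
    # the data frame that comes back has the column names but no rows.
    _, meta = prs.read_sas7bdat(path, metadataonly=True)
    return meta


def dataset_summary(meta):
    """Collect a few dataset-level attributes (hypothetical helper)."""
    return {
        "table_name": meta.table_name,
        "rows": meta.number_rows,
        "columns": meta.number_columns,
        "encoding": meta.file_encoding,
    }
```

With the file from this step, `dataset_summary(read_sas_metadata("c:/ae001.sas7bdat"))` would return a small dictionary describing the dataset as a whole.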
Step 4.
Initialise an empty pandas dataframe and assign each piece of information as below.
# initialise an empty pandas dataframe
df_metadata = pd.DataFrame()
# read column names and labels from the metadata object returned in Step 3
df_metadata["name"] = ae001_meta.column_names
df_metadata["label"] = ae001_meta.column_labels
Step 5.
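The remaining information comes from three more attributes of the metadata object. Unlike column_names and column_labels, which are lists, these are dictionaries keyed by variable name, so look each value up by name (for example with df_metadata["name"].map(...)) rather than assigning the dictionary directly:
- format << ae001_meta.original_variable_types
- type << ae001_meta.readstat_variable_types
- length << ae001_meta.variable_storage_width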
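Putting Steps 4 and 5 together, a minimal sketch of a PROC CONTENTS-style table might look like the following. It assumes the three attributes listed above are dictionaries keyed by variable name and that `column_names` keeps the variable order of the file. The helper names `contents_rows` and `contents_frame` are hypothetical, and `demo_meta` is a made-up stand-in for the real metadata object so the example runs without a SAS file.

```python
from types import SimpleNamespace


def contents_rows(meta):
    """Build one dict per variable, roughly mirroring the PROC CONTENTS variable list."""
    rows = []
    pairs = zip(meta.column_names, meta.column_labels)
    for position, (name, label) in enumerate(pairs, start=1):
        rows.append({
            "position": position,  # assumes column_names keeps the file's variable order
            "name": name,
            "label": label,
            # The three attributes below are looked up by variable name.
            "type": meta.readstat_variable_types.get(name),
            "format": meta.original_variable_types.get(name),
            "length": meta.variable_storage_width.get(name),
        })
    return rows


def contents_frame(meta):
    """Same information as a pandas data frame."""
    import pandas as pd  # imported here so contents_rows works without pandas
    return pd.DataFrame(contents_rows(meta))


# Made-up stand-in for the metadata object; real values come from read_sas7bdat().
demo_meta = SimpleNamespace(
    column_names=["AESEQ", "AETERM"],
    column_labels=["Sequence Number", "Reported Term"],
    readstat_variable_types={"AESEQ": "double", "AETERM": "string"},
    original_variable_types={"AESEQ": "BEST", "AETERM": "$"},
    variable_storage_width={"AESEQ": 8, "AETERM": 200},
)

for row in contents_rows(demo_meta):
    print(row)
```

In a real program, `contents_frame(ae001_meta)` would take the place of the stand-in and return the table as a data frame.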
This website gives very good detail about the functions and parameters available in pyreadstat to read SAS datasets.
Hope you found this article useful. Happy coding!