SASLib.jl

The goal is to have a fast reader that allows greater interoperability of Julia with the SAS ecosystem. Only sas7bdat format is supported. SASLib is licensed under the MIT Expat license.

Notes: SASLib v0.5.0 coontains a major breaking change in the results being returned from the readsas or SASLib.read functions. Please read documentation below about how to use the new SASLib.ResultSet object.

Installation

Pkg.add("SASLib")

Read Performance

I did benchmarking mostly on my Macbook Pro laptop. In general, the Julia implementation is somewhere between 10-100x faster than the Python Pandas. Test results are documented in the test/perf_results_<version> folders.

User Guide

julia> using SASLib

Reading SAS Files

Use the readsas function to read a SAS7BDAT file.

julia> rs = readsas("productsales.sas7bdat")
Read productsales.sas7bdat with size 1440 x 10 in 0.00256 seconds
SASLib.ResultSet (1440 rows x 10 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:COUNTRY, 4:REGION, 5:DIVISION, 6:PRODTYPE, 7:PRODUCT, 8:QUARTER, 9:YEAR, 10:MONTH
1: 925.0, 850.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-01-01
2: 999.0, 297.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-02-01
3: 608.0, 846.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-03-01
4: 642.0, 533.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-04-01
5: 656.0, 646.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-05-01
⋮

Accessing Results

There are several ways to access the data conveniently without using any third party packages. Each cell value may be retrieved directly via the regular [i,j] index. Accessing an entire row or column returns a tuple and a vector respectively.

julia> rs[4,2]
533.0

julia> rs[4, :PREDICT]
533.0

julia> rs[1]
(925.0, 850.0, "CANADA", "EAST", "EDUCATION", "FURNITURE", "SOFA", 1.0, 1993.0, 1993-01-01)

julia> rs[:ACTUAL]
1440-element Array{Float64,1}:
 925.0
 999.0
 608.0
 ⋮

You may access a portion of the result set by indexing with unit range and column symbols.

julia> rs[2:4]
SASLib.ResultSet (3 rows x 10 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:COUNTRY, 4:REGION, 5:DIVISION, 6:PRODTYPE, 7:PRODUCT, 8:QUARTER, 9:YEAR, 10:MONTH
1: 999.0, 297.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-02-01
2: 608.0, 846.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-03-01
3: 642.0, 533.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-04-01


julia> rs[:ACTUAL, :PREDICT, :YEAR, :MONTH]
SASLib.ResultSet (1440 rows x 4 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:YEAR, 4:MONTH
1: 925.0, 850.0, 1993.0, 1993-01-01
2: 999.0, 297.0, 1993.0, 1993-02-01
3: 608.0, 846.0, 1993.0, 1993-03-01
4: 642.0, 533.0, 1993.0, 1993-04-01
5: 656.0, 646.0, 1993.0, 1993-05-01
⋮

Assignments

You may assign values at the cell level. Doing so has a side effect.

julia> srs = rs[:ACTUAL, :PREDICT, :YEAR, :MONTH][1:2]
SASLib.ResultSet (2 rows x 4 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:YEAR, 4:MONTH
1: 925.0, 850.0, 1993.0, 1993-01-01
2: 999.0, 297.0, 1993.0, 1993-02-01

julia> srs[2,2] = 3
3

julia> rs[1:2]
SASLib.ResultSet (2 rows x 10 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:COUNTRY, 4:REGION, 5:DIVISION, 6:PRODTYPE, 7:PRODUCT, 8:QUARTER, 9:YEAR, 10:MONTH
1: 925.0, 850.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-01-01
2: 999.0, 3.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-02-01

Iteration

ResultSet implements the usual Base.iteration interface, so it's easy to walk through the results:

julia> for r in rs
         r[1] < 10 && println(r)
       end
(5.0, 425.0, "CANADA", "EAST", "CONSUMER", "FURNITURE", "SOFA", 1.0, 1993.0, 1993-01-01)
(8.0, 957.0, "CANADA", "EAST", "CONSUMER", "FURNITURE", "SOFA", 3.0, 1993.0, 1993-07-01)
(6.0, 214.0, "CANADA", "WEST", "EDUCATION", "OFFICE", "TABLE", 3.0, 1993.0, 1993-08-01)
(9.0, 814.0, "CANADA", "WEST", "EDUCATION", "OFFICE", "TABLE", 4.0, 1993.0, 1993-11-01)
(3.0, 405.0, "CANADA", "WEST", "CONSUMER", "FURNITURE", "SOFA", 2.0, 1994.0, 1994-05-01)
(3.0, 938.0, "GERMANY", "WEST", "EDUCATION", "FURNITURE", "BED", 1.0, 1993.0, 1993-03-01)
(4.0, 975.0, "U.S.A.", "EAST", "EDUCATION", "FURNITURE", "SOFA", 1.0, 1993.0, 1993-01-01)
(9.0, 134.0, "U.S.A.", "EAST", "EDUCATION", "FURNITURE", "BED", 4.0, 1993.0, 1993-12-01)
(6.0, 915.0, "U.S.A.", "WEST", "EDUCATION", "OFFICE", "DESK", 2.0, 1993.0, 1993-04-01)

ResultSet Metadata

There are simple functions to retrieve meta information about a ResultSet.

names(rs)
columns(rs)
size(rs)

Conversion to DataFrame

It may be beneficial to convert the result set to DataFrame for more complex queries and manipulations.

julia> df = DataFrame(columns(rs), names(rs));

julia> head(df)
6×10 DataFrames.DataFrame
│ Row │ ACTUAL │ PREDICT │ COUNTRY │ REGION │ DIVISION  │ PRODTYPE  │ PRODUCT │ QUARTER │ YEAR   │ MONTH      │
├─────┼────────┼─────────┼─────────┼────────┼───────────┼───────────┼─────────┼─────────┼────────┼────────────┤
│ 1   │ 925.0  │ 850.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 1.0     │ 1993.0 │ 1993-01-01 │
│ 2   │ 999.0  │ 297.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 1.0     │ 1993.0 │ 1993-02-01 │
│ 3   │ 608.0  │ 846.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 1.0     │ 1993.0 │ 1993-03-01 │
│ 4   │ 642.0  │ 533.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 2.0     │ 1993.0 │ 1993-04-01 │
│ 5   │ 656.0  │ 646.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 2.0     │ 1993.0 │ 1993-05-01 │
│ 6   │ 948.0  │ 486.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 2.0     │ 1993.0 │ 1993-06-01 │

Inclusion/Exclusion of Columns

Column Inclusion

It is always faster to read only the columns that you need. The include_columns argument comes in handy:

julia> rs = readsas("productsales.sas7bdat", include_columns=[:YEAR, :MONTH, :PRODUCT, :ACTUAL])
Read productsales.sas7bdat with size 1440 x 4 in 0.00151 seconds
SASLib.ResultSet (1440 rows x 4 columns)
Columns 1:ACTUAL, 2:PRODUCT, 3:YEAR, 4:MONTH
1: 925.0, SOFA, 1993.0, 1993-01-01
2: 999.0, SOFA, 1993.0, 1993-02-01
3: 608.0, SOFA, 1993.0, 1993-03-01
4: 642.0, SOFA, 1993.0, 1993-04-01
5: 656.0, SOFA, 1993.0, 1993-05-01
⋮

Column Exclusion

Likewise, you can read all columns except the ones you don't want as specified in exclude_columns argument:

julia> rs = readsas("productsales.sas7bdat", exclude_columns=[:YEAR, :MONTH, :PRODUCT, :ACTUAL])
Read productsales.sas7bdat with size 1440 x 6 in 0.00265 seconds
SASLib.ResultSet (1440 rows x 6 columns)
Columns 1:PREDICT, 2:COUNTRY, 3:REGION, 4:DIVISION, 5:PRODTYPE, 6:QUARTER
1: 850.0, CANADA, EAST, EDUCATION, FURNITURE, 1.0
2: 297.0, CANADA, EAST, EDUCATION, FURNITURE, 1.0
3: 846.0, CANADA, EAST, EDUCATION, FURNITURE, 1.0
4: 533.0, CANADA, EAST, EDUCATION, FURNITURE, 2.0
5: 646.0, CANADA, EAST, EDUCATION, FURNITURE, 2.0
⋮

Case Sensitivity and Column Number

Column symbols are matched in a case insensitive manner with SAS column names.

Both include_columns and exclude_columns accept column number. In fact, you can mixed column symbols and column numbers as such:

julia> readsas("productsales.sas7bdat", include_columns=[:actual, :predict, 8, 9, 10])
Read productsales.sas7bdat with size 1440 x 5 in 0.16378 seconds
SASLib.ResultSet (1440 rows x 5 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:QUARTER, 4:YEAR, 5:MONTH
1: 925.0, 850.0, 1.0, 1993.0, 1993-01-01
2: 999.0, 297.0, 1.0, 1993.0, 1993-02-01
3: 608.0, 846.0, 1.0, 1993.0, 1993-03-01
4: 642.0, 533.0, 2.0, 1993.0, 1993-04-01
5: 656.0, 646.0, 2.0, 1993.0, 1993-05-01
⋮

Incremental Reading

If you need to read files incrementally, you can use the SASLib.open function to obtain a handle of the file. Then, use the SASLib.read function to fetch a number of rows. Remember to close the handler with SASLib.close to avoid memory leak.

julia> handler = SASLib.open("productsales.sas7bdat")
SASLib.Handler[productsales.sas7bdat]

julia> rs = SASLib.read(handler, 2)
Read productsales.sas7bdat with size 2 x 10 in 0.06831 seconds
SASLib.ResultSet (2 rows x 10 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:COUNTRY, 4:REGION, 5:DIVISION, 6:PRODTYPE, 7:PRODUCT, 8:QUARTER, 9:YEAR, 10:MONTH
1: 925.0, 850.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-01-01
2: 999.0, 297.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-02-01

julia> rs = SASLib.read(handler, 3)
Read productsales.sas7bdat with size 3 x 10 in 0.00046 seconds
SASLib.ResultSet (3 rows x 10 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:COUNTRY, 4:REGION, 5:DIVISION, 6:PRODTYPE, 7:PRODUCT, 8:QUARTER, 9:YEAR, 10:MONTH
1: 608.0, 846.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-03-01
2: 642.0, 533.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-04-01
3: 656.0, 646.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-05-01

julia> SASLib.close(handler)

Note that there is no facility at the moment to jump and read a subset of rows. Currently, SASLib always read from the beginning.

String Column Constructor

By default, string columns are read into a special AbstractArray structure called ObjectPool in order to conserve memory space that might otherwise be wasted for duplicate string values. SASLib tries to be smart -- when it encounters too many unique values (> 10%) in a large array (> 2000 rows), it falls back to a regular Julia array.

You can use a different array type (e.g. CategoricalArray or PooledArray) for any columns as you wish by specifying a string_array_fn parameter when reading the file. This argument must be a Dict that maps a column symbol into a function that takes an integer argument and returns any array of that size.

Here's the normal case:

julia> rs = readsas("productsales.sas7bdat", include_columns=[:COUNTRY, :REGION]);
Read productsales.sas7bdat with size 1440 x 2 in 0.00193 seconds

julia> typeof.(columns(rs))
2-element Array{DataType,1}:
 SASLib.ObjectPool{String,UInt16}
 SASLib.ObjectPool{String,UInt16}

If you really want a regular String array, you can force SASLib to do so as such:

julia> rs = readsas("productsales.sas7bdat", include_columns=[:COUNTRY, :REGION],
                    string_array_fn=Dict(:COUNTRY => (n)->fill("",n)));
Read productsales.sas7bdat with size 1440 x 2 in 0.00333 seconds

julia> typeof.(columns(rs))
2-element Array{DataType,1}:
 Array{String,1}                 
 SASLib.ObjectPool{String,UInt16}

For convenience, SASLib.REGULAR_STR_ARRAY is a function that does exactly that. In addition, if you need all columns to be configured the same then the key of the string_array_fn dict may be just the symbol :_all_.

julia> rs = readsas("productsales.sas7bdat", include_columns=[:COUNTRY, :REGION],
                    string_array_fn=Dict(:_all_ => REGULAR_STR_ARRAY));
Read productsales.sas7bdat with size 1440 x 2 in 0.00063 seconds

julia> typeof.(columns(rs))
2-element Array{DataType,1}:
 Array{String,1}
 Array{String,1}

Numeric Columns Constructor

In general, SASLib allocates native arrays when returning numerical column data. However, you can provide a custom constructor so you would be able to either preallcoate the array or construct a different type of array. The number_array_fn parameter is a Dict that maps column symbols to the custom constructors. Similar to string_array_fn, this Dict may be specified with a special symbol :_all_ to indicate such constructor be used for all numeric columns.

Example - create SharedArray:

julia> rs = readsas("productsales.sas7bdat", include_columns=[:ACTUAL,:PREDICT], 
                          number_array_fn=Dict(:ACTUAL => (n)->SharedArray{Float64}(n)));
Read productsales.sas7bdat with size 1440 x 2 in 0.00385 seconds

julia> typeof.(columns(rs))
2-element Array{DataType,1}:
 SharedArray{Float64,1}
 Array{Float64,1}

Example - preallocate arrays:

julia> A = zeros(1440, 2);

julia> f1(n) = @view A[:, 1];

julia> f2(n) = @view A[:, 2];

julia> readsas("productsales.sas7bdat", include_columns=[:ACTUAL,:PREDICT], 
               number_array_fn=Dict(:ACTUAL => f1, :PREDICT => f2));
Read productsales.sas7bdat with size 1440 x 2 in 0.00041 seconds

julia> A[1:5,:]
5×2 Array{Float64,2}:
 925.0  850.0
 999.0  297.0
 608.0  846.0
 642.0  533.0
 656.0  646.0

File Metadata

You may obtain meta data for a SAS data file using the metadata function.

julia> md = metadata("productsales.sas7bdat")
File: productsales.sas7bdat (1440 x 10)
1:ACTUAL(Float64)                5:DIVISION(String)               9:YEAR(Float64)
2:PREDICT(Float64)               6:PRODTYPE(String)               10:MONTH(Date/Missings.Missing)
3:COUNTRY(String)                7:PRODUCT(String) 
4:REGION(String)                 8:QUARTER(Float64)

It's OK to access the fields directly.

julia> fieldnames(SASLib.Metadata)
9-element Array{Symbol,1}:
 :filename   
 :encoding   
 :endianness 
 :compression
 :pagesize   
 :npages     
 :nrows      
 :ncols      
 :columnsinfo

julia> md = metadata("test3.sas7bdat");

julia> md.compression
:RDC

Related Packages

ReadStat.jl uses the ReadStat C-library. However, ReadStat-C does not support reading RDC-compressed binary files.

StatFiles.jl is a higher-level package built on top of ReadStat.jl and implements the FileIO interface.

Python Pandas package has an implementation of SAS file reader that SASLib borrows heavily from.

Credits

Jared Hobbs, the author of the SAS reader code from Pandas. See LICENSE_SAS7BDAT.md.
Evan Miller, the author of ReadStat C/C++ library. See LICENSE_READSTAT.md.
David Anthoff, who provided many valuable ideas at the early stage of development.
Tyler Beason
susabi

I also want to thank all the active members at the Julia Discourse community. This project wouldn't be possible without all the help I got from the community. That's the beauty of open-source development.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

SASLib.jl

Installation

Read Performance

User Guide

Reading SAS Files

Accessing Results

Assignments

Iteration

ResultSet Metadata

Conversion to DataFrame

Inclusion/Exclusion of Columns

Incremental Reading

String Column Constructor

Numeric Columns Constructor

File Metadata

Related Packages

Credits

Files

README.md

Latest commit

History

README.md

File metadata and controls

SASLib.jl

Installation

Read Performance

User Guide

Reading SAS Files

Accessing Results

Assignments

Iteration

ResultSet Metadata

Conversion to DataFrame

Inclusion/Exclusion of Columns

Incremental Reading

String Column Constructor

Numeric Columns Constructor

File Metadata

Related Packages

Credits