Scala and SAS datasets

Tomasz Kaszuba edited this page Oct 30, 2018 · 3 revisions

Background

For anyone who has ever worked with SAS, the parso library for Java quickly becomes a godsend. It lets you import compressed SAS datasets into whatever other storage you like, freeing you from the confines of SAS. The downside of parso is that it is single-threaded. While working with it I quickly noticed its performance limits when reading really large SAS datasets. On a file with around 100 attributes (columns) it topped out at around 50k rows per second. Reading 100 million records took around 30 minutes, which rules it out as a viable option for big data applications.

Objective

Port it

In my GitHub project I try to rectify this by rewriting the parso library in Scala, making it natively thread-safe. It should then be possible to read the SAS metadata and the rows in parallel, greatly increasing performance.

Parallelize it

Once the library is thread-safe, the next step is to parallelize it.
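
To make the idea concrete, here is a minimal sketch of what parallel decoding could look like once the decoder holds no shared mutable state. It is not the project's actual code: `Page`, `Row`, `decodePage` and `decodeAll` are hypothetical names, and the sketch assumes the file can be split into blocks ("pages") that decode independently of each other. The "decoding" itself is a placeholder that turns each byte into a row.

```scala
import scala.concurrent.{ExecutionContext, Future}

object ParallelDecodeSketch {
  // Hypothetical stand-ins: a raw block of the file and a decoded row.
  final case class Page(index: Int, bytes: Array[Byte])
  final case class Row(pageIndex: Int, value: Int)

  // Placeholder decoder (assumption): a pure function of its input.
  // It touches no shared mutable state, so calling it from many threads is safe.
  def decodePage(page: Page): Vector[Row] =
    page.bytes.toVector.map(b => Row(page.index, b.toInt))

  // Decode every page in its own Future. Future.traverse returns the results
  // in the same order as the input pages, so row order is preserved.
  def decodeAll(pages: Vector[Page])(implicit ec: ExecutionContext): Future[Vector[Row]] =
    Future.traverse(pages)(page => Future(decodePage(page))).map(_.flatten)
}
```

With an implicit `ExecutionContext` in scope, `decodeAll(pages)` yields a `Future` holding all rows in file order. Whether real SAS pages can be decoded this independently, especially for compressed datasets, is exactly what the port has to establish.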

Monad it

The last step is to implement the functional stream pattern using the cats library.
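
As a rough illustration of the stream idea, the sketch below exposes rows as a lazy, composable sequence. It deliberately uses the standard library's `LazyList` (Scala 2.13+) as a stand-in so that it runs without dependencies; it does not use any cats API, and a cats-based version would look different. `Row`, `rows` and the fixed-width record layout are assumptions made up for the example.

```scala
object RowStreamSketch {
  // Hypothetical decoded row: one Int per byte of a fixed-width record.
  final case class Row(values: Vector[Int])

  // Build a lazy stream of rows from a byte buffer of fixed-width records.
  // The only state is the current offset; nothing is decoded until a consumer
  // asks for the next element. A trailing partial record is ignored.
  def rows(data: Array[Byte], rowLength: Int): LazyList[Row] = {
    require(rowLength > 0, "rowLength must be positive")
    LazyList.unfold(0) { offset =>
      if (offset + rowLength > data.length) None
      else {
        val record = data.slice(offset, offset + rowLength).toVector.map(_.toInt)
        Some((Row(record), offset + rowLength))
      }
    }
  }
}
```

Because the stream is lazy, something like `rows(data, 2).filter(_.values.sum > 5).take(10)` only decodes as many records as the consumer actually pulls, which is the property that matters for files too large to hold in memory.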