Scala and SAS datasets
For anyone who has ever worked with SAS, the parso library for Java quickly becomes a godsend. It lets you import compressed SAS datasets into whatever other storage you like, freeing you from the confines of SAS. The downside of parso is that it is single-threaded. While working with it I quickly noticed its performance limits when reading really large SAS datasets. On a file with around 100 attributes (columns) it topped out at around 50k rows per second. Reading 100 million records took around 30 minutes, which rules it out as a viable option for big data applications.
In my GitHub project I try to rectify this by rewriting the parso library in Scala, making it natively thread-safe. It should then be possible to read the SAS metadata and the rows in parallel, greatly increasing performance.
Once the library is thread-safe, the next step is to parallelize it.
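To give a rough idea of what parallel reading could look like, here is a minimal sketch using only the Scala standard library. It is not the project's actual code: `Chunk`, `decodeRow`, `decodeChunk` and `decodeAll` are hypothetical names, and the comma-separated strings merely stand in for raw SAS row bytes. The assumption is that the file can be split into independent chunks and that decoding a row touches no shared mutable state.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelRowsSketch {
  // Hypothetical stand-in for a block of raw data holding several rows.
  final case class Chunk(index: Int, rawRows: Vector[String])

  // Hypothetical pure decoder: it uses no shared mutable state,
  // so it is safe to call from many threads at once.
  def decodeRow(raw: String): Vector[String] = raw.split(',').toVector

  def decodeChunk(chunk: Chunk): Vector[Vector[String]] =
    chunk.rawRows.map(decodeRow)

  // Decode every chunk in its own Future. Future.sequence keeps the
  // results in input order, so rows come back in file order.
  def decodeAll(chunks: Vector[Chunk]): Vector[Vector[String]] = {
    val decoded = Future.sequence(chunks.map(c => Future(decodeChunk(c))))
    Await.result(decoded, 1.minute).flatten
  }
}
```

Because the decoder is a pure function, the only coordination needed is collecting the results, which is the property the thread-safe rewrite is meant to provide.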
The last step is to implement the functional stream pattern from the cats library.
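The idea behind a streaming interface is that rows are pulled lazily and transformed in a pipeline instead of being loaded all at once. The sketch below illustrates that idea with the standard library `Iterator` as a stand-in. It does not use the cats ecosystem API, and `rows` and `firstColumnBatches` are hypothetical names invented for this example.

```scala
object LazyRowStreamSketch {
  // Hypothetical row source: yields rows one at a time
  // instead of materialising the whole dataset in memory.
  def rows(total: Int): Iterator[Vector[String]] =
    Iterator.range(0, total).map(i => Vector(i.toString, s"name$i"))

  // Transformations are composed lazily; no row is produced
  // until the resulting iterator is consumed.
  def firstColumnBatches(
      source: Iterator[Vector[String]],
      batchSize: Int
  ): Iterator[Seq[String]] =
    source.map(_.head).grouped(batchSize)
}
```

A consumer could then write each batch to its target storage as it arrives, keeping memory use bounded by the batch size.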