This project started out as a port of Pandas' `read_sas` function. Since the first public release, several bugs have been fixed and additional features have been added, e.g. reading a subset of columns. The goal is to have a fast reader that allows greater interoperability of Julia with the SAS ecosystem.
7
7
8
-
Only `sas7bdat` format is supported, however. If anyone needs to read `xport` formatted files, please create an issue or contribute/send me a pull request.
8
+
Only the `sas7bdat` format is supported, however. If anyone needs to read `xport` files, please submit an issue. Pull requests are welcome as well.
9
9
10
10
## Installation
11
11
12
12
```
13
13
Pkg.add("SASLib")
14
14
```
15
15
16
-
## Examples
16
+
## Read Performance
17
+
18
+
I did most of the benchmarking on my MacBook Pro laptop. In general, the Julia implementation is 7x to 25x faster than the Python counterpart. Test results are documented in the `test/perf_results_<version>` folders.
19
+
20
+
## User Guide
21
+
22
+
### Basic Use Case
17
23
18
-
Use the `readsas` function to read the file. The result is a dictionary of various information about the file as well as the data itself.
24
+
Use the `readsas` function to read a SAS7BDAT file. The result is a dictionary of various information about the file as well as the data itself.
19
25
20
26
```julia
21
27
julia> using SASLib
22
28
23
29
julia> x = readsas("productsales.sas7bdat")
24
-
Read data set of size 1440 x 10 in 2.0 seconds
25
-
Dict{Symbol,Any} with 16 entries:
30
+
Read productsales.sas7bdat with size 1440 x 10 in 1.05315 seconds
If you only need to read a few columns, just pass an `include_columns` argument:
108
+
You may find it a bit annoying that the columns are mixed up: a regular Dict has no concept of order, and DataFrame just sorts the columns alphabetically. To work around that issue, you can leverage the `:column_symbols` array, which has the _natural order_ from the file:
109
+
110
+
```
111
+
julia> df = DataFrame(((c => x[:data][c]) for c in x[:column_symbols])...);
112
+
113
+
julia> head(df, 5)
114
+
5×10 DataFrames.DataFrame
115
+
│ Row │ ACTUAL │ PREDICT │ COUNTRY │ REGION │ DIVISION │ PRODTYPE │ PRODUCT │ QUARTER │ YEAR │ MONTH │
SASLib.close(handler) # remember to close the handler when done
124
167
```
125
168
126
-
## Read Performance
169
+
Note that there is currently no facility to jump ahead and read a subset of rows. SASLib always reads from the beginning of the file.
127
170
128
-
I don't have too much performance test results but initial comparison between SASLib.jl and Pandas on my Macbook Pro is encouraging. In general, the Julia implementation is somewhere between 4x to 7x faster than the Python counterpart. See the perf\_results\_* folders for test results related to the version being published.
171
+
### String Columns
172
+
173
+
By default, string columns are read into a special AbstractArray structure called ObjectPool in order to conserve memory space that might otherwise be wasted for duplicate string values. SASLib tries to be smart -- when it encounters too many unique values (> 10%) in a large array (> 2000 rows), it falls back to a regular Julia array.
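
To make the memory-saving idea concrete, here is a minimal sketch of string pooling in plain Julia. It is not SASLib's `ObjectPool` implementation: the helper names `pool_strings` and `should_pool` are hypothetical, and the fallback rule is only my reading of the thresholds quoted above (more than 10% unique values in an array of more than 2000 rows).

```julia
# Hypothetical illustration only: these helpers are NOT part of SASLib.

# Store each distinct string once and keep a small integer index per row.
# UInt16 indices limit this sketch to 65,535 distinct values.
function pool_strings(values::Vector{String})
    pool = String[]                    # unique values, in first-seen order
    lookup = Dict{String,UInt16}()     # value => position in `pool`
    indices = Vector{UInt16}(undef, length(values))
    for (i, v) in enumerate(values)
        indices[i] = get!(lookup, v) do
            push!(pool, v)             # first time we see `v`: add it to the pool
            UInt16(length(pool))
        end
    end
    return pool, indices
end

# Assumed heuristic: only large arrays with many unique values fall back
# to a regular array; everything else is pooled.
function should_pool(values::Vector{String}; max_unique_ratio = 0.10, large_rows = 2000)
    n = length(values)
    n > large_rows || return true
    return length(unique(values)) / n <= max_unique_ratio
end

# Made-up column: 1440 rows but only 3 distinct values.
column = repeat(["A", "B", "C"], 480)
pool, indices = pool_strings(column)
restored = pool[indices]               # expand the indices back into full strings
```

With only three distinct values, the pool holds three strings and each of the 1440 rows costs one `UInt16` index, which is where the savings come from.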
174
+
175
+
You can use a different array type (e.g. [CategoricalArray](https://github.com/JuliaData/CategoricalArrays.jl) or [PooledArray](https://github.com/JuliaComputing/PooledArrays.jl)) for any column by specifying a `string_array_fn` parameter when reading the file. This argument must be a Dict that maps a column symbol to a function that takes an integer argument and returns an array of that size.
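
As a rough sketch of the shape such a Dict takes, the snippet below builds one in plain Julia without calling SASLib. The column name `:COUNTRY` and the row count come from the examples in this README; passing the Dict as a `string_array_fn` keyword is my assumption based on the description above, so that call is left commented out.

```julia
# A function that takes the number of rows and returns an array of that size.
# Here it is a plain Vector{String} pre-filled with empty strings.
make_regular_array = n -> fill("", n)

# Map a column symbol to the function that allocates its storage.
string_array_fn = Dict{Symbol,Function}(:COUNTRY => make_regular_array)

# Presumably the reader calls the function with the row count of the file.
arr = string_array_fn[:COUNTRY](1440)

# Assumed usage (keyword form not confirmed here, requires SASLib and the file):
# x = readsas("productsales.sas7bdat", string_array_fn = string_array_fn)
```

Any function with this signature should fit the description, so a constructor for another array type could be wrapped the same way.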
176
+
177
+
Here's the normal case:
178
+
179
+
```
180
+
julia> x = readsas("productsales.sas7bdat", include_columns=[:COUNTRY, :REGION]);
181
+
Read productsales.sas7bdat with size 1440 x 2 in 0.00277 seconds
182
+
183
+
julia> typeof.(collect(values(x[:data])))
184
+
2-element Array{DataType,1}:
185
+
SASLib.ObjectPool{String,UInt16}
186
+
SASLib.ObjectPool{String,UInt16}
187
+
```
188
+
189
+
You can force SASLib to use a regular array like this:
190
+
191
+
```
192
+
julia> x = readsas("productsales.sas7bdat", include_columns=[:COUNTRY, :REGION],
Read productsales.sas7bdat with size 1440 x 2 in 0.05009 seconds
195
+
196
+
julia> typeof.(collect(values(x[:data])))
197
+
2-element Array{DataType,1}:
198
+
Array{String,1}
199
+
SASLib.ObjectPool{String,UInt16}
200
+
```
201
+
202
+
For convenience, `SASLib.REGULAR_STR_ARRAY` can be used instead. In addition, if you need all columns to be configured the same way, the key of the `string_array_fn` dict may simply be the symbol `:_all_`.
203
+
204
+
```
205
+
julia> x = readsas("productsales.sas7bdat", include_columns=[:COUNTRY, :REGION],
Read productsales.sas7bdat with size 1440 x 2 in 0.01005 seconds
208
+
209
+
julia> typeof.(collect(values(x[:data])))
210
+
2-element Array{DataType,1}:
211
+
Array{String,1}
212
+
Array{String,1}
213
+
```
129
214
130
215
## Why another package?
131
216
132
-
At first, I was just going to use ReadStat. However, ReadStat does not support reading files with compressed binary data. I could have chosen to contribute to that project instead but I would rather learn and code in Julia ;-) The implementation in Pandas is fairly straightforward, making it a relatively easy porting project.
217
+
At first, I was just going to use [ReadStat.jl](https://github.com/davidanthoff/ReadStat.jl), which uses the [ReadStat C-library](https://github.com/WizardMac/ReadStat). However, ReadStat does not support reading RDC-compressed binary files. I could have chosen to contribute to that project but I would rather learn and code in Julia instead ;-) The implementation in Pandas is fairly straightforward, making it a relatively easy porting project.
133
218
134
219
## Porting Notes
135
220
136
-
I chose to copy the code from Pandas and made minimal changes so I can have a working version quickly. Hence, the code isn't very Julia-friendly e.g. variable and function naming are all mixed up. It is not a priority at this point but I would think some major refactoring would be required to make it more clean & performant.
221
+
I chose to copy the code from Pandas and make minimal changes so that I could have a working version quickly. Hence, the code isn't very Julia-friendly, e.g. variable and function naming are all mixed up. It is not a priority at this point, but I think some major refactoring would be required to clean up the code.
137
222
138
223
## Credits
139
224
140
-
Many thanks to Jared Hobbs, the original author of the SAS I/O code from Python Pandas. See LICENSE_SAS7BDAT.md for license details.
225
+
- Jared Hobbs, the author of the SAS reader code from Python Pandas. See LICENSE_SAS7BDAT.md.
226
+
- [Evan Miller](https://github.com/evanmiller), the author of the ReadStat C/C++ library. See LICENSE_READSTAT.md.
227
+
- [David Anthoff](https://github.com/davidanthoff), who provided many valuable ideas at the early stage of development.
228
+
229
+
I also want to thank all the active members at the [Julia Discourse community](https://discourse.julialang.org). This project wouldn't be possible without all the help I got from the community. That's the beauty of open-source development.