
Commit 9208234

Merge pull request #16 from tk3369/develop
Bug Fixes/Performance Improvements Bundle

2 parents a865410 + 68cbba4


51 files changed (+1358 -553)

.gitignore (+2 -1)

@@ -2,4 +2,5 @@
 *.jl.*.cov
 *.jl.mem
 **/.ipynb_checkpoints/*
-**/*.swp
+**/*.swp
+**/*.log

LICENSE_READSTAT.md (+19)

@@ -0,0 +1,19 @@
+Copyright (c) 2013-2016 Evan Miller (except where otherwise noted)
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.

README.md (+117 -28)
@@ -3,46 +3,51 @@
 [![Build Status](https://travis-ci.org/tk3369/SASLib.jl.svg?branch=master)](https://travis-ci.org/tk3369/SASLib.jl)
 [![codecov.io](http://codecov.io/github/tk3369/SASLib.jl/coverage.svg?branch=master)](http://codecov.io/github/tk3369/SASLib.jl?branch=master)
 
-This is a port of Pandas' read_sas function.
+This project started out as a port of Pandas' read_sas function. Since the first public release, several bugs have been fixed and additional features have been added, e.g. reading a subset of columns. The goal is to have a fast reader that allows greater interoperability of Julia with the SAS ecosystem.
 
-Only `sas7bdat` format is supported, however. If anyone needs to read `xport` formatted files, please create an issue or contribute/send me a pull request.
+Only the `sas7bdat` format is supported. If anyone needs to read `xport` files, please submit an issue. Pull requests are welcome as well.
 
 ## Installation
 
 ```
 Pkg.add("SASLib")
 ```
 
-## Examples
+## Read Performance
+
+I did benchmarking mostly on my MacBook Pro laptop. In general, the Julia implementation is somewhere between 7-25x faster than the Python counterpart. Test results are documented in the `test/perf_results_<version>` folders.
+
+## User Guide
+
+### Basic Use Case
 
-Use the `readsas` function to read the file. The result is a dictionary of various information about the file as well as the data itself.
+Use the `readsas` function to read a SAS7BDAT file. The result is a dictionary of various information about the file as well as the data itself.
 
 ```julia
 julia> using SASLib
 
 julia> x = readsas("productsales.sas7bdat")
-Read data set of size 1440 x 10 in 2.0 seconds
-Dict{Symbol,Any} with 16 entries:
+Read productsales.sas7bdat with size 1440 x 10 in 1.05315 seconds
+Dict{Symbol,Any} with 17 entries:
 :filename => "productsales.sas7bdat"
 :page_length => 8192
 :file_encoding => "US-ASCII"
 :system_endianness => :LittleEndian
 :ncols => 10
-:column_types => Type[Float64, Float64, Union{AbstractString, Missings.Missing}, Union{AbstractString, Missings.Missing}, Union{AbstractString,
-:data => Dict{Any,Any}(Pair{Any,Any}(:QUARTER, [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0 … 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0,
-:perf_type_conversion => 0.0262305
+:column_types => Type[Float64, Float64, String, String, String, String, String, Float64, Float64, Union{Date, Missings.Missing}]
+:column_info => Tuple{Int64,Symbol,Symbol,Type,DataType}[(1, :ACTUAL, :Number, Float64, Array{Float64,1}), (2, :PREDICT, :Number, Float64, A
+:data => Dict{Any,Any}(Pair{Any,Any}(:QUARTER, [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0 … 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.
+:perf_type_conversion => 0.0399293
 :page_count => 18
-:column_names => String["QUARTER", "YEAR", "COUNTRY", "DIVISION", "REGION", "MONTH", "PREDICT", "ACTUAL", "PRODTYPE", "PRODUCT"]
-:column_symbols => Symbol[:QUARTER, :YEAR, :COUNTRY, :DIVISION, :REGION, :MONTH, :PREDICT, :ACTUAL, :PRODTYPE, :PRODUCT]
+:column_names => String["ACTUAL", "PREDICT", "COUNTRY", "REGION", "DIVISION", "PRODTYPE", "PRODUCT", "QUARTER", "YEAR", "MONTH"]
+:column_symbols => Symbol[:ACTUAL, :PREDICT, :COUNTRY, :REGION, :DIVISION, :PRODTYPE, :PRODUCT, :QUARTER, :YEAR, :MONTH]
 :column_lengths => [8, 8, 10, 10, 10, 10, 10, 8, 8, 8]
 :file_endianness => :LittleEndian
 :nrows => 1440
-:perf_read_data => 0.00639309
+:perf_read_data => 0.035717
 :column_offsets => [0, 8, 40, 50, 60, 70, 80, 16, 24, 32]
 ```
 
-Number of columns and rows are returned as in `:ncols` and `:nrows` respectively.
-
 The data, referenced by the `:data` key, is represented as a Dict object with the column symbol as the key.
 
 ```julia
@@ -55,16 +60,34 @@ julia> x[:data][:ACTUAL]
 656.0
 948.0
 612.0
-114.0
-685.0
-657.0
-608.0
-353.0
-107.0
 
+
 ```
 
-If you really like DataFrame, you can easily convert as such:
+Additional metadata are available as follows:
+
+Key |Type |Description
+-----------------|---------------|-------------------------------
+:nrows | Int | Number of rows in the result
+:ncols | Int | Number of columns in the result
+:filename | String | Filename for which data was read
+:file_encoding | String | Character encoding used in the file
+:file_endianness | Symbol | Either :LittleEndian or :BigEndian
+:column_symbols | Array{Symbol} | Column symbols
+:column_names | Array{String} | Column names
+:column_types | Array{Type} | Column types e.g. Float64, String
+:column_info | Array{Tuple} | Tuple (column#, symbol, Num/Str, eltype, array type)
+:column_lengths | Array{Int} | Column lengths as in the SAS file format
+:column_offsets | Array{Int} | Column offsets as in the SAS file format
+:page_length | Int | Page length as in the SAS file format
+:page_count | Int | Number of pages as in the SAS file format
+:perf\_read\_data | Float | Performance stat: seconds used to read data into memory
+:perf\_type\_conversion | Float | Performance stat: seconds used to convert data to proper types e.g. Date/DateTime
+:system_endianness | Symbol | Either :LittleEndian or :BigEndian
+
+### Conversion to DataFrame
+
+Since the data is just a Dict of array columns, it's easy to convert into a DataFrame:
 
 ```julia
 julia> using DataFrames
@@ -82,7 +105,25 @@ julia> head(df, 5)
 │ 5   │ 656.0  │ CANADA  │ EDUCATION │ 1993-05-01 │ 646.0   │ FURNITURE │ SOFA    │ 2.0     │ EAST   │ 1993.0 │
 ```
 
-If you only need to read few columns, just pass an `include_columns` argument:
+You may find the mixed-up column order a bit annoying, since a regular Dict does not have any concept of order and DataFrame just sorts the columns alphabetically. To work around that issue, you can leverage the `:column_symbols` array, which has the _natural order_ from the file:
+
+```
+julia> df = DataFrame(((c => x[:data][c]) for c in x[:column_symbols])...);
+
+julia> head(df,5)
+5×10 DataFrames.DataFrame
+│ Row │ ACTUAL │ PREDICT │ COUNTRY │ REGION │ DIVISION  │ PRODTYPE  │ PRODUCT │ QUARTER │ YEAR   │ MONTH      │
+├─────┼────────┼─────────┼─────────┼────────┼───────────┼───────────┼─────────┼─────────┼────────┼────────────┤
+│ 1   │ 925.0  │ 850.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 1.0     │ 1993.0 │ 1993-01-01 │
+│ 2   │ 999.0  │ 297.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 1.0     │ 1993.0 │ 1993-02-01 │
+│ 3   │ 608.0  │ 846.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 1.0     │ 1993.0 │ 1993-03-01 │
+│ 4   │ 642.0  │ 533.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 2.0     │ 1993.0 │ 1993-04-01 │
+│ 5   │ 656.0  │ 646.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 2.0     │ 1993.0 │ 1993-05-01 │
+```
+
+### Inclusion/Exclusion of Columns
+
+It is always faster to read only the columns that you need. The `include_columns` argument comes in handy:
 
 ```
 julia> head(DataFrame(readsas("productsales.sas7bdat", include_columns=[:YEAR, :MONTH, :PRODUCT, :ACTUAL])[:data]))
@@ -114,7 +155,9 @@ Read data set of size 1440 x 6 in 0.031 seconds
 │ 6   │ CANADA  │ EDUCATION │ 486.0  │ FURNITURE │ 2.0     │ EAST   │
 ```
 
-If you need to read files incrementally:
+### Incremental Reading
+
+If you need to read files incrementally, you can do so as such:
 
 ```julia
 handler = SASLib.open("productsales.sas7bdat")
@@ -123,18 +166,64 @@ results = SASLib.read(handler, 4) # read next 4 rows
 SASLib.close(handler) # remember to close the handler when done
 ```
 
-## Read Performance
+Note that there is no facility at the moment to jump and read a subset of rows. Currently, SASLib always reads from the beginning.
 
-I don't have too much performance test results but initial comparison between SASLib.jl and Pandas on my Macbook Pro is encouraging. In general, the Julia implementation is somewhere between 4x to 7x faster than the Python counterpart. See the perf\_results\_* folders for test results related to the version being published.
+### String Columns
+
+By default, string columns are read into a special AbstractArray structure called ObjectPool in order to conserve memory space that might otherwise be wasted on duplicate string values. SASLib tries to be smart -- when it encounters too many unique values (> 10%) in a large array (> 2000 rows), it falls back to a regular Julia array.
+
+You can use a different array type (e.g. [CategoricalArray](https://github.com/JuliaData/CategoricalArrays.jl) or [PooledArray](https://github.com/JuliaComputing/PooledArrays.jl)) for any column as you wish by specifying a `string_array_fn` parameter when reading the file. This argument must be a Dict that maps a column symbol to a function that takes an integer argument and returns an array of that size.
+
+Here's the normal case:
+
+```
+julia> x = readsas("productsales.sas7bdat", include_columns=[:COUNTRY, :REGION]);
+Read productsales.sas7bdat with size 1440 x 2 in 0.00277 seconds
+
+julia> typeof.(collect(values(x[:data])))
+2-element Array{DataType,1}:
+ SASLib.ObjectPool{String,UInt16}
+ SASLib.ObjectPool{String,UInt16}
+```
+
+Now, you can force SASLib to use a regular array as such:
+
+```
+julia> x = readsas("productsales.sas7bdat", include_columns=[:COUNTRY, :REGION],
+                   string_array_fn=Dict(:COUNTRY => (n)->fill("",n)));
+Read productsales.sas7bdat with size 1440 x 2 in 0.05009 seconds
+
+julia> typeof.(collect(values(x[:data])))
+2-element Array{DataType,1}:
+ Array{String,1}
+ SASLib.ObjectPool{String,UInt16}
+```
+
+For convenience, `SASLib.REGULAR_STR_ARRAY` can be used instead. In addition, if you need all columns to be configured that way, the key of the `string_array_fn` dict may simply be the symbol `:_all_`:
+
+```
+julia> x = readsas("productsales.sas7bdat", include_columns=[:COUNTRY, :REGION],
+                   string_array_fn=Dict(:_all_ => REGULAR_STR_ARRAY));
+Read productsales.sas7bdat with size 1440 x 2 in 0.01005 seconds
+
+julia> typeof.(collect(values(x[:data])))
+2-element Array{DataType,1}:
+ Array{String,1}
+ Array{String,1}
+```
 
 ## Why another package?
 
-At first, I was just going to use ReadStat. However, ReadStat does not support reading files with compressed binary data. I could have chosen to contribute to that project instead but I would rather learn and code in Julia ;-) The implementation in Pandas is fairly straightforward, making it a relatively easy porting project.
+At first, I was just going to use [ReadStat.jl](https://github.com/davidanthoff/ReadStat.jl), which uses the [ReadStat C library](https://github.com/WizardMac/ReadStat). However, ReadStat does not support reading RDC-compressed binary files. I could have chosen to contribute to that project, but I would rather learn and code in Julia instead ;-) The implementation in Pandas is fairly straightforward, making it a relatively easy porting project.
 
 ## Porting Notes
 
-I chose to copy the code from Pandas and made minimal changes so I can have a working version quickly. Hence, the code isn't very Julia-friendly e.g. variable and function naming are all mixed up. It is not a priority at this point but I would think some major refactoring would be required to make it more clean & performant.
+I chose to copy the code from Pandas and make minimal changes so I could have a working version quickly. Hence, the code isn't very Julia-friendly, e.g. variable and function naming conventions are all mixed up. It is not a priority at this point, but some major refactoring will likely be required to clean up the code.
 
 ## Credits
 
-Many thanks to Jared Hobbs, the original author of the SAS I/O code from Python Pandas. See LICENSE_SAS7BDAT.md for license details.
+- Jared Hobbs, the author of the SAS reader code from Python Pandas. See LICENSE_SAS7BDAT.md.
+- [Evan Miller](https://github.com/evanmiller), the author of the ReadStat C/C++ library. See LICENSE_READSTAT.md.
+- [David Anthoff](https://github.com/davidanthoff), who provided many valuable ideas at the early stage of development.
+
+I also want to thank all the active members of the [Julia Discourse community](https://discourse.julialang.org). This project wouldn't be possible without all the help I got from the community. That's the beauty of open-source development.
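The fallback heuristic described under String Columns (pool a column only when its unique-value ratio stays low) can be sketched outside Julia. The helper below is a hypothetical Python illustration of the stated > 10% unique / > 2000 rows thresholds, not SASLib's actual code; the function name and parameters are assumptions for this sketch:

```python
def use_object_pool(values, max_unique_ratio=0.10, min_rows=2000):
    """Decide whether pooling is worthwhile for a string column.

    Small columns are pooled unconditionally; for large columns, pooling
    is used only when the share of unique values is at or below the
    ratio threshold (otherwise the index overhead buys nothing).
    """
    n = len(values)
    if n <= min_rows:
        return True
    unique_ratio = len(set(values)) / n
    return unique_ratio <= max_unique_ratio

# A large, highly repetitive column pools well...
assert use_object_pool(["CANADA"] * 5000)
# ...while a large column of mostly distinct values does not.
assert not use_object_pool([str(i) for i in range(5000)])
```

The design intuition: pooling trades a per-assignment dictionary lookup for deduplicated storage, which only pays off when values repeat heavily.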

src/ObjectPool.jl (+91)

@@ -0,0 +1,91 @@
+"""
+ObjectPool is a fixed-size one-dimensional array that does not store
+any duplicate copies of the same object. So the benefit is space-efficiency.
+The tradeoff is the time used to maintain the index.
+This is useful for denormalized data frames where string values
+may be repeated many times.
+
+An ObjectPool must be initialized with a default value and a fixed
+array size. If your requirement does not fit such assumptions,
+you may want to look into using the `PooledArrays` or
+`CategoricalArrays` package instead.
+
+The implementation is very primitive and is tailored for applications
+that know exactly how much memory to allocate.
+"""
+mutable struct ObjectPool{T, S <: Unsigned} <: AbstractArray{T, 1}
+    pool::Array{T}              # maintains the pool of unique things
+    idx::Array{S}               # index references into `pool`
+    indexcache::Dict{T, S}      # dict for fast lookups (K=object, V=index)
+    uniqueitemscount::Int64     # how many items in `pool`; always starts with 1
+    itemscount::Int64           # how many items perceived in this array
+    capacity::Int64             # max number of items in the pool
+end
+
+# Initially, there is only one item in the pool and the `idx` array has
+# elements all pointing to that one default value. The dictionary `indexcache`
+# also has one item that points to that one value. Hence `uniqueitemscount`
+# would be 1 and `itemscount` would be `n`.
+function ObjectPool{T, S}(val::T, n::Integer) where {T, S <: Unsigned}
+    # Note: 64-bit case is constrained by the Int64 type (for convenience)
+    maxsize = ifelse(S == UInt8,  2 << 7 - 1,
+              ifelse(S == UInt16, 2 << 15 - 1,
+              ifelse(S == UInt32, 2 << 31 - 1,
+                                  2 << 62 - 1)))
+    ObjectPool{T, S}([val], fill(1, n), Dict(val => 1), 1, n, maxsize)
+end
+
+# If the value already exists in the pool then just the index value is stored.
+function Base.setindex!(op::ObjectPool, val::T, i::Integer) where {T}
+    if haskey(op.indexcache, val)
+        # The value `val` already exists in the cache.
+        # Just set the array element to the index value from the cache.
+        op.idx[i] = op.indexcache[val]
+    else
+        if op.uniqueitemscount >= op.capacity
+            throw(BoundsError("Exceeded pool capacity $(op.capacity). Consider using a larger index type e.g. UInt32."))
+        end
+        # Encountered a new value `val`:
+        # 1. add it to the object pool array
+        # 2. increment the number of unique items
+        # 3. store the new index in the cache
+        # 4. set the array element with the new index value
+        push!(op.pool, val)
+        op.uniqueitemscount += 1
+        op.indexcache[val] = op.uniqueitemscount
+        op.idx[i] = op.uniqueitemscount
+    end
+    op
+end
+
+# AbstractArray trait
+# Base.IndexStyle(::Type{<:ObjectPool}) = IndexLinear()
+
+# single indexing
+Base.getindex(op::ObjectPool, i::Number) = op.pool[op.idx[convert(Int, i)]]
+
+# general sizes
+Base.size(op::ObjectPool) = (op.itemscount, )
+# Base.length(op::ObjectPool) = op.itemscount
+# Base.endof(op::ObjectPool) = op.itemscount
+
+# typing
+# Base.eltype(op::ObjectPool) = eltype(op.pool)
+
+# make it iterable
+# Base.start(op::ObjectPool) = 1
+# Base.next(op::ObjectPool, state) = (op.pool[op.idx[state]], state + 1)
+# Base.done(op::ObjectPool, state) = state > op.itemscount
+
+# custom printing
+# function Base.show(io::IO, op::ObjectPool)
+#     L = op.itemscount
+#     print(io, "$L-element ObjectPool with $(op.uniqueitemscount) unique items:\n")
+#     if L > 20
+#         for i in 1:10 print(io, " ", op[i], "\n") end
+#         print(io, " ⋮\n")
+#         for i in L-9:L print(io, " ", op[i], "\n") end
+#     else
+#         for i in 1:L print(io, " ", op[i], "\n") end
+#     end
+# end
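For readers who don't follow Julia, the pooled-index idea in ObjectPool above can be sketched in Python. This is an illustrative analog only (the class name, the default capacity, and the method set are assumptions for the sketch), not part of the package:

```python
class ObjectPool:
    """Fixed-size sequence that stores each distinct value only once.

    Mirrors the structure above: `pool` holds unique values, `idx` holds
    small integer references into `pool`, and `cache` maps a value to its
    pool index for fast duplicate detection.
    """

    def __init__(self, default, n, capacity=2**16 - 1):
        self.pool = [default]      # unique values; slot 0 is the default
        self.idx = [0] * n         # every element starts as the default
        self.cache = {default: 0}  # value -> index into `pool`
        self.capacity = capacity

    def __setitem__(self, i, val):
        if val not in self.cache:
            if len(self.pool) >= self.capacity:
                raise IndexError("pool capacity exceeded")
            # New value: append it once and remember its slot.
            self.cache[val] = len(self.pool)
            self.pool.append(val)
        self.idx[i] = self.cache[val]

    def __getitem__(self, i):
        # Indirect lookup: small index -> shared value.
        return self.pool[self.idx[i]]

    def __len__(self):
        return len(self.idx)


op = ObjectPool("", 5)
op[0] = "CANADA"
op[1] = "CANADA"   # duplicate: reuses the existing pool slot
op[2] = "US"
# op now behaves like ["CANADA", "CANADA", "US", "", ""] while
# storing only three distinct strings in `pool`.
```

As in the Julia version, reads cost one extra indirection and writes cost one dictionary lookup; the win is that a column with millions of rows but few distinct strings stores each string once.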
