Improved the Readme, changed wilbert to staggered as default algorithm
Feelx234 committed Jan 8, 2025
1 parent e32f4d7 commit c6dc020
Showing 3 changed files with 22 additions and 12 deletions.
28 changes: 19 additions & 9 deletions README.md
@@ -8,7 +8,7 @@ microagg1d

A Python library which implements different techniques for optimal univariate microaggregation. The two main parameters that determine the runtime are the length n of the input array and minimal class size k. This package offers both O(n) (fast for large k) and O(kn) (fast for small k) algorithms.

The code is written in Python and relies on the [numba](https://numba.pydata.org/) compiler for speed.
The code is written in Python, but most of the number crunching happens in compiled code. This is achieved with the [numba](https://numba.pydata.org/) compiler.

Requirements
------------
@@ -30,12 +30,22 @@ Example Usage
```python
from microagg1d import univariate_microaggregation

x = [5, 1, 1, 1.1, 5, 1, 5.1]
x = [6, 2, 0.7, 1.1, 5, 1, 5.1]

clusters = univariate_microaggregation(x, k=3)

print(clusters) # [1 0 0 0 1 0 1]
```
The element at the i-th position of `clusters` assigns the i-th element of the input array `x` to its optimal cluster. In our small example, the optimal clustering is achieved with two clusters. The `6` at the first position of `x` is assigned to cluster one together with the `5` and `5.1`.
All remaining entries are assigned to cluster zero.
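
To make the label array concrete, the following minimal sketch groups the example values by their cluster label and reports each cluster's members, centroid, and sum of squared errors. It deliberately does not call the library: the labels are copied from the output shown above, and the helper `summarize` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

# Input values and the labels printed in the example above (copied, not recomputed).
x = np.array([6, 2, 0.7, 1.1, 5, 1, 5.1])
clusters = np.array([1, 0, 0, 0, 1, 0, 1])


def summarize(values, labels):
    """Hypothetical helper: map each label to (members, centroid, sse)."""
    summary = {}
    for label in np.unique(labels):
        members = values[labels == label]  # boolean mask selects one cluster
        centroid = members.mean()
        sse = float(((members - centroid) ** 2).sum())
        summary[int(label)] = (members.tolist(), float(centroid), sse)
    return summary


summary = summarize(x, clusters)
for label, (members, centroid, sse) in summary.items():
    print(label, members, round(centroid, 3), round(sse, 3))
```

Both clusters contain at least `k=3` elements, which is the constraint that microaggregation enforces.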


**Important notice**: On first import, the code is compiled once, which may take about 30s. On subsequent imports this is no longer necessary and imports are almost instant.

Below we show that one may choose the algorithm used to solve the univariate microaggregation task using the `method` keyword (valid choices are `"auto"` [default], `"simple"`, `"wilber"`, `"galil_park"`, and `"staggered"`).
Similarly, the cost function used to solve the task can be adjusted using the `cost` keyword. Valid choices for the `cost` keyword are `"sse"` (sum of squared errors), `"sae"` (sum of absolute errors), `"maxdist"` (maximum distance), `"roundup"`, and `"rounddown"`.

```python
# explicitly choose method / algorithm
clusters2 = univariate_microaggregation(x, k=3, method="wilber")

@@ -47,7 +57,7 @@ clusters3 = univariate_microaggregation(x, k=3, cost="sae")
print(clusters3) # [1 0 0 0 1 0 1]
```

**Important notice**: On first import the the code is compiled once which may take about 30s. On subsequent imports this is no longer necessary and imports are almost instant.


Tests
-----
@@ -59,14 +69,14 @@ Tests are in [tests/](https://github.com/Feelx234/microagg1d/tree/main/tests).
$ python3 -m pytest .
```

Method Details
Details on the Algorithms
--------------

Currently the package implements the following methods:
- `"simple"` [O(nk), faster for small k]
- `"wilber"` [O(n), faster for larger k]
- `"galil_park"` [O(n), fewer calls to SMAWK]
- `"staggered"` [fastest O(n)]
Currently the package implements the following four algorithms:
- `"simple"` [$O(nk)$, faster for small k]
- `"wilber"` [$O(n)$, faster for larger k]
- `"galil_park"` [$O(n)$, faster for larger k, fewer calls to SMAWK]
- `"staggered"` [fastest $O(n)$]

By default, the package switches between the simple and wilber method depending on the size of k.
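
To illustrate where an $O(nk)$ bound can come from, here is a minimal sketch of the textbook dynamic program for SSE-optimal univariate microaggregation. It is not the package's implementation of `"simple"`, and the function name `microagg_dp` is hypothetical. It relies on two standard facts, stated here as assumptions of the sketch: optimal clusters are contiguous in sorted order, and some optimal solution uses only cluster sizes between k and 2k-1.

```python
import numpy as np


def microagg_dp(x, k):
    """Illustrative O(nk) dynamic program (not the library's code)."""
    x = np.asarray(x, dtype=np.float64)
    n = len(x)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(x)")

    order = np.argsort(x, kind="stable")
    v = x[order]
    # Prefix sums give the SSE of any sorted slice v[j:i] in O(1).
    s1 = np.concatenate(([0.0], np.cumsum(v)))
    s2 = np.concatenate(([0.0], np.cumsum(v * v)))

    def sse(j, i):
        total = s1[i] - s1[j]
        return (s2[i] - s2[j]) - total * total / (i - j)

    best = np.full(n + 1, np.inf)  # best[i]: optimal cost of the first i values
    prev = np.zeros(n + 1, dtype=np.int64)  # start index of the last cluster
    best[0] = 0.0
    for i in range(k, n + 1):
        # The last cluster v[j:i] has between k and 2k-1 elements.
        for j in range(max(0, i - 2 * k + 1), i - k + 1):
            cand = best[j] + sse(j, i)
            if cand < best[i]:
                best[i] = cand
                prev[i] = j

    # Backtrack the cluster boundaries, then map labels to the input order.
    bounds = []
    i = n
    while i > 0:
        bounds.append((prev[i], i))
        i = prev[i]
    bounds.reverse()
    labels = np.empty(n, dtype=np.int64)
    for label, (j, i) in enumerate(bounds):
        labels[order[j:i]] = label
    return labels


print(microagg_dp([6, 2, 0.7, 1.1, 5, 1, 5.1], k=3))  # [1 0 0 0 1 0 1]
```

The inner loop inspects at most k candidates per position, which gives the $O(nk)$ running time after sorting. The $O(n)$ methods listed above avoid this inner loop by other means that this sketch does not attempt to reproduce.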

4 changes: 2 additions & 2 deletions src/microagg1d/main.py
@@ -39,10 +39,10 @@ def univariate_microaggregation(x, k, method="auto", stable=1, cost="sse"):
"staggered",
), "invalid method supplied"
if method == "auto":
if k <= 21: # 21 determined emperically
if k <= 21: # 21 determined empirically
method = "simple"
else:
method = "wilber"
method = "staggered"

order = np.argsort(x)
x = np.array(x, dtype=np.float64)[order]
2 changes: 1 addition & 1 deletion tests/test_main.py
@@ -89,7 +89,7 @@ def test_example_usage(self):
# pylint: disable=redefined-outer-name,reimported,import-outside-toplevel
from microagg1d import univariate_microaggregation

x = [5, 1, 1, 1.1, 5, 1, 5.1]
x = [6, 2, 0.7, 1.1, 5, 1, 5.1]

clusters = univariate_microaggregation(x, k=3)
