add KLDivergence class #192

Merged · 33 commits · Feb 16, 2022
Conversation

@knaaptime (Member)

this PR adds the enhancements I requested over in #185. Thanks a ton for putting this together @noahbouchier!!

Aside from the formatting, there is one substantive change: instead of returning the matrix of origin: [statistic, distance], it returns a dataframe of the KL divergence coefficients (the sum of the KL profile values for each unit)

the data aren't exactly the same, so no thorough testing just yet, but a quick look suggests we've (left) got the same visual results as the paper (right). One thing to note here: we will want to add the option to normalize score values as they do in the original paper. Shouldn't be too hard, but it will require us to write another function to determine the scaling factor

[figure: our results (left) compared against the paper's results (right)]

@codecov bot commented Feb 7, 2022

Codecov Report

Merging #192 (4f03ea5) into master (deae25c) will decrease coverage by 0.07%.
The diff coverage is 71.69%.


@@            Coverage Diff             @@
##           master     #192      +/-   ##
==========================================
- Coverage   62.43%   62.35%   -0.08%     
==========================================
  Files         119      123       +4     
  Lines        4784     4928     +144     
==========================================
+ Hits         2987     3073      +86     
- Misses       1797     1855      +58     
Impacted Files                                    Coverage Δ
segregation/batch/batch_compute.py                90.36% <ø> (ø)
segregation/dynamics/__init__.py                 100.00% <ø> (ø)
segregation/local/local_multi_diversity.py        95.23% <0.00%> (ø)
segregation/singlegroup/dissim.py                 96.29% <ø> (ø)
segregation/singlegroup/modified_dissim.py        67.27% <26.31%> (-23.64%) ⬇️
segregation/singlegroup/modified_gini.py          69.23% <26.31%> (-25.90%) ⬇️
segregation/network/network.py                    64.93% <60.71%> (+39.35%) ⬆️
segregation/dynamics/divergence_profile.py        82.35% <82.35%> (ø)
segregation/_base.py                              82.47% <85.71%> (ø)
segregation/local/local_distortion.py             94.73% <94.73%> (ø)
... and 14 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update deae25c...4f03ea5.

@knaaptime (Member, Author)

we will probably need to think through the terminology a bit:

  • I think what I've done here should maybe be called a LocalDistortionCoefficient, following the nomenclature they use in this paper.
  • The KL profiles we should probably put into the dynamics module.
  • The GlobalDistortionIndex from this paper should go in the multigroup module, based on this formula:
    [formula image: the Global Distortion Index from the paper]

I think the _kl_divergence_profile @noahbouchier wrote gets us 90% of the way to each of those measures; it's just a matter of how we summarize and reshape the output for these different measures
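For reference, the quantity all three measures build on is the KL divergence between a unit's local group composition and the regional composition. A minimal sketch of that core term (this is illustrative only, not the package's `_kl_divergence_profile`, which evaluates it over an expanding nearest-neighbor environment around every unit):

```python
import numpy as np

def kl_divergence(local_props, regional_props):
    """KL divergence of a unit's group shares from the regional group shares.

    Illustrative sketch: `local_props` and `regional_props` are arrays of
    group proportions that each sum to 1.
    """
    p = np.asarray(local_props, dtype=float)
    q = np.asarray(regional_props, dtype=float)
    mask = p > 0  # terms with a zero local share contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```

Aggregating these values per unit across the profile is what produces the coefficients in the dataframe described above.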

list of columns on gdf that contain population counts of interest
network: pandana.Network object (optional)
A pandana Network object used to compute distance between observations
metric : str
@ljwolf (Member) commented Feb 8, 2022

Could we make this work like scikit's precomputed argument? So:

  1. the "normal" behavior computes distances as is from the geodataframe: def(gdf, groups, distances=None, metric='euclidean')
  2. The metric='network' option treats an input distance matrix as a network:
    def(gdf, groups, distances=network, metric='network')
  3. The metric='precomputed' option treats the input distance matrix as given:
    def(gdf, groups, distances=my_matrix, metric='precomputed')

@knaaptime (Member, Author) replied Feb 8, 2022

I'm game! I think the only change would be how the arguments are parsed? Currently network is only used if it's passed; otherwise the sklearn metric is used (the difference being you need to pass the pandana Network for the distance matrix to get computed internally if you want the 'network' distance option).

So if you include a pandana.Network then metric='network' (the additional option is there because the function needs the Network object to calculate distances). I think the only change would be to allow an additional precomputed distance matrix (that's not a pandana Network); otherwise we basically have this pattern, no?

we use this same API in all the other measures, but I'm cool to update it across the package if you think that would work better

(Also once this is merged the spatial and aspatial modules are getting nixed)

@knaaptime (Member, Author)

actually, this might be worth a longer conversation. For all spatially-implicit indices, distance is computed internally via the user passing a W or a Network object, which is how they get 'full control' over how distance is specified. That design makes it straightforward to handle multiscalar profiles for any SpatialImplicit index by building Ws/Networks with increasing distance bands. For spatially-explicit indices, the class usually has more options because consistency isn't important in the signature (you can't compute a multiscalar profile on them).

I guess my take is we've been looking for the same kind of flexibility you're describing, but thought of it more as being flexible in the specification of "neighbor" than "distance". Obviously, it's only a slightly different abstraction for the same concept, but for segregation specifically, it feels more natural to think about "this neighborhood is the 500m radius around the tract" than "this neighborhood is the average Mahalanobis distance from this tract to every other observation in the study area", and W/Network feels like the easier way of specifying the former?

For spatially-explicit indices (like this one) I think the only real difference is you might want to include the option to provide a precomputed distance matrix?

I'm definitely up for moving toward the best solution, but can you say a bit more about what that would look like at the package level?

@knaaptime (Member, Author)

I guess what I'm saying is that ultimately the relationship is always represented by some kind of distance matrix, but from the user's perspective I've thought it's usually preferable to specify the neighborhood using a W or a Network object and have us create the distance matrix on the backend.

I just remembered a W is already set up for this(!), so probably the best thing to do is have every index also accept a metric argument that gets passed through to the W's distance_metric argument.
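As a toy illustration of that pass-through (a stand-in, not libpysal code; the real W classes expose the argument as `distance_metric`):

```python
import numpy as np
from scipy.spatial.distance import cdist

def distance_band_adjacency(coords, threshold, metric="euclidean"):
    """Toy distance-band 'W' that forwards a scikit-style metric argument."""
    d = cdist(coords, coords, metric=metric)
    np.fill_diagonal(d, np.inf)  # a unit is never its own neighbor
    return d <= threshold
```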

@knaaptime (Member, Author) commented Feb 8, 2022

in the case of this function specifically, if you pass a pandana.Network, it's implied that metric='shortest_network_path' (which requires you to provide a pandana.Network), so this is just a different way of doing the check.

I think the difference between what I've done here and what you're asking is that you could also get the shortest network path ahead of time, then do metric='precomputed' and provide distances=compute_travel_cost_matrix(network). I think the only difference between those two is that we don't let you pass the precomputed distances, but otherwise all the different configurations are supported? I'd probably defer to you on the best way to handle the API

@ljwolf (Member) commented Feb 8, 2022

Would it be cool to keep a more scikit-oriented API with the metric argument? This was the intent behind specifying the metric argument as we did in GSoC. We can extend the scikit-style metric argument to include a special "network" option, too, and treat it as a special case, just as they already treat metric='precomputed'...

Let me know what you think! Happy to implement.

@knaaptime (Member, Author)
aside from the change to allow precomputed distance matrices, I've finished all the pieces I described above. We've now got

  • dynamics.compute_divergence_profiles
  • multigroup.GlobalDistortion
  • local.LocalDistortion

The important remaining work is to write a function that simulates the "maximum possible segregation" given the study data. Until we have that, the GlobalDistortion index isn't scaled, so it doesn't mean much

@knaaptime (Member, Author)
@jGaboardi if you got a spare minute, any idea what's going on with these windows tests? They seem to be stalling like a quarter of the way through with no apparent reason?

@jGaboardi (Member) commented Feb 10, 2022

It may be a similar issue to what's going on in spaghetti --> pysal/spaghetti#666. (But also, maybe not...)

@knaaptime (Member, Author)

ok, so after some more thought I think the only remaining question here is how we standardize the API. A "standard" segregation profile works by increasing the threshold in a distance-based weights object (regardless of how many neighbors), then applying distance decay and recomputing the segregation statistic. The KL profile works by increasing the k in a KNN weights object (regardless of distance), then applying distance decay to the neighbors and recalculating the statistic. In both cases we're just expanding the spatial scope, so is it worth abstracting out the shared logic so both functions can consume it?

even if the two functions don't share code, it would be good to standardize the arguments. The current segregation profile already requires a distances argument, which is a list of threshold distances passed to a libpysal.W (or a pandana.Network aggregation query). In the approach described above, distances would be either the matrix of pairwise distances between observations if metric='precomputed', or a pandana.Network if metric='network' (which would create the distance matrix internally, so you'd also need to provide the pandana.Network object).

I think it's confusing for one *profile function to take distances as a list of thresholds and another to take distances as a pairwise distance matrix, so what about:

  • in compute_segregation_profile:
    • change the argument distances to distance_thresholds
    • take a metric argument (it can just pass through to the W)
  • in compute_divergence_profiles:
    • take a distance_matrix argument, which can be either a numpy array (implying metric='precomputed') or a pandana.Network (implying metric='network_lcp' or something)
    • since compute_divergence_profiles always steps through all neighbors, there's no translation of *thresholds
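Put together, the proposal above would look something like the following signatures (purely illustrative stubs: the argument names follow the bullets in this comment, not released code):

```python
def compute_segregation_profile(gdf, groups, distance_thresholds, metric="euclidean"):
    """Recompute a spatially implicit index at each threshold of a
    distance-band W, passing `metric` through to the W."""
    ...

def compute_divergence_profiles(gdf, groups, distance_matrix=None):
    """`distance_matrix` is either a precomputed numpy array
    (metric='precomputed') or a pandana.Network (network shortest-path
    distances computed internally); the profile always steps through all
    neighbors, so no thresholds are needed."""
    ...
```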

@knaaptime knaaptime merged commit dbb15c5 into pysal:master Feb 16, 2022
@knaaptime knaaptime deleted the divergence branch February 16, 2022 18:08
@sjsrey (Member) commented Feb 26, 2022

Does this close #142 or is there more to do there?

@knaaptime (Member, Author)

I'd say this resolves #142, though we do still need to implement the scaling factor computation
