add KLDivergence class #192

Merged · 33 commits · Feb 16, 2022
Conversation

@knaaptime (Member)

this PR adds the enhancements I requested over in #185. Thanks a ton for putting this together @noahbouchier!!

Aside from the formatting, there is one substantive change: instead of returning the matrix of origin: [statistic, distance], it returns a dataframe of the KL divergence coefficients (the sum of the KL profile values for each unit)

the data aren't exactly the same, so no thorough testing just yet, but a quick look suggests we've (left) got the same visual results as the paper (right). One thing to note here: we will want to add the option to normalize score values as they do in the original paper. Shouldn't be too hard, but it will require us to write another function to determine the scaling factor

[figure: our results (left) compared against the paper's results (right)]

@codecov bot commented Feb 7, 2022

Codecov Report

Merging #192 (4f03ea5) into master (deae25c) will decrease coverage by 0.07%.
The diff coverage is 71.69%.


@@            Coverage Diff             @@
##           master     #192      +/-   ##
==========================================
- Coverage   62.43%   62.35%   -0.08%     
==========================================
  Files         119      123       +4     
  Lines        4784     4928     +144     
==========================================
+ Hits         2987     3073      +86     
- Misses       1797     1855      +58     
Impacted Files                                    Coverage Δ
segregation/batch/batch_compute.py                90.36% <ø> (ø)
segregation/dynamics/__init__.py                 100.00% <ø> (ø)
segregation/local/local_multi_diversity.py        95.23% <0.00%> (ø)
segregation/singlegroup/dissim.py                 96.29% <ø> (ø)
segregation/singlegroup/modified_dissim.py        67.27% <26.31%> (-23.64%) ⬇️
segregation/singlegroup/modified_gini.py          69.23% <26.31%> (-25.90%) ⬇️
segregation/network/network.py                    64.93% <60.71%> (+39.35%) ⬆️
segregation/dynamics/divergence_profile.py        82.35% <82.35%> (ø)
segregation/_base.py                              82.47% <85.71%> (ø)
segregation/local/local_distortion.py             94.73% <94.73%> (ø)
... and 14 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update deae25c...4f03ea5.

@knaaptime (Member, Author)

we will probably need to think through the terminology a bit:

  • I think what I've done here should maybe be called a LocalDistortionCoefficient, following the nomenclature they use in this paper.
  • The KL profiles we should probably put into the dynamics module.
  • The GlobalDistortionIndex from this paper should go in the multigroup module, based on this formula:
    [formula image: the Global Distortion Index from the paper]

I think the _kl_divergence_profile @noahbouchier wrote gets us 90% of the way to each of those measures; it's just a matter of how we summarize and reshape the output for these different measures
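For reference, the quantity all three measures build on is the KL divergence between a unit's local group composition and the regional composition. A minimal sketch of that core term (this is illustrative only, not the package's `_kl_divergence_profile`, which evaluates it over an expanding nearest-neighbor environment around every unit):

```python
import numpy as np

def kl_divergence(local_props, regional_props):
    """KL divergence of a unit's group shares from the regional group shares.

    Illustrative sketch: `local_props` and `regional_props` are arrays of
    group proportions that each sum to 1.
    """
    p = np.asarray(local_props, dtype=float)
    q = np.asarray(regional_props, dtype=float)
    mask = p > 0  # terms with a zero local share contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```

Aggregating these values per unit across the profile is what produces the coefficients in the dataframe described above.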

list of columns on gdf that contain population counts of interest
network: pandana.Network object (optional)
A pandana Network object used to compute distance between observations
metric : str
@ljwolf (Member) commented Feb 8, 2022

Could we make this work like scikit's precomputed argument? So:

  1. the "normal" behavior computes distances as is from the geodataframe: def(gdf, groups, distances=None, metric='euclidean')
  2. The metric='network' option treats an input distance matrix as a network:
    def(gdf, groups, distances=network, metric='network')
  3. The metric='precomputed' option treats the input distance matrix as given:
    def(gdf, groups, distances=my_matrix, metric='precomputed')

@knaaptime (Member, Author) replied Feb 8, 2022

I'm game! I think the only change would be how the arguments are parsed? Currently network is only used if it's passed; otherwise the sklearn metric is used (the difference being you need to pass the pandana Network for the distance matrix to get computed internally if you want the 'network' distance option).

So if you include a pandana.Network then metric='network' (the additional option is there because the function needs the Network object to calculate distances). I think the only change would be to allow an additional precomputed distance matrix (that's not a pandana Network); otherwise we basically have this pattern, no?

we use this same API in all the other measures, but I'm cool to update it across the package if you think that would work better

(Also once this is merged the spatial and aspatial modules are getting nixed)

@knaaptime (Member, Author)

actually, this might be worth a longer conversation. For all spatially-implicit indices, distance is computed internally via the user passing a W or a Network object, which is how they get 'full control' over how distance is specified. That design makes it straightforward to handle multiscalar profiles for any SpatialImplicit index by building Ws/Networks with increasing distance bands. For spatially-explicit indices, the class usually has more options because consistency isn't important in the signature (you can't compute a multiscalar profile on them).

I guess my take is we've been looking for the same kind of flexibility you're describing, but thought of it more as being flexible in the specification of "neighbor" than "distance". Obviously, it's only a slightly different abstraction for the same concept, but for segregation specifically, it feels more natural to think about "this neighborhood is the 500m radius around the tract" than "this neighborhood is the average Mahalanobis distance from this tract to every other observation in the study area", and W/Network feels like the easier way of specifying the former?

For spatially-explicit indices (like this one) I think the only real difference is you might want to include the option to provide a precomputed distance matrix?

I'm definitely up for moving toward the best solution, but can you say a bit more about what that would look like at the package level?

@knaaptime (Member, Author)

I guess what I'm saying is that ultimately the relationship is always represented by some kind of distance matrix, but from the user's perspective I've thought it's usually preferable to specify the neighborhood using a W or a Network object and have us create the distance matrix on the backend.

I just remembered a W is already set up for this(!), so probably the best thing to do is have every index also accept a metric argument that gets passed through to the W's distance_metric argument.
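As a toy illustration of that pass-through (a stand-in, not libpysal code; the real W classes expose the argument as `distance_metric`):

```python
import numpy as np
from scipy.spatial.distance import cdist

def distance_band_adjacency(coords, threshold, metric="euclidean"):
    """Toy distance-band 'W' that forwards a scikit-style metric argument."""
    d = cdist(coords, coords, metric=metric)
    np.fill_diagonal(d, np.inf)  # a unit is never its own neighbor
    return d <= threshold
```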

@knaaptime (Member, Author) commented Feb 8, 2022

in the case of this function specifically, if you pass a pandana.Network, it's implied that metric='shortest_network_path' (which requires you to provide a pandana.Network), so this is just a different way of doing the check.

I think the difference between what I've done here and what you're asking is that you could also get the shortest network path ahead of time, then do metric='precomputed' and provide distances=compute_travel_cost_matrix(network). I think the only difference between those two is that we don't let you pass the precomputed distances, but otherwise all the different configurations are supported? I'd probably defer to you on the best way to handle the API

@ljwolf (Member) commented Feb 8, 2022

Would it be cool to keep a more scikit-oriented API with the metric argument? This was the intent behind specifying the metric argument as we did in GSoC. We can extend the scikit-style metric argument to include a special "network" option, too, and treat it as a special case, just as they already treat metric='precomputed'...

Let me know what you think! Happy to implement.

@knaaptime (Member, Author)
aside from the change to allow precomputed distance matrices, I've finished all the pieces I described above. We've now got

  • dynamics.compute_divergence_profiles
  • multigroup.GlobalDistortion
  • local.LocalDistortion

The important remaining work is to write a function that simulates the "maximum possible segregation" given the study data. Until we have that, the GlobalDistortion index isn't scaled, so it doesn't mean much

@knaaptime (Member, Author)
@jGaboardi if you got a spare minute, any idea what's going on with these windows tests? They seem to be stalling like a quarter of the way through with no apparent reason?

@jGaboardi (Member) commented Feb 10, 2022

It may be a similar issue to what's going on in spaghetti --> pysal/spaghetti#666. (But also, maybe not...)

@knaaptime (Member, Author)

ok, so after some more thought I think the only remaining question here is how we standardize the API. A "standard" segregation profile works by increasing the threshold in a distance-based weights object (regardless of how many neighbors), then applying distance decay and recomputing the segregation statistic. The KL profile works by increasing the k in a KNN weights object (regardless of distance), then applying distance decay to the neighbors and recalculating the statistic. In both cases we're just expanding the spatial scope, so is it worth abstracting out the shared logic so both functions can consume it?

even if the two functions don't share code, it would be good to standardize the arguments. The current segregation profile already requires a distances argument, which is a list of threshold distances passed to a libpysal.W (or a pandana.Network aggregation query). In the approach described above, distances would be either the matrix of pairwise distances between observations if metric='precomputed', or a pandana.Network if metric='network' (which would create the distance matrix internally, so you'd also need to provide the pandana.Network object).

I think it's confusing for one *profile function to take distances as a list of thresholds and another to take distances as a pairwise distance matrix, so what about:

  • in compute_segregation_profile:
    • change the argument distances to distance_thresholds
    • take a metric argument (it can just pass through to the W)
  • in compute_divergence_profiles:
    • take a distance_matrix argument, which can be either a numpy array (implying metric='precomputed') or a pandana.Network (implying metric='network_lcp' or something)
    • since compute_divergence_profiles always steps through all neighbors, there's no translation of *thresholds
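Put together, the proposal above would look something like the following signatures (purely illustrative stubs: the argument names follow the bullets in this comment, not released code):

```python
def compute_segregation_profile(gdf, groups, distance_thresholds, metric="euclidean"):
    """Recompute a spatially implicit index at each threshold of a
    distance-band W, passing `metric` through to the W."""
    ...

def compute_divergence_profiles(gdf, groups, distance_matrix=None):
    """`distance_matrix` is either a precomputed numpy array
    (metric='precomputed') or a pandana.Network (network shortest-path
    distances computed internally); the profile always steps through all
    neighbors, so no thresholds are needed."""
    ...
```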

@knaaptime knaaptime merged commit dbb15c5 into pysal:master Feb 16, 2022
@knaaptime knaaptime deleted the divergence branch February 16, 2022 18:08
@sjsrey (Member) commented Feb 26, 2022

Does this close #142 or is there more to do there?

@knaaptime (Member, Author)

I'd say this resolves #142, though we do still need to implement the scaling factor computation
