add KLDivergence class #192
Conversation
Codecov Report
@@ Coverage Diff @@
## master #192 +/- ##
==========================================
- Coverage 62.43% 62.35% -0.08%
==========================================
Files 119 123 +4
Lines 4784 4928 +144
==========================================
+ Hits 2987 3073 +86
- Misses 1797 1855 +58
Continue to review full report at Codecov.
we will probably need to think through the terminology a bit:
I think the …
    list of columns on gdf that contain population counts of interest
network : pandana.Network object (optional)
    A pandana Network object used to compute distance between observations
metric : str
Could we make this work like scikit's `precomputed` argument? So:
- the "normal" behavior computes distances as-is from the geodataframe: `def(gdf, groups, distances=None, metric='euclidean')`
- the `metric='network'` option treats an input distance matrix as a network: `def(gdf, groups, distances=network, metric='network')`
- the `metric='precomputed'` option treats the input distance matrix as given: `def(gdf, groups, distances=my_matrix, metric='precomputed')`
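(A minimal sketch of how that dispatch could look, assuming a hypothetical `kl_divergence` signature along those lines; the function name, the centroid-based default, and the unimplemented network branch are illustrative, not the code in this PR:)

```python
# Illustrative sketch only: the scikit-style dispatch described above.
from sklearn.metrics import pairwise_distances


def kl_divergence(gdf, groups, distances=None, metric="euclidean"):
    if metric == "precomputed":
        # treat `distances` as a ready-made (n, n) cost matrix
        dist = distances
    elif metric == "network":
        # treat `distances` as a pandana.Network and build the cost matrix
        # from it on the backend (left unimplemented in this sketch)
        raise NotImplementedError("network branch sketched only")
    else:
        # "normal" behavior: compute distances from the geodataframe geometries
        coords = list(zip(gdf.centroid.x, gdf.centroid.y))
        dist = pairwise_distances(coords, metric=metric)
    return dist
```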
I'm game! I think the only change would be how the arguments are parsed? Currently `network` is only used if it's passed; otherwise the sklearn metric is used (the difference being you need to pass the pandana Network for the distance matrix to get computed internally if you want the 'network' distance option).
So if you include a `pandana.Network` then `metric='network'` (but the additional option is there because the function needs the Network object to calculate distances). I think the only change would be to allow an additional precomputed distance matrix (that's not a pandana Network); otherwise we basically have this pattern, no?
We use this same API in all the other measures, but I'm cool to update it across the package if you think that would work better.
(Also, once this is merged, the spatial and aspatial modules are getting nixed.)
Actually, this might be worth a longer conversation. For all spatially-implicit indices, distance is computed internally via the user passing a W or a Network object, which is how they get 'full control' over how distance is specified. That design makes it straightforward to handle multiscalar profiles for any SpatialImplicit index by building Ws/Networks with increasing distance bands. For spatially-explicit indices, the class usually has more options because consistency isn't important in the signature (you can't compute a multiscalar profile on them).
I guess my take is we've been looking for the same kind of flexibility you're describing, but thought of it more as being flexible in the specification of "neighbor" than "distance". Obviously, it's only a slightly different abstraction for the same concept, but for segregation specifically, it feels more natural to think about "this neighborhood is the 500m radius around the tract" than "this neighborhood is the average Mahalanobis distance from this tract to every other observation in the study area", and W/Network feels like the easier way of specifying the former?
For spatially-explicit indices (like this one) I think the only real difference is you might want to include the option to provide a precomputed distance matrix?
I'm definitely up for moving toward the best solution, but can you say a bit more about what that would look like at the package level?
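(A minimal sketch of that multiscalar pattern, assuming libpysal distance-band weights; `gdf` and the index call are placeholders, not the package API:)

```python
# Sketch: multiscalar profile by building Ws with increasing distance bands.
# Assumes `gdf` is the study GeoDataFrame in a projected CRS.
from libpysal.weights import DistanceBand

thresholds = [500, 1000, 1500, 2000]  # distance bands in map units
weights_per_scale = []
for d in thresholds:
    w = DistanceBand.from_dataframe(gdf, threshold=d, binary=False)
    weights_per_scale.append(w)
    # each W defines the "neighborhood" at that scale; a spatially-implicit
    # index would then be recomputed with it, e.g.
    # profile.append(SomeSpatialImplicitIndex(gdf, ..., w=w).statistic)
```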
I guess what I'm saying is that ultimately the relationship is always represented by some kind of distance matrix, but from the user's perspective I've thought it's usually preferable to specify the neighborhood using a W or a Network object and have us create the distmat on the backend.
I just remembered a W is already set up for this(!), so probably the best thing to do is let every index also accept a `metric` argument that gets passed to the `distance_metric` argument.
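(Something like this, assuming libpysal's distance-based weights; the wrapper name is just for illustration:)

```python
# Sketch: forward an index-level `metric` keyword to W's distance_metric.
# `build_knn_weights` is a hypothetical wrapper, not part of the package.
from libpysal.weights import KNN


def build_knn_weights(gdf, k=10, metric="euclidean"):
    # libpysal's distance-based weights already take a distance_metric
    # keyword (supported values depend on libpysal's KDTree backends,
    # e.g. 'euclidean' or 'arc'), so an index-level `metric` argument
    # can simply be passed through
    return KNN.from_dataframe(gdf, k=k, distance_metric=metric)
```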
In the case of this function specifically, if you pass a `pandana.Network`, it's implied that `metric='shortest_network_path'` (which would require you to provide a `pandana.Network`), so this is just a different way of doing the check.
I think the difference between what I've done here and what you're asking is that you could also get the shortest_network_path ahead of time, then do `metric='precomputed'` and provide `distances=compute_travel_cost_matrix(Network)`. I think the only difference between those two is we don't allow you to pass the precomputed `distances`, but otherwise I think all the different configurations are supported? I'd probably defer to you on the best way to handle the API.
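(To spell out the two configurations being compared, as purely illustrative call patterns: `kl_divergence` and its keywords are assumptions carried over from the sketch earlier in the thread, and `compute_travel_cost_matrix` is the helper named in the comment above:)

```python
# Option A: pass the pandana.Network and let the function build the
# travel-cost matrix internally
result = kl_divergence(gdf, groups, distances=network, metric="network")

# Option B: precompute the travel-cost matrix up front, then pass it as given
costs = compute_travel_cost_matrix(network)
result = kl_divergence(gdf, groups, distances=costs, metric="precomputed")
```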
Would it be cool to keep a more scikit-oriented API with the … ? Let me know what you think! Happy to implement.
Aside from the change to allow precomputed distance matrices, I've finished all the pieces I described above. We've now got …
The important remaining work is to write a function that simulates the "maximum possible segregation" given the study data. Until we have that, the GlobalDistortion index isn't scaled, so it doesn't mean much.
@jGaboardi if you've got a spare minute, any idea what's going on with these Windows tests? They seem to be stalling like a quarter of the way through for no apparent reason.
It may be a similar issue to what's going on in …
OK, so after some more thought I think the only remaining question here is how we standardize the API. A "standard" segregation profile works by increasing the threshold in a distance-based weights object (regardless of how many neighbors), then applying distance decay and recomputing the segregation statistic. The KL profile works by increasing the k in a KNN weights object (regardless of distance), then applying distance decay to the neighbors and recalculating the statistic. In both cases we're just expanding the spatial scope, so is it worth it to abstract out the shared logic that both functions can consume?
Even if the two functions don't share code, it would be good to standardize the arguments. The current segregation profile already requires a … I think it's confusing for one *profile function to have …
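(A rough sketch of what that shared abstraction could look like, assuming libpysal weights and treating the statistic as a callable; every name here is a placeholder, not a proposal for the actual signatures:)

```python
# Sketch: one helper both profile functions could consume. Expands the
# spatial scope per scale, leaving distance decay to the statistic itself.
from libpysal.weights import KNN, DistanceBand


def segregation_profile(gdf, statistic, scales, scheme="distance_band"):
    """Recompute `statistic(gdf, w)` over an expanding spatial scope."""
    results = {}
    for s in scales:
        if scheme == "distance_band":
            # "standard" profile: grow the threshold of a distance-band W
            w = DistanceBand.from_dataframe(gdf, threshold=s, binary=False)
        else:
            # KL-style profile: grow k in a KNN W
            w = KNN.from_dataframe(gdf, k=s)
        results[s] = statistic(gdf, w)
    return results
```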
Does this close #142 or is there more to do there?
I'd say this resolves #142, though we do still need to implement the scaling factor computation.
This PR adds the enhancements I requested over in #185. Thanks a ton for putting this together @noahbouchier!!
Aside from the formatting, there is one substantive change: instead of returning the matrix of Origin: [statistic, distance], it returns a dataframe of the KL divergence coefficients (the sum of the KL profile values for each unit).
The data aren't exactly the same, so no thorough testing just yet, but a quick look suggests we've got the same visual results (left) as the paper (right). One thing to note here: we will want to add the option to normalize score values as they do in the original paper. Shouldn't be too hard, but it will require us to write another function to determine the scaling factor.
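(For reference, a minimal sketch of the per-unit coefficient described above, i.e. the sum of KL profile values across scales. It assumes `counts` is an (n_units, n_groups) array of cumulative population counts within each unit's nearest neighbors at a given scale; the names and shapes are assumptions, not the class implementation:)

```python
# Sketch: KL divergence of each unit's local composition from the overall
# composition, plus the per-unit coefficient as the sum across scales.
import numpy as np
from scipy.stats import entropy


def kl_profile_value(counts, totals):
    """KL divergence of each unit's local composition vs. the study area."""
    local = counts / counts.sum(axis=1, keepdims=True)
    overall = np.broadcast_to(totals / totals.sum(), local.shape)
    # scipy's entropy(pk, qk) returns sum(pk * log(pk / qk)) along `axis`
    return entropy(local, overall, axis=1)


# profile_values: list of (n_units,) arrays, one per scale k = 1..K
# coefficient per unit = sum of the profile values across scales, e.g.
# coefficients = np.column_stack(profile_values).sum(axis=1)
```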