Merge pull request #33 from daltare/working-branch
fix kable tables (column names, number formatting, sorting, etc)
daltare authored Mar 30, 2024
2 parents c7b84f8 + 8ab4417 commit 7929a54
Showing 2 changed files with 122 additions and 58 deletions.

176 changes: 120 additions & 56 deletions 01_document/example_census_race_ethnicity_calculation.qmd
@@ -401,7 +401,7 @@ To get data from the ACS, you can use the `get_acs()` function, which is very si
acs_year <- 2022
```

However, since the ACS data contains data on a much broader set of socio-economic metrics, the requested data includes a greatly expanded list of variables, defined in the `census_vars_acs` object (see @sec-census-variables for more information about how to discover variables of interest and find their associated codes). As above, we can provide descriptive names associated with each variable code, which makes the data easier to work with later, but isn't strictly necessary (i.e., you could just supply the variable codes alone). Note that the use of prefixes (like `population_` or `households_`) and suffixes (like `_count`) is intentional -- those will be used later as part of the calculation process.
However, since the ACS data contains data on a much broader set of socio-economic metrics, the requested data includes a greatly expanded list of variables, defined in the `census_vars_acs` object (see @sec-census-variables for more information about how to discover variables of interest and find their associated codes). As above, we can provide descriptive names associated with each variable code, which makes the data easier to work with later, but isn't strictly necessary (i.e., you could just supply the variable codes alone). Note that the use of prefixes (like `population_` or `households_`) and suffixes (like `_count`) is intentional -- those will be used later as part of the calculation process.

```{r}
#| message: false
@@ -567,7 +567,9 @@ There are multiple ways this estimation can be done. For this example, we'll emp
The major simplifying assumption of this approach is that the population or count-based variables of interest are evenly distributed within each unit in the *source* data. For example, in this case we're assuming that population (including the total population and the population of each racial/ethnic group), households in each income bracket, populations above / below the poverty level, etc. are evenly distributed within each census block group.

::: callout-tip
While this section uses the block group-level count data from the 5-year ACS, there may be cases where it could be useful or necessary to use more granular block-level population data from the decennial census to estimate population densities and distributions within larger census units, like block groups and tracts. This could especially be the case when estimating characteristics for small areas in rural environments. See @sec-detailed-pop-estimates and/or @sec-small-area-estimates for more information.
While this section uses the block group-level count data from the 5-year ACS, there may be cases where it could be useful or necessary to use more granular block-level population data from the decennial census to estimate population densities and distributions within block groups. This could especially be the case when estimating characteristics for small and/or rural areas. See @sec-alternative-interpolate_pw for an approach that does this, and @sec-detailed-pop-estimates for detailed estimates of population alone.

See @sec-small-area-estimates for more information about challenges estimating values for small / rural areas.
:::

2. Using the estimated count data (populations, households, etc.), compute weighted values for variables that describe those populations, using the associated count data as a weighting factor (e.g., population-weighted values for population-based data, or household-weighted values for household-based data) – these variables are typically referred to as 'intensive' data types.
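
The two steps above can be sketched in a few lines of base R. All numbers and object names below are made up for illustration (two hypothetical block groups that partially overlap one target area); they are not taken from the workflow in this document.

```r
# Hypothetical source units (e.g., block groups) overlapping one target area.
# All values are invented for illustration.
source_units <- data.frame(
  unit_id                 = c("bg_1", "bg_2"),
  population_total_count  = c(1000, 400),     # extensive (count) variable
  median_household_income = c(60000, 90000),  # intensive variable
  area_total              = c(10, 8),         # total area of each unit
  area_in_target          = c(5, 2)           # area overlapping the target
)

# Step 1: areal interpolation of the count data, assuming population is
# evenly distributed within each source unit
source_units$area_fraction <-
  source_units$area_in_target / source_units$area_total
source_units$population_in_target <-
  source_units$population_total_count * source_units$area_fraction

population_estimate <- sum(source_units$population_in_target)

# Step 2: weighted value for the intensive variable, using the interpolated
# counts from step 1 as the weighting factor
income_estimate <- weighted.mean(
  x = source_units$median_household_income,
  w = source_units$population_in_target
)
```

With these made-up inputs, `population_estimate` is 600 (500 + 100) and `income_estimate` is 65,000. Note that a weighted average of medians is only an approximation of the true median for the combined area.
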
@@ -635,6 +637,10 @@ census_data_acs <- census_data_acs %>%

There are a couple of ways to implement the areal interpolation method. The example below 'manually' implements the process using functions from the `sf` package, for reasons described below. However, note that there are R packages which make it possible to perform areal interpolation with a single function - for example, the `sf` package's [`st_interpolate_aw`](https://r-spatial.github.io/sf/reference/interpolate_aw.html) function and the [`areal`](https://chris-prener.github.io/areal/) package's [`aw_interpolate`](https://chris-prener.github.io/areal/reference/aw_interpolate.html) function. This example uses a more 'manual' approach because this makes it possible to use the multi-step process described above, and also produces useful intermediate calculated data for mapping and visualization. However, we can use the single-function approach to double check our implementation of the areal interpolation approach for the count data (see @sec-check-areal-interp).
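
As a rough illustration of the single-function approach, the sketch below calls `st_interpolate_aw()` on two made-up unit squares and one target polygon that covers half of each. The `make_square()` helper, the geometries, and the population values are all hypothetical, and no coordinate reference system is set, so this only shows the mechanics of the call.

```r
library(sf)

# Hypothetical helper: build an axis-aligned rectangle as an sf polygon
make_square <- function(xmin, ymin, xmax, ymax) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmax, ymin), c(xmax, ymax), c(xmin, ymax), c(xmin, ymin)
  )))
}

# Two adjacent source units with made-up population counts
source_units <- st_sf(
  population = c(100, 200),
  geometry = st_sfc(make_square(0, 0, 1, 1), make_square(1, 0, 2, 1))
)

# One target area that covers the right half of the first unit and the
# left half of the second unit
target_area <- st_sf(
  name = "target",
  geometry = st_sfc(make_square(0.5, 0, 1.5, 1))
)

# extensive = TRUE treats the variable as a count and allocates it in
# proportion to overlapping area (sf warns that it assumes the attribute
# is uniform over each source unit)
result <- st_interpolate_aw(
  source_units["population"],
  target_area,
  extensive = TRUE
)

result$population
```

Because the target covers half of each source unit, the interpolated count should be half of 100 plus half of 200, i.e. 150.
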

::: callout-warning
Areal interpolation may not work well in some cases (for example, in areas that are largely rural or near uninhabited areas). In these cases, it's possible to use more granular block-level population data from the decennial census to estimate population densities and distributions within block groups. See @sec-alternative-interpolate_pw for an approach that does this.
:::

First, we clip the census data to the water system boundaries:

```{r}
@@ -938,11 +944,23 @@ glimpse(water_system_demographics[,1:20])

```{r}
#| label: tbl-water-sys-demographics-rev
#| tbl-cap: "Water System Demographics"
#| tbl-cap: "Estimated Water System Demographics"
#| tbl-cap-location: top
pct_format <- label_percent(accuracy = 0.01)
water_system_demographics %>%
kable(caption = 'A Caption') %>%
st_drop_geometry() %>%
mutate(across(
.cols = ends_with('_percent'),
.fns = ~ pct_format(. / 100))
) %>%
rename_with(.cols = everything(),
.fn = ~ str_replace_all(., pattern = '_', replacement = ' ') %>%
str_to_title(.)) %>%
kable(align = 'c',
format.args = list(big.mark = ',')
) %>%
scroll_box(height = "400px")
```
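
The table-formatting steps in the chunk above can be tried on a toy data frame. This is a minimal sketch: the `toy` data frame and its `population_hispanic_percent` column are hypothetical, and base R indexing is used in place of the `dplyr` verbs so that the example only needs `scales` and `stringr`.

```r
library(scales)
library(stringr)

# Hypothetical data with one count column and one percent column
# (percent values stored on a 0-100 scale)
toy <- data.frame(
  water_system_name           = c("System A", "System B"),
  population_total_count      = c(1234567, 890),
  population_hispanic_percent = c(12.3456, 7.5)
)

# Formatter that shows percentages with two decimal places
pct_format <- label_percent(accuracy = 0.01)

# label_percent() expects proportions, so divide 0-100 values by 100 first
pct_cols <- endsWith(names(toy), "_percent")
toy[pct_cols] <- lapply(toy[pct_cols], function(x) pct_format(x / 100))

# Turn snake_case column names into title-case labels,
# e.g. 'population_total_count' becomes 'Population Total Count'
names(toy) <- str_to_title(str_replace_all(names(toy), "_", " "))

# Thousands separator, as passed to kable() through format.args
count_formatted <- format(1234567, big.mark = ",")
```

Formatting the percent columns as text before calling `kable()` means they are no longer numeric, so the `big.mark` setting only affects the remaining numeric (count) columns.
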

@@ -1256,14 +1274,13 @@ if (file_update_interpolate == TRUE) {
This section is in progress.
:::

[TODO: Insert Shiny App (iframe)]
\[TODO: Insert Shiny App (iframe)\]

```{=html}
<!-- comment
<iframe width="780" height="500" src="https://cawaterdatadive.shinyapps.io/[app-name]/" title="Estimated Water System Demographics"></iframe>
end of comment -->
```

For simplicity, this section will focus on presenting estimated demographics for some of the largest water suppliers in the Sacramento County region (results for small water systems may not be very accurate and should be used with caution -- see @sec-check-pop-estimated-reported and @sec-small-area-estimates for further investigation of the results for small systems).

```{r}
@@ -1472,7 +1489,8 @@ As a check, we can add a column to the interpolated dataset (which we'll call `p
water_system_demographics_check <- water_system_demographics %>%
left_join(water_systems_sac %>%
st_drop_geometry() %>%
select(water_system_name, water_system_population_reported),
select(water_system_name, water_system_population_reported,
water_system_service_connections),
by = 'water_system_name')
water_system_demographics_check <- water_system_demographics_check %>%
@@ -1483,42 +1501,72 @@ water_system_demographics_check <- water_system_demographics_check %>%
.after = water_system_population_reported)
```
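
To make the check concrete, here is a small sketch of the percent-difference calculation on made-up values. It assumes the same formula that appears in the `interpolate_pw()` comparison later in this document (estimated minus reported, divided by reported); the two example systems are hypothetical.

```r
# Hypothetical reported vs. estimated populations for two systems
check <- data.frame(
  water_system_name                = c("Large System", "Small System"),
  water_system_population_reported = c(100000, 50),
  population_total_count           = c(98000, 20)  # estimated from census data
)

# Percent difference between the estimated and reported population,
# relative to the reported population
check$population_percent_difference <- round(
  100 * (check$population_total_count - check$water_system_population_reported) /
    check$water_system_population_reported,
  2
)

check
```

With these inputs the large system differs by -2% and the small system by -60%, which mirrors the pattern described below: the same absolute error matters far more for a system with a small reported population.
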

For water systems with a small population and/or service area, the estimated demographics may not match the reported population numbers in the water system dataset very well. You can see this in @tbl-pop-est-small by comparing the `population_reported` field, which contains the total population values from the water supplier dataset, with the `population_estimated` field, which contains the total population estimated from the census data; the difference between the two is summarized in the `population_percent_difference` field. This probably indicates that, for small areas, some adjustments and/or further analysis may be needed, and the preliminary estimated values should be treated with some caution/skepticism.

Note: See @sec-small-area-estimates below for some more investigation into water systems whose estimated population is at or near zero.
For larger water systems, the estimated population values seem to be roughly in line with the population numbers in the original dataset -- you can see this in the upper rows of @tbl-pop-est-large.

```{r}
#| label: tbl-pop-est-small
#| tbl-cap: "10 Smallest Water Systems by Population"
#| label: tbl-pop-est-large
#| tbl-cap: "Water Systems Sorted by Reported Population (Largest to Smallest)"
pct_format <- label_percent(accuracy = 0.01)
water_system_demographics_check %>%
arrange(water_system_population_reported) %>%
slice(1:10) %>%
select(water_system_name,
population_reported = water_system_population_reported,
population_estimated = population_total_count,
population_percent_difference) %>%
st_drop_geometry() %>%
kable()
arrange(desc(water_system_population_reported)) %>%
select(water_system_name,
water_system_service_connections,
water_system_population_reported,
population_total_count,
population_percent_difference,
) %>%
mutate(population_percent_difference = pct_format(
population_percent_difference / 100)) %>%
rename('Water System Name' = water_system_name,
'Service Connections' = water_system_service_connections,
'Estimated Population' = population_total_count,
'Reported Population' = water_system_population_reported,
'Percent Difference' = population_percent_difference,
) %>%
kable(align = 'c',
format.args = list(big.mark = ',')
) %>%
scroll_box(height = "400px")
```

But for larger water systems, the estimated population values seem to be more in line with the population numbers in the original dataset. You can see this in @tbl-pop-est-large by, as above, comparing the `population_reported` field, which contains the total population values from the water supplier dataset, with the `population_estimated` field, which contains the total population estimated from the census data; the difference between the two is summarized in the `population_percent_difference` field.
But for water systems with a small population and/or service area, the estimated demographics may not match the reported population numbers from the water system dataset very well -- you can see this in the top rows of @tbl-pop-est-small. This probably indicates that, for small areas, some adjustments and/or further analysis may be needed, and the preliminary estimated values should be treated with some caution/skepticism.

Note: See @sec-small-area-estimates below for some more investigation into water systems whose estimated population is at or near zero.

```{r}
#| label: tbl-pop-est-large
#| tbl-cap: "10 Largest Water Systems by Population"
#| label: tbl-pop-est-small
#| tbl-cap: "Water Systems Sorted by Reported Population (Smallest to Largest)"
pct_format <- label_percent(accuracy = 0.01)
water_system_demographics_check %>%
arrange(desc(water_system_population_reported)) %>%
slice(1:10) %>%
select(water_system_name,
population_reported = water_system_population_reported,
population_estimated = population_total_count,
population_percent_difference) %>%
st_drop_geometry() %>%
kable()
arrange(water_system_population_reported) %>%
select(water_system_name,
water_system_service_connections,
water_system_population_reported,
population_total_count,
population_percent_difference,
) %>%
mutate(population_percent_difference = pct_format(population_percent_difference / 100)) %>%
rename('Water System Name' = water_system_name,
'Service Connections' = water_system_service_connections,
'Estimated Population' = population_total_count,
'Reported Population' = water_system_population_reported,
'Percent Difference' = population_percent_difference,
) %>%
kable(align = 'c',
format.args = list(big.mark = ',')
) %>%
scroll_box(height = "400px")
```




## Considerations for Detailed Population Estimates {#sec-detailed-pop-estimates}

::: callout-warning
@@ -1553,7 +1601,7 @@ This section is in progress.

### Simplified Method With MOE Estimates {#sec-alternative-simplified}

As noted above, determining the margin of error (MOE) for estimates computed using areal weighted interpolation to aggregate portions of census units that overlap the target area of interest may not be possible (more research may be needed). If it's necessary to compute MOEs for your aggregated values, and/or it's preferable to use a simpler approach that doesn't apply areal interpolation to assign fractional portions of census units to the target area, then a simplified method could be applied.
As noted above, determining the margin of error (MOE) for estimates computed using areal weighted interpolation to aggregate portions of census units that overlap the target area of interest may not be possible (more research may be needed). If it's necessary to compute MOEs for your aggregated values, and/or it's preferable to use a simpler approach that doesn't apply areal interpolation to assign fractional portions of census units to the target area, then a simplified method could be applied.
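
For example, when whole census units are summed (with no fractional allocation), the Census Bureau's guidance for derived estimates approximates the MOE of the sum as the square root of the sum of the squared MOEs. The sketch below applies that formula to made-up estimates and MOEs for three hypothetical block groups; the `tidycensus` package also provides a `moe_sum()` helper for this calculation.

```r
# Hypothetical ACS estimates and margins of error for three block groups
# that are assigned (in full) to one target area
estimates <- c(1200, 800, 450)
moes      <- c(150, 120, 90)

# Aggregated estimate: simple sum of the component estimates
total_estimate <- sum(estimates)

# Approximate MOE of the sum: square root of the sum of squared MOEs
total_moe <- sqrt(sum(moes^2))

c(estimate = total_estimate, moe = round(total_moe, 1))
```

This approximation treats the component estimates as independent, so it should be read as a rough guide rather than an exact margin of error.
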

::: callout-tip
For guidance on how to calculate MOEs for some types of derived estimates, see [this document](https://www.census.gov/content/dam/Census/library/publications/2020/acs/acs_general_handbook_2020_ch08.pdf).
@@ -1729,7 +1777,7 @@ mapview(census_data_acs %>%
pull(GEOID))),
alpha.regions = 0.3,
col.regions = 'grey80',
color = 'grey',
color = 'grey30',
# lwd = 1.3,
label = 'NAME',
layer.name = 'ACS Data - Not Used',
@@ -1765,42 +1813,58 @@ The `tidycensus` package also has a function for population weighted interpolati
Note that some water systems may not get an estimated value using this method, even if `NA` values are removed from the source data first.

```{r}
results_interpolate_pw <- interpolate_pw(from = census_data_acs %>%
# population_total_count median_household_income
filter(!is.na(population_total_count)) %>%
select(population_total_count),
to = water_systems_sac,
to_id = 'water_system_name',
extensive = TRUE, # use FALSE for median_household_income
weights = census_data_decennial,
# weight_placement = 'surface',
weight_column = 'population_total_count') %>%
# rename(median_household_income_interpolate_pw = median_household_income) # rename results field
rename(population_total_count_interpolate_pw = population_total_count) %>%
mutate(population_total_count_interpolate_pw = round(population_total_count_interpolate_pw,
0))
# sum(is.na(results_interpolate_pw$median_household_income_interpolate_pw))
# sum(is.na(results_interpolate_pw$population_total_count_interpolate_pw))
```

This returns `{r} sum(is.na(results_interpolate_pw$population_total_count_interpolate_pw))` `NA`s (it looks like those are small areas). @tbl-interpolate-pw-compare shows a comparison of the system populations estimated using `interpolate_pw` and the reported system populations:
water_system_demographics_interpolate_pw_count <- interpolate_pw(
from = census_data_acs %>%
filter(!is.na(population_total_count)) %>%
select(ends_with('_count')),
to = water_systems_sac,
to_id = 'water_system_name',
extensive = TRUE, # use TRUE for count data
weights = census_data_decennial,
# weight_placement = 'surface',
weight_column = 'population_total_count') %>%
mutate(across(
.cols = ends_with('_count'),
.fns = ~ round(.x, 0)
)) %>%
arrange(water_system_name)
```

This returns `NA`s for `{r} sum(is.na(water_system_demographics_interpolate_pw_count$population_total_count))` water systems (it looks like those are small systems). @tbl-interpolate-pw-compare shows a comparison of the system populations estimated using `interpolate_pw` and the reported system populations:

```{r}
#| label: tbl-interpolate-pw-compare
#| tbl-cap: "Results Comparison - estimated population with interpolate_pw() vs. reported population"
#| tbl-cap: "Results Comparison - estimated population with interpolate_pw() vs. reported population (Sorted Largest to Smallest by Reported Population)"
#| tbl-cap-location: top
results_interpolate_pw %>%
pct_format <- label_percent(accuracy = 0.01)
water_system_demographics_interpolate_pw_count %>%
select(water_system_name, population_total_count) %>%
st_drop_geometry() %>%
left_join(water_systems_sac %>%
st_drop_geometry() %>%
select(water_system_service_connections,
water_system_population_reported,
water_system_name),
by = 'water_system_name') %>%
relocate(water_system_population_reported,
.after = population_total_count_interpolate_pw) %>%
kable(caption = 'A Caption') %>%
arrange(desc(water_system_population_reported)) %>%
relocate(water_system_service_connections, water_system_population_reported,
.before = population_total_count) %>%
mutate(population_percent_difference =
round(100 * (population_total_count - water_system_population_reported) /
water_system_population_reported,
2),
.after = population_total_count) %>%
mutate(population_percent_difference = pct_format(
population_percent_difference / 100)
) %>%
rename('Service Connections' = water_system_service_connections,
'Reported Population' = water_system_population_reported,
'Estimated Population' = population_total_count,
'Percent Difference' = population_percent_difference) %>%
kable(align = 'c',
format.args = list(big.mark = ',')) %>%
scroll_box(height = "400px")
```
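
To see why population weighting can give different results than area weighting, consider the conceptual sketch below. It is not how `interpolate_pw()` is implemented internally; it only illustrates the idea of allocating a block group's count using block-level populations, with made-up numbers for one hypothetical block group.

```r
# ACS count for one hypothetical block group that partially overlaps
# a target area (e.g., a water system)
block_group_count <- 1000

# Hypothetical decennial census blocks inside that block group
blocks <- data.frame(
  block_id         = c("b1", "b2", "b3"),
  block_population = c(300, 100, 600),
  block_area       = c(1, 1, 8),
  in_target        = c(TRUE, TRUE, FALSE)  # does the block fall in the target?
)

# Population weight: share of the block group's block-level population
# that falls inside the target area
pop_weight <- sum(blocks$block_population[blocks$in_target]) /
  sum(blocks$block_population)

# Area weight: share of the block group's area inside the target area
area_weight <- sum(blocks$block_area[blocks$in_target]) /
  sum(blocks$block_area)

estimate_pop_weighted  <- block_group_count * pop_weight
estimate_area_weighted <- block_group_count * area_weight
```

In this made-up case the population-weighted estimate is 400, while the area-weighted estimate is 200, because the two small blocks inside the target area are more densely populated than the large block outside it.
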
