bivariate_reexpression.qmd

---
output: html_document
editor_options: 
  chunk_output_type: console
---

# Re-expressing values

```{r echo=FALSE}
source("libs/Common.R")
```

```{r echo = FALSE}
pkg_ver(c("dplyr", "tukeyedar"))
```

-----------------------------------

When bivariate model assumptions, such as homogeneity of spread or normally distributed residuals, are not met, changes in measurement scales through re-expression of one or both variables can help address model shortcomings. While there is no general rule about which variable to re-express (i.e., the $X$ or $Y$ variable), theoretical considerations should be taken into account if applicable. An example of an analysis that benefits from re-expression is highlighted in this section.

In this chapter, we will explore the relationship between median rent and median household income for the State of Florida. The data are pulled from the `tukeyedar` package as follows.

```{r class.source="eda"}
library(dplyr)

dat <- tukeyedar::incrent22 %>% 
  filter(State == "Florida")
```

## Re-expressing the data

A demonstration of the workflow will follow, with a focus on the analysis rather than the code. The functions and code chunks required to generate these plots have been covered in previous chapters.

To begin, we will fit a first-order polynomial to the data and examine its diagnostic plots.

```{r fig.height=3, fig.width=10, results='hide', echo = FALSE}
library(tukeyedar)

options(scipen = 99)
OP <- par(mfrow = c(1,3), cex = 1)
 M1 <- eda_lm(dat, Income, Rent, py = 1, robust = TRUE, poly = 1, show.par = FALSE,
              mean.l =  FALSE, sd = FALSE)
 plot(M1)
 eda_sl(M1)
par(OP)
```

The data indicates a clear increase in rent as a function of income, suggesting that a straight-line fit could serve as a reasonable initial estimate.

However, the residuals-dependence (R-D) plot reveals a non-random pattern, indicating that the model may not fully capture all aspects of the data. While the pattern hints at potential curvature, its precise nature remains unclear. The lack of a definitive direction from the R-D plot makes it challenging to determine specific modifications to the fitted model.

The spread-location (S-L) plot, however, provides stronger evidence of heteroscedasticity. The fitted LOESS curve exhibits a nearly linear, monotonically increasing trend, suggesting that the variance of residuals grows with the fitted values. 

The diagnostic plots point to two potential issues: (1) the model may not adequately capture the pattern in the data, and (2) the presence of heteroscedasticity. It is generally advisable to address one issue at a time, as resolving a dominant problem can sometimes mitigate others. Given that heteroscedasticity is the more prominent issue here, we will tackle it first. 

Heteroscedasticity can often be corrected by re-expressing the $Y$ variable. In this analysis, we will explore power transformations to achieve a more desirable S-L plot. The following three S-L plots represent first-order polynomial fits applied to the data after re-expressing the rent variable using the logarithm, the inverse power (-1), and the inverse squared power (-2). To preserve the nature of the relationship between $X$ and $Y$ (i.e. avoiding sign reversal caused by negative powers), we can apply a Box-Cox transformation or multiply rent by minus one.

```{r fig.height=3, fig.width=10, results='hide', echo = FALSE}

f1 <- function(p){
   M1 <- with(dat, lm(eda_re(Rent,p) ~ Income))
     eda_sl(M1)
     mtext(paste0("power = ",p), side = 3, line =1.2, adj=1, col = "blue")
}

OP <- par(mfrow = c(1,3), cex = 1)
sapply(c(0,-1,-2), FUN = function(x) f1(x))

par(OP)
```

A power transformation of -2 seems to do good job in stabilizing variance.

Next, we'll plot the fitted model and diagnostic plots with the transformed rent values.

```{r fig.height=3, fig.width=10, results='hide', echo = FALSE}
options(scipen = 1, digits = 3)
dat$Rent2 <- -dat$Rent^-2
OP <- par(mfrow = c(1,3), cex = 1)
#OP <- par(mfrow = c(1,3), cex =0.9, cex.axis = 0.8, cex.lab = 0.9)
 M1 <- eda_lm(dat, Income, Rent2, py = 1,  poly = 1, show.par = FALSE,
              mean.l =  FALSE, sd = FALSE, ylab = expression(-Rent^(-2)))
 plot(M1)
 eda_sl(M1)
par(OP)
```

The power transformation altered the pattern in the scatter plot, making the curvature hinted at in the earlier diagnostic plots more pronounced with the transformed rent variable. This provides a strong case for considering a curvilinear model. The next set of plots illustrates the results of fitting a second-order polynomial to the data.

```{r fig.height=3, fig.width=10, results='hide', echo = FALSE}
options(scipen = 1, digits = 3)
OP <- par(mfrow = c(1,3), cex = 1)
#OP <- par(mfrow = c(1,3), cex =0.9, cex.axis = 0.8, cex.lab = 0.9)
 M1 <- eda_lm(dat, Income, Rent2, py = 1, poly = 2, show.par = FALSE,
              mean.l =  FALSE, sd = FALSE, ylab = expression(-Rent^(-2)))
 plot(M1)
 eda_sl(M1, type = "dependence")
par(OP)
```

The nearly flat Loess curve in the R-D plot suggests that the second-order polynomial effectively captures the data's pattern. However, this improvement comes at a cost. The S-L plot indicates a monotonically decreasing variance with increasing fitted values. Recall from the chapter on S-L plots that when a model exhibits curvature, it can be helpful to generate a spread-dependence (S-D) plot with the independent variable ($X$) on the x-axis instead of the fitted values. The resulting S-D plot is shown below:

```{r fig.height=2.8, fig.width=2.8, results='hide', echo = FALSE}
eda_sl(M1, type = "dependence")
```

The S-D plot confirms the presence of heteroscedasticity.

To address the heteroscedasticity without altering the second-order polynomial fit, we revisit the choice of transformations. Experimenting with several power transformations, we find that a log transformation stabilizes the variance while preserving the curvilinear relationship.

```{r fig.height=3, fig.width=10, results='hide', echo = FALSE}
OP <- par(mfrow = c(1,3), cex = 1)
#OP <- par(mfrow = c(1,3), cex =0.9, cex.axis = 0.8, cex.lab = 0.9)
 M1 <- eda_lm(dat, Income, Rent, py = 0, poly = 2, show.par = FALSE,
              mean.l =  FALSE, sd = FALSE, ylab = "log(Rent)")
 plot(M1)
 eda_sl(M1, type = "dependence")
par(OP)
```

Combining the log transformation with the second-order polynomial fit strikes a reasonable balance between model fit and assumption adherence. The scatter plot shows that a "perfect" model is unlikely, as a few points deviate substantially from the modeled relationship. These outliers may represent counties where the relationship between rent and income differs from the majority, potentially wielding disproportionate influence on the model.

Another diagnostic plot yet to be explored is the Normal Q-Q plot of the residuals. This plot is shown below:

```{r fig.height=2.8, fig.width=2.8, results='hide', echo = FALSE}
eda_theo(M1$residuals, ylab = "Residuals", show.par = FALSE)
```

The Q-Q plot reveals symmetry in the residuals, with no evidence of skew.  Within the +/- 1 standard deviation range, the points stay close to the line, though the slight "S" shape at the tails indicates shorter than expected tails.

This example highlights the iterative nature of exploratory analysis. It also demonstrates that there is no single "correct" workflow nor a single "correct" answer.