# Age- and sex-adjusted incidence rates

Age- and sex-adjusted incidence rates with poisson regression in R

In this first post I’m going to present a way of obtaining age- and sex-adjusted incidence rates using poisson regression in R. This will be similar to what is done in Stata as described here.

I’ve written an R function that’s available for download here. The script can be sourced (`source("age_sex_adjust.R")`) and then the function `age_sex_adjust()` can be used as is. The code will also be described step by step below.

There are, as usual, several ways to calculate adjusted incidence rates in R. I’ve chosen to use the package stdReg by Arvid Sjölander because it has a lot of nice features and is useful for causal inference. Specifically, we will use the function `stdGlm()` from stdReg to generate the adjusted incidence rates.

But first, a little bit of background on what an incidence rate is. It is simply the number of occurrences (a count) in a population divided by the total person-time at risk in that population. For example, say that in a population of 10 people, each followed for 1 year, there was one death. In that population, the incidence rate of death would be 1 per 10 person-years. In observational data, we often have larger cohorts with varying follow-up time and censoring. The calculation is of course the same, using the formula below:

\[\text{Incidence rate} = \frac{\text{Number of occurrences}}{\sum_{\text{Persons}}{\text{Time in study}}}\]
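As a small illustration of the formula with varying follow-up, here is a sketch using a made-up cohort of ten people. The data frame `follow_up` and its values are hypothetical and only there to show the arithmetic.

```r
# Hypothetical follow-up data: ten people followed for different lengths of time
follow_up <- data.frame(
  id    = 1:10,
  years = c(1.0, 0.5, 2.0, 1.5, 3.0, 0.25, 1.0, 2.5, 0.75, 1.5),
  event = c(0, 1, 0, 0, 1, 0, 0, 0, 1, 0)
)

events      <- sum(follow_up$event)  # number of occurrences
person_time <- sum(follow_up$years)  # total time in study, in person-years

# Incidence rate expressed per 100 person-years
rate_per_100 <- events / person_time * 100
rate_per_100
```

With 3 events over 14 person-years, this gives roughly 21.4 events per 100 person-years.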

### Calculating crude incidence rate

To illustrate, we will now use the `colon` dataset from the `survival` package.

```
library(survival)
library(dplyr)
library(broom)
```

Running `?survival::colon` tells us the following:

Data from one of the first successful trials of adjuvant chemotherapy for colon cancer

Variable | Explanation |
---|---|
id | Patient id |
study | 1 for all patients |
rx | treatment: Obs(ervation), Lev(amisole), Lev(amisole)+5-FU |
sex | 1 = male |
age | in years |
obstruct | colon obstruction by tumour |
perfor | perforation of colon |
adhere | adherence to nearby organs |
nodes | number of positive lymph nodes |
time | days until event or censoring |
status | censoring status |
differ | tumour differentiation: 1 = well, 2 = moderate, 3 = poor |
extent | extent of local spread: 1 = submucosa, 2 = muscle, 3 = serosa, 4 = contiguous structures |
surg | time from surgery to registration: 0 = short, 1 = long |
node4 | more than 4 positive lymph nodes |
etype | event type: 1 = recurrence, 2 = death |

OK, so now that we understand the data, let’s start calculating crude incidence rates for death among the different treatment groups:

```
# Using the colon dataset from the survival package
# Only keep records related to the death outcome
colon_death <- colon %>% dplyr::filter(etype == 2)
# time is recorded in days: dividing by 365.25 converts it to years,
# and dividing by 100 again expresses it in units of 100 person-years
colon_death %>%
  group_by(rx) %>%
  summarise(Events = sum(status),
            Time = sum(time/365.25/100),
            Rate = Events / Time,
            lower = poisson.test(Events, Time)$conf.int[1],
            upper = poisson.test(Events, Time)$conf.int[2])
```

```
## # A tibble: 3 x 6
## rx Events Time Rate lower upper
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Obs 168 13.8 12.2 10.4 14.2
## 2 Lev 161 13.7 11.7 10.0 13.7
## 3 Lev+5FU 123 15.0 8.22 6.83 9.80
```
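The confidence limits above come from `poisson.test()`, which returns an exact interval. As far as I understand base R's implementation, the limits are gamma quantiles divided by the person-time. The sketch below checks this on hypothetical numbers (20 events over 4 units of 100 person-years), not on the colon data.

```r
events <- 20  # hypothetical event count
time   <- 4   # hypothetical person-time, in units of 100 person-years

pt <- poisson.test(events, time)

# The same limits computed by hand from gamma quantiles
manual <- c(qgamma(0.025, shape = events),
            qgamma(0.975, shape = events + 1)) / time

c(rate = events / time, lower = manual[1], upper = manual[2])
```

Comparing `manual` with `pt$conf.int` should show the same two numbers.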

Now we compare the calculated rates with the rates obtained from the `survRate()` function from the biostat3 package:

```
library(biostat3)
survRate(Surv(time/365.25/100, status) ~ rx, data = colon_death) %>%
  dplyr::select(rx, event, tstop, rate, lower, upper) %>%
  as_tibble() %>%
  dplyr::rename(Events = event,
                Time = tstop,
                Rate = rate)
```

```
## # A tibble: 3 x 6
## rx Events Time Rate lower upper
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Obs 168 13.8 12.2 10.4 14.2
## 2 Lev 161 13.7 11.7 10.0 13.7
## 3 Lev+5FU 123 15.0 8.22 6.83 9.80
```

Good, the incidence rates are identical. The patients in the observation group had a mortality incidence rate of 12.2 per 100 person-years, compared to 8.22 per 100 person-years among the patients treated with Lev+5-FU. Now, let’s try to reproduce these results with Poisson regression.

### Obtaining estimated incidence rates using Poisson regression

Here we use the `tidy` function from the broom package to obtain exponentiated estimates:

```
poisson_fit <- glm(status ~ rx + offset(log(time/365.25/100)),
                   family = poisson, data = colon_death)
tidy(poisson_fit, exponentiate = TRUE, conf.int = TRUE)
```

```
## # A tibble: 3 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 12.2 0.0772 32.4 3.13e-230 10.4 14.1
## 2 rxLev 0.965 0.110 -0.324 7.46e- 1 0.777 1.20
## 3 rxLev+5FU 0.675 0.119 -3.32 9.16e- 4 0.534 0.850
```

The Intercept estimate here is the estimated IR for the reference level, i.e. the Obs group.
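To see why this holds, here is a minimal sketch on simulated data. The groups `A`, `B` and `C` and their rates are made up for the illustration. In a Poisson model with only a group term and a log-time offset, the exponentiated intercept equals the crude rate in the reference group, and the remaining exponentiated coefficients are rate ratios against that group.

```r
set.seed(1)
# Simulated (hypothetical) cohort: three groups with different true rates
n     <- 300
group <- factor(rep(c("A", "B", "C"), each = n / 3))
time  <- runif(n, min = 0.5, max = 5)  # follow-up in years
rate  <- c(A = 0.10, B = 0.08, C = 0.05)[as.character(group)]
event <- rpois(n, lambda = rate * time)

fit <- glm(event ~ group + offset(log(time)), family = poisson)

# Crude rates: events divided by person-time within each group
crude <- as.vector(tapply(event, group, sum) / tapply(time, group, sum))

# exp(intercept) is the rate in group A; multiplying by the rate ratios
# gives the rates in groups B and C
est   <- exp(coef(fit))
model <- unname(c(est[1], est[1] * est[-1]))

data.frame(group = levels(group), crude_rate = crude, model_rate = model)
```

The two rate columns should agree up to the convergence tolerance of `glm()`.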

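To get the estimated IR for the Lev+5FU group, we predict on the link scale with `time` set to 36525 days (100 person-years), which makes the offset equal to zero: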

```
lev_5fu <- predict(poisson_fit,
                   newdata = data.frame(rx = "Lev+5FU", time = 36525),
                   type = "link", se.fit = TRUE)
as_tibble(lev_5fu) %>%
  summarise(Treatment = "Lev+5FU",
            IR = exp(fit),
            lower = exp(fit - (1.96 * se.fit)),
            upper = exp(fit + (1.96 * se.fit)))
```

```
## # A tibble: 1 x 4
## Treatment IR lower upper
## <chr> <dbl> <dbl> <dbl>
## 1 Lev+5FU 8.22 6.88 9.80
```

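Here, the confidence interval needs to be calculated on the \(\log\) scale and then exponentiated back, so the interval is not symmetric around the estimate. It is also a Wald-type interval, which is why the lower limit (6.88) differs slightly from the exact limit obtained with `poisson.test()` above (6.83).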

When the log of the follow-up time is used as an offset, a Poisson model is actually modeling the \(\log\) incidence rate, and the exponentiated coefficients are incidence rate ratios. Therefore, we can include covariates in the model to be accounted for, such as age and sex.
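The role of the offset can be written out in one line. This is a sketch of the standard argument, where \(\lambda\) denotes the incidence rate and \(x_1, \dots, x_p\) are whatever covariates are in the model. The offset enters the linear predictor with its coefficient fixed at 1, so moving it to the left-hand side turns a model for the expected count into a model for the rate:

\[\log E[\text{Events}] = \log(\text{Time}) + \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p \quad\Longleftrightarrow\quad \log\lambda = \log\frac{E[\text{Events}]}{\text{Time}} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p\]

This is also why predicting with `time = 36525` works: 36525 days is exactly 100 person-years, so the offset becomes \(\log(1) = 0\) and the exponentiated linear predictor is the rate per 100 person-years.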

### Age- and sex-adjusted incidence rates using Poisson regression

First, we’ll do it using my `age_sex_adjust()` function:

```
source("age_sex_adjust.R")
# Usage: age_sex_adjust(data, group, age, sex, event, time)
age_sex_adjust(colon_death, rx, age, sex, status, time)
```

```
## # A tibble: 3 x 4
## rx IR lower upper
## <chr> <dbl> <dbl> <dbl>
## 1 Obs 12.2 10.2 14.7
## 2 Lev 11.8 10.1 13.8
## 3 Lev+5FU 8.25 6.99 9.73
```

Here we see that the adjusted rates are very similar to the crude rates calculated above. Since this data comes from a randomized trial, this is expected and can be taken as a sign that the randomization worked.

Now, let’s do the same thing without the ready-made function to see how it works under the hood.

```
library(stdReg)
poisson_fit <- glm(status ~ rx * I(age^2) + sex,
                   offset = log(time/365.25/100),
                   data = colon_death,
                   family = poisson)
# Standardize over the cohort with everyone's follow-up set to 100 person-years
std_fit <- stdGlm(poisson_fit, data = transform(colon_death, time = 36525), X = "rx")
std_sum <- summary(std_fit, CI.type = "log")
std_sum[["est.table"]] %>%
  as_tibble(rownames = "rx") %>%
  transmute(Treatment = rx,
            IR = Estimate,
            lower = `lower 0.95`,
            upper = `upper 0.95`)
```

```
## # A tibble: 3 x 4
## Treatment IR lower upper
## <chr> <dbl> <dbl> <dbl>
## 1 Obs 12.2 10.2 14.7
## 2 Lev 11.8 10.1 13.8
## 3 Lev+5FU 8.25 6.99 9.73
```

The numbers are identical to the ones obtained from the `age_sex_adjust()` function, which is logical since we did the same thing as the function does.
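For intuition about what the standardization step does, here is a rough base R sketch of regression standardization on simulated data. It is not the stdReg implementation, and it only reproduces point estimates (no standard errors or confidence intervals). The data frame `dat`, the helper `standardize()` and the simulated effect sizes are all hypothetical. For each treatment level, everyone in the cohort is assigned that treatment and exactly 100 person-years of follow-up, and the predicted rates are then averaged.

```r
set.seed(2)
# Simulated (hypothetical) cohort; variable names mirror the colon data
n   <- 600
dat <- data.frame(
  rx   = factor(sample(c("Obs", "Lev", "Lev+5FU"), n, replace = TRUE),
                levels = c("Obs", "Lev", "Lev+5FU")),
  age  = round(runif(n, 30, 80)),
  sex  = rbinom(n, 1, 0.5),
  time = round(runif(n, 100, 3000))  # follow-up in days
)
true_rate  <- exp(-3 + 0.02 * (dat$age - 60) + 0.2 * dat$sex -
                  0.4 * (dat$rx == "Lev+5FU"))  # events per person-year
dat$status <- rpois(n, true_rate * dat$time / 365.25)

fit <- glm(status ~ rx + age + sex + offset(log(time / 365.25 / 100)),
           family = poisson, data = dat)

# Assign everyone the same treatment and 100 person-years (36525 days),
# predict the expected count, and average over the cohort
standardize <- function(level) {
  counterfactual      <- dat
  counterfactual$rx   <- factor(level, levels = levels(dat$rx))
  counterfactual$time <- 36525
  mean(predict(fit, newdata = counterfactual, type = "response"))
}

std_rates <- sapply(levels(dat$rx), standardize)
std_rates
```

Because the age and sex distribution is held fixed at that of the whole cohort, differences between the three standardized rates reflect the treatment term rather than differences in case mix.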

A few finishing notes. Here I included age as a quadratic term and as an interaction with exposure. These are modeling decisions one has to make; the model could just as well have been a main-effects-only model such as:

\[\log\lambda = \beta_0 + \beta_1\text{rxLev} + \beta_2\text{rxLev+5FU} + \beta_3\text{age} + \beta_4\text{sex}\]
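In R, the main-effects model in the equation above could be fitted along these lines. This is a sketch assuming the same `colon` data and the same time scaling as earlier in the post; the object names `colon_death` and `main_fit` are only for the illustration.

```r
library(survival)

# Keep only the records for the death outcome, as before
colon_death <- subset(survival::colon, etype == 2)

# Main effects only: treatment, age and sex, with log person-time as offset
main_fit <- glm(status ~ rx + age + sex + offset(log(time / 365.25 / 100)),
                family = poisson, data = colon_death)

# Exponentiated intercept: rate per 100 person-years at age 0, sex = 0, rx = Obs
# Other exponentiated coefficients: incidence rate ratios
exp(coef(main_fit))
```

Note that the intercept in this model refers to a patient aged 0, which is why standardization (or centering age) is needed to get interpretable adjusted rates.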

Regarding the interaction term, a good explanation was given in the Stata forum in this post.

For anyone who wants to read more, I recommend the course material from the PhD course **Biostatistics III** at Karolinska Institutet, available here.