# Age- and sex-adjusted incidence rates

Age- and sex-adjusted incidence rates with Poisson regression in R

In this first post I’m going to present a way of obtaining age- and sex-adjusted incidence rates using Poisson regression in R. This is similar to what is done in Stata, as described here.

I’ve written an R function that’s available for download here. The script can be sourced (source("age_sex_adjust.R")) and the function age_sex_adjust() can then be used as is. The code is also described step by step below.

There are, as usual, several ways to calculate adjusted incidence rates in R. I’ve chosen to use the package stdReg by Arvid Sjölander because it has a lot of nice features and useful applications in causal inference. Specifically, we will use the function stdGlm() from stdReg to generate the adjusted incidence rates.

But first, a little background on what an incidence rate is. It is simply the number of occurrences (a count) in a population divided by the total person-time at risk. For example, if 10 people are each followed for 1 year and one of them dies, the incidence rate of death is 1 per 10 person-years. In observational data, we often have larger cohorts with varying follow-up time and censoring. The calculation is the same, using the formula below:

$\text{Incidence rate} = \frac{\text{Number of occurrences}}{\sum_{\text{Persons}}{\text{Time in study}}}$
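As a small illustration of the formula with varying follow-up time, here is a sketch using made-up numbers (the five follow-up times and outcomes below are hypothetical, not taken from any dataset):

```r
# Hypothetical follow-up data: five people followed for different lengths of time
follow_up_years <- c(2.0, 0.5, 3.0, 1.5, 3.0)
died            <- c(0,   1,   0,   0,   1)

events      <- sum(died)             # 2 deaths
person_time <- sum(follow_up_years)  # 10 person-years
rate        <- events / person_time  # 0.2 deaths per person-year

rate * 100  # 20 deaths per 100 person-years
```

Multiplying by 100 only changes the unit the rate is reported in, which is the same rescaling used for the colon data below.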

### Calculating crude incidence rate

To illustrate, we will now use the colon dataset from the survival package.

library(survival)
library(dplyr)
library(broom)

Running ?survival::colon tells us the following:

Data from one of the first successful trials of adjuvant chemotherapy for colon cancer

| Variable | Explanation |
|----------|-------------|
| id | Patient id |
| study | 1 for all patients |
| rx | Treatment: Obs(ervation), Lev(amisole), Lev(amisole)+5-FU |
| sex | 1 = male |
| age | in years |
| obstruct | obstruction of colon by tumour |
| perfor | perforation of colon |
| nodes | number of positive lymph nodes |
| time | days until event or censoring |
| status | censoring status |
| differ | tumour differentiation: 1 = well, 2 = moderate, 3 = poor |
| extent | extent of local spread: 1 = submucosa, 2 = muscle, 3 = serosa, 4 = contiguous structures |
| surg | time from surgery to registration: 0 = short, 1 = long |
| node4 | more than 4 positive lymph nodes |
| etype | event type: 1 = recurrence, 2 = death |

OK, so now that we understand the data, let’s start calculating crude incidence rates for death among the different treatment groups:

# Using the colon dataset from the survival package

# Only keep records related to the death outcome
colon_death <- colon %>% dplyr::filter(etype == 2)

# time is recorded in days: dividing by 365.25 converts it to years,
# and dividing by 100 expresses it in units of 100 person-years

colon_death %>%
  group_by(rx) %>%
  summarise(Events = sum(status),
            Time = sum(time / 365.25 / 100),
            Rate = Events / Time,
            lower = poisson.test(Events, Time)$conf.int[1],
            upper = poisson.test(Events, Time)$conf.int[2])
## # A tibble: 3 x 6
##   rx      Events  Time  Rate lower upper
##   <fct>    <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Obs        168  13.8 12.2  10.4  14.2
## 2 Lev        161  13.7 11.7  10.0  13.7
## 3 Lev+5FU    123  15.0  8.22  6.83  9.80
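The confidence limits above come from poisson.test(). As far as I can tell, it returns an exact interval built from gamma quantiles. The following is a minimal sketch of that calculation with made-up inputs (the event count and person-time are hypothetical), comparing the hand-written version with the function's result:

```r
events <- 10    # hypothetical number of events
time   <- 2.5   # hypothetical follow-up, in units of 100 person-years

# Interval as returned by poisson.test()
ci_poisson_test <- as.numeric(poisson.test(events, time)$conf.int)

# The same exact 95% interval written out with gamma quantiles
ci_by_hand <- c(qgamma(0.025, events), qgamma(0.975, events + 1)) / time

all.equal(ci_poisson_test, ci_by_hand)
```

Because this interval is not based on a normal approximation, it can differ slightly from intervals computed from a regression model's standard errors.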

Now we compare the calculated rates with those obtained from the survRate() function in the biostat3 package:

library(biostat3)
survRate(Surv(time / 365.25 / 100, status) ~ rx, data = colon_death) %>%
  dplyr::select(rx, event, tstop, rate, lower, upper) %>%
  as_tibble() %>%
  dplyr::rename(Events = event,
                Time = tstop,
                Rate = rate)
## # A tibble: 3 x 6
##   rx      Events  Time  Rate lower upper
##   <fct>    <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Obs        168  13.8 12.2  10.4  14.2
## 2 Lev        161  13.7 11.7  10.0  13.7
## 3 Lev+5FU    123  15.0  8.22  6.83  9.80

Good, the incidence rates are identical. Patients in the observation group had a mortality rate of 12.2 per 100 person-years, compared with 8.22 per 100 person-years among patients treated with Lev+5FU. Now, let’s try to reproduce these results with Poisson regression.

### Obtaining estimated incidence rates using poisson regression

Here we use the tidy() function from the broom package to obtain exponentiated estimates:

# The log of follow-up time (in 100 person-years) enters as an offset
poisson_fit <- glm(status ~ rx + offset(log(time / 365.25 / 100)),
                   family = poisson, data = colon_death)
tidy(poisson_fit, exponentiate = TRUE, conf.int = TRUE)
## # A tibble: 3 x 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)   12.2      0.0772    32.4   3.13e-230   10.4      14.1
## 2 rxLev          0.965    0.110     -0.324 7.46e-  1    0.777     1.20
## 3 rxLev+5FU      0.675    0.119     -3.32  9.16e-  4    0.534     0.850

The (Intercept) estimate here is the estimated IR for the reference level, i.e. the Obs group. The other two estimates are incidence rate ratios relative to the Obs group.
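To see why the intercept can be read as a rate, here is a minimal sketch on made-up aggregated data (the group names, event counts and person-time are hypothetical). With only a grouping factor in the model, the fitted rates reproduce the crude rates:

```r
# Hypothetical aggregated data: events and person-time in two groups
toy <- data.frame(
  group  = factor(c("control", "treated")),
  events = c(30, 18),
  time   = c(2.5, 2.4)   # in units of 100 person-years
)

fit <- glm(events ~ group + offset(log(time)), family = poisson, data = toy)

# Crude rates: 12 and 7.5 per 100 person-years
crude_rates <- toy$events / toy$time

# Exponentiated intercept = rate in the reference group (control);
# exponentiated group coefficient = rate ratio, treated vs control
rate_control <- exp(coef(fit)[["(Intercept)"]])
rate_ratio   <- exp(coef(fit)[["grouptreated"]])

# Matches crude_rates up to numerical tolerance
c(rate_control, rate_control * rate_ratio)
```

The rate in a non-reference group is the reference rate multiplied by the corresponding rate ratio.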

To get the estimated IR for the Lev+5FU group:

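# time = 36525 days is exactly 100 person-years, so the offset is log(1) = 0
# and the prediction is the log rate per 100 person-years
lev_5fu <- predict(poisson_fit,
                   newdata = data.frame(rx = "Lev+5FU", time = 36525),
                   type = "link", se.fit = TRUE)

as_tibble(lev_5fu) %>%
  summarise(Treatment = "Lev+5FU",
            IR = exp(fit),
            lower = exp(fit - (1.96 * se.fit)),
            upper = exp(fit + (1.96 * se.fit)))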
## # A tibble: 1 x 4
##   Treatment    IR lower upper
##   <chr>     <dbl> <dbl> <dbl>
## 1 Lev+5FU    8.22  6.88  9.80

Here, the confidence interval is calculated on the $$\log$$ scale and then exponentiated back, so it is not symmetric around the estimate.
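As a rough check of this, the interval can be sketched by hand. In a Poisson model with only a grouping factor, the standard error of a group's log rate works out to 1/sqrt(events). The inputs below (123 events, a rate of 8.22) are read off the rounded output above, so the result will only approximately match the table:

```r
events <- 123    # deaths in the Lev+5FU group (from the crude table)
rate   <- 8.22   # rounded rate per 100 person-years

se_log_rate <- 1 / sqrt(events)   # standard error on the log scale

# Build the interval on the log scale, then exponentiate
lower <- exp(log(rate) - 1.96 * se_log_rate)
upper <- exp(log(rate) + 1.96 * se_log_rate)

c(lower = lower, upper = upper)

# The interval reaches further above the estimate than below it
c(below = rate - lower, above = upper - rate)
```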

A Poisson model with the log of follow-up time as an offset is actually modeling $$\log$$ incidence rates, and its coefficients are $$\log$$ incidence rate ratios. Therefore, we can include covariates in the model to be accounted for, such as age and sex.
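The role of the offset can be written out in a short derivation. This is a sketch in my own notation: $$Y_i$$ is the event count for person $$i$$, $$t_i$$ the follow-up time, $$x_i$$ a single binary covariate and $$\lambda_i$$ the incidence rate. The model for the expected count is

$\log E[Y_i] = \log t_i + \beta_0 + \beta_1 x_i$

Moving the offset to the left-hand side gives a model for the rate:

$\log\lambda_i = \log\frac{E[Y_i]}{t_i} = \beta_0 + \beta_1 x_i$

so the exponentiated coefficient is an incidence rate ratio:

$\frac{\lambda(x = 1)}{\lambda(x = 0)} = e^{\beta_1}$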

### Age- and sex-adjusted incidence rates using poisson regression

First, we’ll do it using my age_sex_adjust() function:

source("age_sex_adjust.R")
# Usage: age_sex_adjust(data, group, age, sex, event, time)

age_sex_adjust(colon_death, rx, age, sex, status, time)
## # A tibble: 3 x 4
##   rx         IR lower upper
##   <chr>   <dbl> <dbl> <dbl>
## 1 Obs     12.2  10.2  14.7
## 2 Lev     11.8  10.1  13.8
## 3 Lev+5FU  8.25  6.99  9.73

Here we see that the adjusted rates are very similar to the crude rates calculated above. Since these data come from a randomized trial, this is expected: it is consistent with randomization having balanced age and sex across the treatment groups.

Now, let’s do the same thing without the ready-made function to see how it works under the hood.

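library(stdReg)  # provides stdGlm()

poisson_fit <- glm(status ~ rx * I(age^2) + sex,
                   offset = log(time / 365.25 / 100),
                   data = colon_death,
                   family = poisson)

# Setting time to 36525 days (100 person-years) makes the offset log(1) = 0,
# so the standardized predictions are rates per 100 person-years
std_fit <- stdGlm(poisson_fit, data = transform(colon_death, time = 36525), X = "rx")

std_sum <- summary(std_fit, CI.type = "log")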

std_sum[["est.table"]] %>%
  as_tibble(rownames = "rx") %>%
  transmute(Treatment = rx,
            IR = Estimate,
            lower = `lower 0.95`,
            upper = `upper 0.95`)
## # A tibble: 3 x 4
##   Treatment    IR lower upper
##   <chr>     <dbl> <dbl> <dbl>
## 1 Obs       12.2  10.2  14.7
## 2 Lev       11.8  10.1  13.8
## 3 Lev+5FU    8.25  6.99  9.73
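Conceptually, regression standardization fits a model, predicts the rate for every person as if they belonged to each group in turn, and averages those predictions. The following is a minimal sketch of that idea for the point estimates only, on simulated data. The data, variable names and true effect sizes are all made up, and this is not necessarily how stdGlm() is implemented internally:

```r
set.seed(1)
n <- 2000

# Simulated cohort: two groups, age as a covariate, follow-up in years
toy <- data.frame(
  group = factor(sample(c("A", "B"), n, replace = TRUE)),
  age   = round(runif(n, 40, 80)),
  time  = runif(n, 0.5, 5)
)

# Event counts from a Poisson model with a true rate ratio of 0.7 for B vs A
true_rate  <- exp(-3 + 0.02 * toy$age + log(0.7) * (toy$group == "B"))
toy$events <- rpois(n, true_rate * toy$time)

fit <- glm(events ~ group + age + offset(log(time)), family = poisson, data = toy)

# Give everyone each group level in turn, with exactly 1 year of follow-up
# so the offset is log(1) = 0, then average the predicted rates
standardized_rate <- sapply(levels(toy$group), function(g) {
  counterfactual       <- toy
  counterfactual$group <- factor(g, levels = levels(toy$group))
  counterfactual$time  <- 1
  mean(predict(fit, newdata = counterfactual, type = "response"))
})

standardized_rate  # events per person-year, standardized to the cohort's age distribution
```

stdGlm() additionally provides standard errors for such averages, which is what the confidence intervals in the table above rely on.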

The estimates from stdGlm() are identical to the ones obtained from the age_sex_adjust() function, which is expected since we did exactly what the function does.

A few finishing notes. Here I included age as a quadratic term and as an interaction with the exposure (treatment). These are modeling decisions one has to make; the model could instead have been a main-effects-only model such as:

$\log\lambda = \beta_0 + \beta_1\text{rxLev} + \beta_2\text{rxLev+5FU} + \beta_3\text{age} + \beta_4\text{sex}$

Regarding the interaction term, a good explanation was given in the Stata forum in this post.
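For comparison, the main-effects model in the formula above could be fitted along these lines. This is a sketch that assumes only base R and the survival package; the object name main_fit is mine:

```r
library(survival)  # provides the colon dataset

# Keep only the records for the death outcome
colon_death <- subset(colon, etype == 2)

# Main effects only: treatment, age and sex, with log person-time as offset
main_fit <- glm(status ~ rx + age + sex + offset(log(time / 365.25 / 100)),
                family = poisson, data = colon_death)

# Exponentiated coefficients: the rx terms are incidence rate ratios versus Obs,
# adjusted for age and sex
exp(coef(main_fit))
```

The fitted object can be passed to stdGlm() in the same way as the interaction model to obtain standardized rates.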

For anyone who wants to read more, I recommend the course material from the PhD course Biostatistics III at Karolinska Institutet, available here.

##### Michael Dismorr, MD
###### Medical doctor, PhD Candidate in cardiothoracic surgery

My research interests include epidemiology, survival analysis, and R programming.