## Generalised linear models, part 2: Math behind the logit link

Recall that *categorical* outcomes are ones that can only take fixed values, also known as events. Research questions on *binary or dichotomous* categorical outcomes (e.g. yes/no, alive/dead) investigate how exposure to a variable changes outcomes that are events. We explored these ideas in the last post on generalised linear models, part 1.

The *odds ratio* is a useful statistic to measure how exposures change events. Importantly, we can use generalised linear models and the logit link function to calculate the *ratio of the odds of an event in those exposed, to the odds of an event in those unexposed*. This is very nifty! Let’s see how it works.

### Understanding odds and odds ratios

A general form of a table of exposures and events looks like this:

| | Exposed | Unexposed |
|---|---|---|
| Event | a | b |
| Non-event | c | d |

where a, b, c and d are just numbers of things measured (people/animals/cells etc). So, this is a table of counts or frequencies. It is similar to the toy example from our last post on how sunniness affects ice cream consumption.

These counts can be used to calculate risks and odds. For now, we will focus on odds. For each exposure category, the odds of an event are the number of events divided by the number of non-events:

- odds of an event in the exposed $= a/c$
- odds of an event in the unexposed $= b/d$

So, the ratio of the odds of an event in those exposed, to the odds of an event in those unexposed, i.e. the odds ratio, is:

$$\text{odds ratio} = \frac{a/c}{b/d} = \frac{ad}{bc}$$

Here, event outcomes are binary categorical. But so are exposures. That is, the units of the things measured (people/animals/cells) are either exposed or not exposed. In real life, exposures can be experimentally manipulated (e.g. treatment or control interventions in randomised controlled trials) or observed (e.g. smoking exposures in lung cancer events).

So we can think of exposure as a binary categorical *predictor* (in the modeling sense of the word, not in the prophecy sense). In fact, we are interested in comparing frequencies of events between exposed and unexposed groups. If the exposed group is *dummy coded* as 1 and the unexposed group is dummy coded as 0, then a between-group difference is just the difference in events for a 1-unit increase in the exposure predictor.

If the odds of an event are equal in those exposed and those unexposed, the ratio of these two odds is 1. This is analogous to continuous outcomes: if mean outcomes in treated and control groups are equal, the between-group mean difference is 0. We use ratios to compare frequencies of events between two groups because events are categorical and can only take fixed values. But a useful thing about generalised linear models and link functions is that predictors can be categorical or continuous; we’re not tied to analysing only one type of predictor.
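As a minimal sketch of this arithmetic, the Python snippet below computes the two odds and their ratio from a table of counts laid out like the one above. The counts are made up purely for illustration, and the function name `odds_ratio` is just a convenient label, not something from a particular library.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table of counts.

    a: exposed with event,     b: unexposed with event
    c: exposed without event,  d: unexposed without event
    """
    odds_exposed = a / c      # events relative to non-events among the exposed
    odds_unexposed = b / d    # events relative to non-events among the unexposed
    return odds_exposed / odds_unexposed


# Hypothetical counts, chosen only to illustrate the calculation.
print(odds_ratio(a=30, b=10, c=70, d=90))   # odds are higher in the exposed, so the ratio is above 1
print(odds_ratio(a=20, b=20, c=80, d=80))   # identical odds in both groups, so the ratio is 1
```

The second call shows the "no difference" case: when the odds are the same in both groups, the odds ratio is 1.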

### Examples of research questions that might use generalised linear models

For example, we might want to know how smoking increases the risk of lung cancer. Here, the outcome cancer is binary categorical: people either do or do not get cancer. To keep things simple, exposure to smoking could be treated as a binary categorical predictor: people either were or were not exposed to cigarette smoke. We could use a generalised linear model to determine the ratio of the odds of cancer in those exposed to smoking, to the odds of cancer in those not exposed to smoking. This gives the effect of a "1-unit increase in smoking exposure" (i.e. smoking vs not smoking) on risk of cancer.

But suppose we want to know how age increases the risk of lung cancer. The outcome cancer is still binary categorical. But age should be treated as a continuous predictor. If we think the effect of a 1-year increase in age on risk of cancer is meaningful, a generalised linear model could be used to determine the ratio of the odds of cancer in those older by a year, to the odds of cancer in those younger by a year. This gives the effect of a 1-year increase in age on risk of cancer.
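To make "the ratio of the odds for a 1-year increase in age" concrete, here is a small sketch. It assumes the log odds of cancer change linearly with age, and the intercept and slope values are invented for illustration (they are not estimates from any data set).

```python
import math

# Hypothetical coefficients on the log-odds scale (not estimated from data).
intercept = -6.0   # assumed log odds of cancer at age 0
slope = 0.05       # assumed change in log odds per 1-year increase in age


def odds_at(age):
    """Odds of the event at a given age under the assumed model."""
    return math.exp(intercept + slope * age)


# The ratio of the odds for people one year apart is the same at every age.
for age in (40, 60, 80):
    print(age, odds_at(age + 1) / odds_at(age))

# The same number, obtained directly from the slope.
print(math.exp(slope))
```

Under these assumptions, the odds ratio for a 1-year increase does not depend on the starting age, which is why a single number can summarise the effect of a continuous predictor.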

### How the logit link works

The logit link converts events (specifically, the *probability of events*) into a *log of the odds of events* that is connected to a linear combination of exposures. We can use the logit link in a generalised linear model to calculate the ratio of the odds of an event in those exposed, to the odds of an event in those unexposed, i.e. the odds ratio.

Recall that odds and probabilities are related like so:

$$\text{odds} = \frac{p}{1 - p}, \qquad p = \frac{\text{odds}}{1 + \text{odds}}$$

The logit link function converts the probability of events into a new outcome that is connected to a linear combination of predictors. The inverse of the logit link back-converts the linear combination of predictors into the original probability:

$$f(p) = a + bx \qquad (1)$$

$$p = f^{-1}(a + bx) \qquad (2)$$

where,

- $f$ is the link function, and $f^{-1}$ is its inverse
- $p$, the probability of events, is the outcome
- $a + bx$ is the linear predictor

The logit link function is just the log of the odds. That is, to take the logit of a probability: convert the probability to an odds, and log the odds. The log odds is then equal to a linear predictor, which is a predictor modeled as a straight line defined by an intercept *a* and slope *b* (these are model parameters, not the cell counts a and b in the table above).

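We calculate odds from probabilities based on how they are related. So when $f$ is the logit link,

$$\operatorname{logit}(p) = \log\left(\frac{p}{1 - p}\right) = a + bx \qquad (3)$$

$$p = \operatorname{logit}^{-1}(a + bx) = \frac{\exp(a + bx)}{1 + \exp(a + bx)} \qquad (4)$$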

In (4), the inverse of the logit function, the *logistic*, back-converts the linear predictor values into non-linear probability. This is done by *exponentiating* the linear predictor (Properties of natural logs, rule 1).
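The link and its inverse can be sketched in a few lines of Python. The function names `logit` and `logistic` are chosen here for readability, and the probabilities in the loop are arbitrary example values.

```python
import math


def logit(p):
    """Link function: probability -> log odds."""
    return math.log(p / (1 - p))


def logistic(x):
    """Inverse link: log odds (the linear predictor) -> probability."""
    return math.exp(x) / (1 + math.exp(x))


for p in (0.1, 0.5, 0.9):
    log_odds = logit(p)
    # Converting back recovers the original probability (up to rounding).
    print(p, log_odds, logistic(log_odds))
```

Note that the log odds can be any real number, negative or positive, whereas the logistic always returns a value between 0 and 1. This is what lets a straight line on the log-odds scale map back to valid probabilities.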

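When exposure is defined such that the exposed group is coded as 1 and the unexposed group is coded as 0, the difference between the log odds of events in exposed and unexposed groups corresponds to a 1-unit increase in the exposure predictor $x$. But recall that the difference between the log of two numbers is also the log of their ratio (Properties of natural logs, rule 4). So, for a 1-unit increase in the exposure predictor, we can express the difference in log odds as:

$$\log(\text{odds}_{\text{exposed}}) - \log(\text{odds}_{\text{unexposed}}) = (a + b \cdot 1) - (a + b \cdot 0) = b \qquad (5)$$

$$\log\left(\frac{\text{odds}_{\text{exposed}}}{\text{odds}_{\text{unexposed}}}\right) = b \qquad (6)$$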

Exponentiating both left and right sides of (6),

$$\frac{\text{odds}_{\text{exposed}}}{\text{odds}_{\text{unexposed}}} = e^{b} \qquad (7)$$

we get $e^{b}$ which (ta-da!) is the odds ratio, i.e. the ratio of the odds of an event in those exposed, to the odds of an event in those unexposed.
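Here is a quick numerical check of this result, as a sketch only. It uses made-up counts and assumes a model with a single 0/1 exposure predictor, in which case the intercept and slope can be read straight off the observed log odds in each group rather than estimated by fitting software.

```python
import math

# Hypothetical 2x2 counts (exposed/unexposed by event/non-event).
exposed_event, exposed_nonevent = 30, 70
unexposed_event, unexposed_nonevent = 10, 90

log_odds_exposed = math.log(exposed_event / exposed_nonevent)
log_odds_unexposed = math.log(unexposed_event / unexposed_nonevent)

# With exposure coded 0/1, the linear predictor is intercept + slope * exposure,
# so the intercept is the log odds in the unexposed group and the slope is the
# difference in log odds between the two groups.
intercept = log_odds_unexposed
slope = log_odds_exposed - log_odds_unexposed

odds_ratio_from_slope = math.exp(slope)
odds_ratio_from_counts = (exposed_event * unexposed_nonevent) / (
    exposed_nonevent * unexposed_event
)

# The two routes agree up to floating-point rounding.
print(odds_ratio_from_slope, odds_ratio_from_counts)
```

So exponentiating the slope on the log-odds scale gives the same odds ratio as the direct calculation from the table of counts.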

### References

McElreath R (2020) Statistical Rethinking: A Bayesian course with examples in R and Stan (2nd Ed) Florida, USA: Chapman and Hall/CRC, p 316-318.

Rabe-Hesketh S and Skrondal A (2008) Multilevel and longitudinal modeling using Stata (2nd Ed) Texas, USA: StataCorp LP, p 232-233.