library(tidyverse)  # for tidy coding
library(brms)       # for fitting stan models
library(patchwork)  # for combining plots
library(ggdag)      # drawing dags
library(tidybayes)  # for bayesian data visualisation
library(bayesplot)  # more bayes data vis
library(MetBrewer)  # colours
library(ggrepel)    # for nice text on ggplots
library(loo)        # for information criteria
\(~\)
Lecture 1: The Golem of Prague
\(~\)
Causal inference: the attempt to understand the scientific causal model using data.
It is prediction - Knowing a cause means being able to predict the consequences of an intervention
It is a kind of imputation - Knowing a cause means being able to construct unobserved counterfactual outcomes
"What if I do this?" types of questions.
Causal inference is intimately related to the description of populations and the design of research questions. This is because all three depend upon causal models/knowledge.
Describing populations depends upon causal inference because description nearly always relies on a sample of the population, which you are using to predict population characteristics.
\(~\)
DAGs (Directed acyclic graphs)
\(~\)
Heuristic causal models that clarify scientific thinking.
We can use these to produce appropriate statistical models.
These help us think about the science before we think about the data.
Helps with questions like “what variables should we include in an analysis?”
An integral part of the course that will come up over and over again.
An example:
Let's make a function to speed up the DAG-making process. We'll use this a lot.
Code
# Let's make a function to speed up the DAG making process. We'll use this a lot
gg_simple_dag <- function(d) {
  d %>%
    ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
    geom_dag_point(color = met.brewer("Hiroshige")[4]) +
    geom_dag_text(color = met.brewer("Hiroshige")[7]) +
    geom_dag_edges() +
    theme_dag()
}

dagify(
  X ~ A + C,
  C ~ A + B,
  Y ~ X + C + B,
  coords = tibble(name = c("A", "C", "B", "X", "Y"),
                  x = c(1, 2, 3, 1, 3),
                  y = c(2, 2, 2, 1, 1))
) %>%
  gg_simple_dag()
\(~\)
Golems
\(~\)
Statistical models are akin to Golems - clay robots brought to life by magic to follow specific tasks. The trouble is, they follow tasks extremely literally and are blind to their creators' intent, so even those born out of the purest of intentions can cause great harm. Models are another form of robot; if we apply models within the wrong context they will not help us at all.
Ecological models are rarely designed to falsify null hypotheses. This is because there are many possible non-null models, or, put another way, there are no true null models in many systems. This is problematic because null models underlie how the majority of biologists do statistics!
A more appropriate question is to ask how multiple process models that we can identify are different.
We should falsify the explanatory model, not the null model. Make predictions and try to falsify those (Karl Popper).
\(~\)
Owls
\(~\)
Statistical modelling explanations often provide a brief introduction, then leave out many of the important details, e.g. step 1: draw two circles, step 2: draw the rest of the fking owl.
We shall have a specific workflow for fitting models that we will document
Drawing the owl, or the scientific workflow can be broken down into 5 steps:
Theoretical estimand: what are you trying to do in your study in the first place?
Some scientific causal model should be identified: step 1 will be precisely defined in the context of a causal model.
Use 1 and 2 to build a statistical model.
Simulate the scientific causal model to validate that the statistical model from step 3 yields the theoretical estimand, i.e. check that our stats model works.
Analyse the real data. Note that the real data are only introduced now.
\(~\)
Lecture 2: The garden of forking data
\(~\)
Globe tossing
To estimate the proportion of water on planet earth, we collect data by tossing a globe 10 times and record whether our left index finger lands on land or water.
# A tibble: 10 × 1
toss
<chr>
1 l
2 w
3 l
4 l
5 w
6 w
7 w
8 l
9 w
10 w
Remember the owl workflow:
Define generative model of sample.
Design estimand.
Use 1 and 2 to build a statistical model.
Test (3) using (1).
Analyse sample and summarise.
\(~\)
Step 1
Some true proportion of water, \(p\).
We can measure this indirectly using:
\(N\): the number of globe tosses
\(W\): the number of water observations
\(L\): the number of land observations
Code
# Put dag here
# N affects L and W but not the other way around
# P also influences W and L
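A hedged sketch of what that DAG might look like, reusing the gg_simple_dag() helper from above (the layout coordinates are my own choice):

dagify(W ~ N + p,
       L ~ N + p,
       coords = tibble(name = c("p", "N", "W", "L"),
                       x = c(1, 3, 1.5, 2.5),
                       y = c(2, 2, 1, 1))) %>%
  gg_simple_dag()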
Bayesian data analysis
For each possible explanation of the data, count all the ways data can happen. Explanations with more ways to produce the data are more plausible (this is entropy).
Toss globe, probability \(p\) of observing \(W\), \(1 - p\) of \(L\).
Each toss is random because it is chaotic (there are forces that could be theoretically measured i.e. velocity, exact starting orientation, but we are not equipped to measure these in real time, so the process appears random).
Each toss is independent of one another.
What are the relative number of ways we could get the data we actually got, given the process that generates all the possible datasets that could’ve been born from the process?
\(~\)
A 4 sided globe (dice)
In Bayesian analysis: enumerate all possible outcomes.
Code
# code land as 0 and water as 1 and create possibility data
# create the dataframe (tibble)
d <- tibble(p_1 = 0,
            p_2 = rep(1:0, times = c(1, 3)),
            p_3 = rep(1:0, times = c(2, 2)),
            p_4 = rep(1:0, times = c(3, 1)),
            p_5 = 1)

d %>%
  gather() %>%
  mutate(x = rep(1:4, times = 5),
         possibility = rep(1:5, each = 4)) %>%
  ggplot(aes(x = x, y = possibility, fill = value %>% as.character())) +
  geom_point(shape = 21, size = 9) +
  scale_fill_manual(values = c("white", "navy")) +
  scale_x_continuous(NULL, breaks = NULL) +
  coord_cartesian(xlim = c(.75, 4.25),
                  ylim = c(.75, 5.25)) +
  theme_minimal() +
  theme(legend.position = "none",
        text = element_text(size = 18))
The data: the die is rolled three times: water, land, water.
Consider all possible answers and how they would occur.
e.g. if we hypothesise that 25% of the earth is covered by water, what are all the possible ways to produce our sample?
Code
# create a tibble of all the possibilities per marble draw, with columns position (where to appear on later figures x axis), draw (what number draw is it? and how many for each number? where to appear on figures y axis) and fill (colour of ball for figure)
d <- tibble(position = c((1:4^1) / 4^0, (1:4^2) / 4^1, (1:4^3) / 4^2),
            draw     = rep(1:3, times = c(4^1, 4^2, 4^3)),
            fill     = rep(c("b", "w"), times = c(1, 3)) %>%
              rep(., times = c(4^0 + 4^1 + 4^2)))

# we wish to make a path diagram, for which we will need connecting lines, create two more tibbles for these
lines_1 <- tibble(x    = rep((1:4), each = 4),
                  xend = ((1:4^2) / 4),
                  y    = 1,
                  yend = 2)

lines_2 <- tibble(x    = rep(((1:4^2) / 4), each = 4),
                  xend = (1:4^3) / (4^2),
                  y    = 2,
                  yend = 3)

# We’ve generated the values for position (i.e., the x-axis), in such a way that they’re all justified to the right, so to speak. But we’d like to center them. For draw == 1, we’ll need to subtract 0.5 from each. For draw == 2, we need to reduce the scale by a factor of 4 and we’ll then need to reduce the scale by another factor of 4 for draw == 3. The ifelse() function will be of use for that.
d <- d %>%
  mutate(denominator = ifelse(draw == 1, .5,
                              ifelse(draw == 2, .5 / 4, .5 / 4^2))) %>%
  mutate(position = position - denominator)

lines_1 <- lines_1 %>%
  mutate(x    = x - .5,
         xend = xend - .5 / 4^1)

lines_2 <- lines_2 %>%
  mutate(x    = x - .5 / 4^1,
         xend = xend - .5 / 4^2)

# create the plot, using geom_segment to add the lines - note coord_polar() which gives the plot a globe-like effect. scale_x_continuous and the y equivalent have been used to remove axis labels and titles
d %>%
  ggplot(aes(x = position, y = draw)) +
  geom_segment(data = lines_1,
               aes(x = x, xend = xend,
                   y = y, yend = yend),
               size = 1/3) +
  geom_segment(data = lines_2,
               aes(x = x, xend = xend,
                   y = y, yend = yend),
               size = 1/3) +
  geom_point(aes(fill = fill),
             shape = 21, size = 4) +
  scale_fill_manual(values = c("navy", "white")) +
  scale_x_continuous(NULL, limits = c(0, 4), breaks = NULL) +
  scale_y_continuous(NULL, limits = c(0.75, 3), breaks = NULL) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        legend.position = "none") +
  coord_polar()
Prune the number of possible outcomes down to those that are consistent with the data.
e.g. there are 3 paths consistent with our dice rolling experiment for the given data and the given hypothesis.
Code
lines_1 <- lines_1 %>%
  mutate(remain = c(rep(0:1, times = c(1, 3)),
                    rep(0, times = 4 * 3)))

lines_2 <- lines_2 %>%
  mutate(remain = c(rep(0, times = 4),
                    rep(1:0, times = c(1, 3)) %>% rep(., times = 3),
                    rep(0, times = 12 * 4)))

d <- d %>%
  mutate(remain = c(rep(1:0, times = c(1, 3)),
                    rep(0:1, times = c(1, 3)),
                    rep(0, times = 4 * 4),
                    rep(1:0, times = c(1, 3)) %>% rep(., times = 3),
                    rep(0, times = 12 * 4)))

# finally, the plot:
d %>%
  ggplot(aes(x = position, y = draw)) +
  geom_segment(data = lines_1,
               aes(x = x, xend = xend,
                   y = y, yend = yend,
                   alpha = remain %>% as.character()),
               size = 1/3) +
  geom_segment(data = lines_2,
               aes(x = x, xend = xend,
                   y = y, yend = yend,
                   alpha = remain %>% as.character()),
               size = 1/3) +
  geom_point(aes(fill = fill, alpha = remain %>% as.character()),
             shape = 21, size = 4) +
  # it's the alpha parameter that makes elements semitransparent
  scale_alpha_manual(values = c(1/10, 1)) +
  scale_fill_manual(values = c("navy", "white")) +
  scale_x_continuous(NULL, limits = c(0, 4), breaks = NULL) +
  scale_y_continuous(NULL, limits = c(0.75, 3), breaks = NULL) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        legend.position = "none") +
  coord_polar()
Are 3 ways to produce the sample many or few? To find out, compare with the other possibilities.
e.g. let's make another conjecture - all land or all water. We have both a land and a water observation, so there are zero ways that the all-land/all-water hypotheses are consistent with the data.
Code
# if we make two custom functions, here, it will simplify the code within `mutate()`, below
n_water <- function(x) {
  rowSums(x == "W")
}

n_land <- function(x) {
  rowSums(x == "L")
}

t <-
  # for the first four columns, `p_` indexes position
  tibble(p_1 = rep(c("L", "W"), times = c(1, 4)),
         p_2 = rep(c("L", "W"), times = c(2, 3)),
         p_3 = rep(c("L", "W"), times = c(3, 2)),
         p_4 = rep(c("L", "W"), times = c(4, 1))) %>%
  mutate(`roll 1: water` = n_water(.),
         `roll 2: land`  = n_land(.),
         `roll 3: water` = n_water(.)) %>%
  mutate(`ways to produce` = `roll 1: water` * `roll 2: land` * `roll 3: water`)

t %>%
  knitr::kable()
p_1  p_2  p_3  p_4  roll 1: water  roll 2: land  roll 3: water  ways to produce
L    L    L    L    0              4             0              0
W    L    L    L    1              3             1              3
W    W    L    L    2              2             2              8
W    W    W    L    3              1             3              9
W    W    W    W    4              0             4              0
Unglamorous basis of applied probability: Things that can happen more ways are more plausible.
This is Bayesian inference. Given a set of assumptions (hypotheses), the relative number of ways the data could have occurred according to those assumptions is the posterior probability distribution.
But we don't really have enough evidence to be confident in a prediction, so let's roll the die again. In the Bayesian world a process called Bayesian updating exists, which saves you from running all the draws again in favour of just updating the previous counts (or data). Bayesian updating is simple multiplication: e.g. we roll another water; for the 3-water hypothesis there are 3 paths for this to occur, so we multiply 9 x 3, resulting in 27 possible paths consistent with the new dataset.
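A quick sketch of that updating arithmetic in R (the counts are taken from the table above; this is my own illustration, not code from the lecture):

# ways each conjecture (0, 1, 2, 3 or 4 water sides) produced the first
# three observations W, L, W (from the table above)
prior_ways <- c(0, 3, 8, 9, 0)

# ways each conjecture can produce one more W observation
new_ways <- c(0, 1, 2, 3, 4)

# Bayesian updating: multiply the old counts by the new counts
prior_ways * new_ways
# [1]  0  3 16 27  0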
Eventually though, the garden gets really big. This is where your computer comes in and it starts to make more sense to work with probabilities rather than counts.
sim_globe <- function(p = 0.7, N = 9) {
  sample(c("W", "L"), size = N, prob = c(p, 1 - p), replace = TRUE)
}

# W and L are the possible observations
# N is number of tosses
# prob is the probabilities of water and land occurring
The simulation does this:
Code
sim_globe()
[1] "W" "W" "W" "W" "L" "W" "W" "W" "W"
Now let's test it on extreme settings.
e.g. all water
Code
sim_globe(p=1, N =11)
[1] "W" "W" "W" "W" "W" "W" "W" "W" "W" "W" "W"
Looks good
We can also test how close the proportion of water produced in the simulation is to the specified p
Code
sum(sim_globe(p=0.5, N =1e4) =="W") /1e4
[1] 0.4986
Also looks good.
So based upon our generative model:
Ways for \(p\) to produce \(W, L = (4p)^W \times (4 - 4p)^L\)
Code
# function to compute posterior distribution
compute_posterior <- function(the_sample, poss = c(0, 0.25, 0.5, 0.75, 1)) {
  W <- sum(the_sample == "W")  # the number of W observed
  L <- sum(the_sample == "L")  # the number of L observed
  ways <- sapply(poss, function(q) (q * 4)^W * ((1 - q) * 4)^L)
  post <- ways / sum(ways)
  # bars <- sapply(post, function(q) make_bar(q))
  tibble(poss, ways, post = round(post, 3))  # , bars)
}
We can then simulate the experiment many times with sim_globe
This allows us to check that our model is doing what we think it is.
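A minimal sketch of such a check, assuming only the sim_globe() and compute_posterior() functions defined above: simulate from a known \(p\) and confirm the posterior piles up on the right value as \(N\) grows.

set.seed(42)
compute_posterior(sim_globe(p = 0.5, N = 10))
compute_posterior(sim_globe(p = 0.5, N = 100))
# with N = 100, nearly all of the posterior mass should sit on poss = 0.5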
\(~\)
An infinite number of possibilities
Globes aren’t dice. There are an infinite number of possible proportions of the earth covered in water.
To get actual infinity we can turn to math.
The relative number of ways that any value of \(p\) could produce any sample of W and L observations is given by a well-known distribution, the binomial distribution:
\[p^W(1-p)^L \]
The only trick is normalising to make it a probability. A little calculus is needed (normalising over all values of \(p\) produces the beta distribution). For the counts, the binomial sampling formula is:

\[\Pr(W, L \mid p) = \frac{(W + L)!}{W!L!}p^W(1-p)^L \]

Note the normalising constant \(\frac{(W + L)!}{W!L!}\)
We can use the binomial sampling formula to give us the number of paths through the garden of forking data for this particular problem. That is, given some value of \(p\) (akin to some number of water sides on the die), the number of ways to see W and L can be worked out using the expression above.
We can use R to calculate this, assuming 6 globe tosses that landed on water out of 10 tosses, and that \(p = 0.7\):
Code
dbinom(6, 10, 0.7)
[1] 0.2001209
This is the relative number of ways that 6 out of 10 can happen given a value of 0.7 for \(p\). For this to be meaningful, we need to work out a probability value for many other values of \(p\). This gives our probability distribution.
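A hedged sketch of that idea: evaluate the same binomial count over a grid of candidate values of \(p\) (the grid spacing is my own choice).

p_grid <- seq(from = 0, to = 1, by = 0.1)

tibble(p = p_grid,
       relative_ways = dbinom(6, size = 10, prob = p_grid))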
Let's plot how Bayesian updating works as we add observations.
Code
# add the cumulative number of trials and successes for water to the dataframe.
globe_toss_data <- globe_toss_data %>%
  mutate(n_trials  = 1:10,
         n_success = cumsum(toss == "w"))

# Struggling to follow this code, awesome figure produced though
sequence_length <- 50

globe_toss_data %>%
  expand(nesting(n_trials, toss, n_success),
         p_water = seq(from = 0, to = 1, length.out = sequence_length)) %>%
  group_by(p_water) %>%
  mutate(lagged_n_trials  = lag(n_trials, k = 1),
         lagged_n_success = lag(n_success, k = 1)) %>%
  ungroup() %>%
  mutate(prior = ifelse(n_trials == 1, .5,
                        dbinom(x    = lagged_n_success,
                               size = lagged_n_trials,
                               prob = p_water)),
         likelihood = dbinom(x    = n_success,
                             size = n_trials,
                             prob = p_water),
         strip = str_c("n = ", n_trials)) %>%
  # the next three lines allow us to normalize the prior and the likelihood,
  # putting them both in a probability metric
  group_by(n_trials) %>%
  mutate(prior      = prior / sum(prior),
         likelihood = likelihood / sum(likelihood)) %>%
  # plot time
  ggplot(aes(x = p_water)) +
  geom_line(aes(y = prior), linetype = 2) +
  geom_line(aes(y = likelihood)) +
  scale_x_continuous("proportion water", breaks = c(0, 0.5, 1)) +
  scale_y_continuous("plausibility", breaks = NULL) +
  theme(panel.grid = element_blank()) +
  facet_wrap(~ n_trials, scales = "free_y")
Posterior is continually updated as data points are added.
Each posterior is simply a multiplication of a set of diagonal lines, like that shown in the first panel. Whether the slope of the line is pos or neg depends on whether the observation is land or water.
Every posterior is a prior for the next observation, but the data order doesn’t make a difference, in that the posterior will always be the same for a given dataset. BUT this assumption can be violated if each observation is not independent.
Sample size is embodied within the shape of the posterior. Already been accounted for.
Some big things to grasp about bayes
No minimum sample size, because we have something called a prior. Note that estimation sucks with small samples, but that’s good! The model doesn’t draw too much from too little.
Shape of distribution embodies sample size
There are no point estimates e.g. means or medians. Everything is a distribution. You will never need to bootstrap again.
There is no one true interval - they just communicate the shape of the posterior distribution. No magical number e.g. 95%.
\(~\)
From posterior to prediction (brms time)
Here is a nice point to introduce brms, the main modelling package that I use.
Let’s fit a globe tossing model, where we observe 24 water observations from 36 tosses.
Code
b1.1 <- brm(data = list(w = 24),
            family = binomial(link = "identity"),
            w | trials(36) ~ 0 + Intercept,
            # prior(beta(1, 1), class = b, lb = 0, ub = 1),
            seed = 2,
            file = "fits/b02.01")

b1.1
Family: binomial
Links: mu = identity
Formula: w | trials(36) ~ 0 + Intercept
Data: list(w = 24) (Number of observations: 1)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.66 0.08 0.51 0.80 1.00 1681 1947
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
There’s a lot going on in that output. For now, focus on the ‘Intercept’ line. The intercept of a typical regression model with no predictors is the same as its mean. In the special case of a model using the binomial likelihood, the mean is the probability of a 1 in a given trial.
To use the posterior, we can sample from it using the as_draws_df function.
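The step that creates f isn't shown in these notes; presumably it is something along these lines (an assumption on my part), pulling the posterior draws of the intercept and renaming them proportion:

f <- as_draws_df(b1.1) %>%
  transmute(proportion = b_Intercept)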
f %>%
  ggplot(aes(x = proportion)) +
  geom_density(fill = "grey50", color = "grey50") +
  annotate(geom = "text", x = .08, y = 2.5,
           label = "Posterior probability") +
  scale_x_continuous("probability of water",
                     breaks = c(0, .5, 1),
                     limits = 0:1) +
  scale_y_continuous(NULL, breaks = NULL) +
  theme(panel.grid = element_blank())
Strikingly similar (well, exactly the same).
We can use this distribution of probabilities to predict histograms of water counts.
Code
# the simulation
set.seed(3)  # so we get the same result every time

f <- f %>%
  mutate(w = rbinom(n(), size = 36, prob = proportion))

# the plot
f %>%
  ggplot(aes(x = w)) +
  geom_histogram(binwidth = 1, center = 0,
                 color = "grey92", size = 1/10) +
  scale_x_continuous("number of water samples", breaks = 0:40 * 4) +
  scale_y_continuous(NULL, breaks = NULL, limits = c(0, 450)) +
  ggtitle("Posterior predictive distribution") +
  coord_cartesian(xlim = c(8, 36)) +
  theme_minimal()
\(~\)
Lecture 3: Geocentric models
\(~\)
It is entirely possible for a completely false model to make very accurate predictions.
Geocentrism is McElreath's example: the idea that the planets orbit the earth. The model behind it can very accurately predict where Mars will be in the night sky, including when it will be in retrograde (appearing to orbit in the opposite direction).
However, just because a model allows us to make accurate predictions does not mean it is correct. We now know that the planets orbit the sun, and this more correct model can also explain Mars's apparent retrograde movement.
The relevance of this anecdote is that prediction without explanation is fraught with danger, but at the same time models like the geocentric one can be very helpful.
Linear regression is similar.
They are often descriptively accurate but mechanistically wrong.
Few things in nature are perfectly linear as is assumed.
They are useful but we must remember their limits
\(~\)
What are linear models
\(~\)
Simple statistical golems that model the mean and the variance of a variable. That’s it.
The mean is a weighted sum of other variables. As those variables change, the mean and variance change.
ANOVAs, ANCOVAs and t-tests are all linear models.
The normal distribution
Counts up all the ways the observations can happen given a set of assumptions.
Given some mean and variance the normal distribution gives you the relative number of ways the data can appear.
The normal distribution is the norm because:
It is very common in nature.
It’s easy to calculate
Very conservative assumptions (spreads probability out more than any other distribution, reducing the risk of mistake, at the expense of accuracy)
To understand why the normal distribution is so common consider a soccer pitch mid-line, and individuals flipping coins that dictate whether they should step to the left or the right.
Many individuals remain close to the line, while a few move towards the extremes. Simply, there are more ways to get a difference of 0 than there are to get any other result. This creates a bell curve and summarises the process with a mean and variance. The path of one individual is shaded black in the figure below.
Code
# let's simulate this scenario
set.seed(4)

pos <-
  replicate(100, runif(16, -1, 1)) %>%  # this is the sim
  as_tibble() %>%
  rbind(0, .) %>%                       # add a row of zeros above simulation results
  mutate(step = 0:16) %>%               # creates a step column
  gather(key, value, -step) %>%         # convert data to long format
  mutate(person = rep(1:100, each = 17)) %>%  # person IDs added
  # the next lines allow us to make cumulative sums within each person
  group_by(person) %>%
  mutate(position = cumsum(value)) %>%
  ungroup()                             # allows more data manipulation

ggplot(data = pos,
       aes(x = step, y = position, group = person)) +
  geom_vline(xintercept = c(4, 8, 16), linetype = 2) +
  geom_line(aes(colour = person < 2, alpha = person < 2)) +
  scale_colour_manual(values = c("skyblue3", "black")) +
  scale_alpha_manual(values = c(1/5, 1)) +
  scale_x_continuous("step number", breaks = c(0, 4, 8, 12, 16)) +
  theme_minimal() +
  theme(legend.position = "none")
1. In this lecture we are going to focus on describing the association between height and weight in adults.
2. Scientific model: how does height affect weight?
Height influences weight, but not the other way around. Height is causal.
\(W = f(H)\) this means that weight is some function of height.
Code
dagify(W ~ H + U,
       coords = tibble(name = c("H", "W", "U"),
                       x = c(1, 2, 3),
                       y = c(1, 1, 1))) %>%
  gg_simple_dag()
Note the \(U\) in the DAG - an unobserved influence on weight.
Build a generative model
\(W = \beta H + U\)
Lets simulate the data
Code
sim_weight <- function(H, b, sd) {
  U <- rnorm(length(H), 0, sd)
  W <- b * H + U
  return(W)
}
Run the simulation and plot. We’ll need values for heights, a \(\beta\) value and some value of sd for weight in kgs
Code
sim_data <- tibble(H = runif(200, min = 130, max = 170)) %>%
  mutate(W = sim_weight(H, b = 0.5, sd = 5))

# plot
sim_data %>%
  ggplot(aes(x = H, y = W)) +
  geom_point(color = "salmon", shape = 1.5, stroke = 2.5, size = 3, alpha = 0.9) +
  theme_classic() +
  labs(x = "Height (cm)", y = "Weight (kgs)") +
  theme(text = element_text(size = 16))
Describing our model
\(W_{i} = \beta H_{i} + U_i\) is our equation for expected weight
\(U_{i} \sim Normal(0,\sigma)\) is the Gaussian error with sd \(\sigma\)
\(H_{i} \sim\) Uniform\((130, 170)\) means that all values of height between 130-170 are equally likely
\(i\) is an index and here represents individuals in the dataset
= indicates a deterministic relationship
~ indicates that something is “distributed as”
\(~\)
Build estimator
A linear model:
\(E(W_i|H_i) = \alpha + \beta H_{i}\)
\(\alpha\) = the intercept
\(\beta\) = the slope
Our estimator:
\(W_{i}\)~\(Normal(\mu_{i},\sigma)\)
\(\mu_{i} = \alpha + \beta H_{i}\)
In words: \(W\) is distributed normally with mean that is a linear function of H
Priors
We can specify these to be very helpful. We simply want to design priors to stop the model hallucinating impossible outcomes e.g. negative weights.
Priors should express scientific knowledge, but softly. This is because the real process in nature is different to what we imagined and there needs to be room for this.
Some basic things we know about weight:
When height is zero, weight should be zero.
\(\alpha\) ~ Normal(0, 10) will achieve this
Weight increases with height in humans on average. So \(\beta\) should be positive
Weight in kgs is less than height in cms, so \(\beta\) should be less than 1
\(\beta\) ~ Uniform(0, 1) will achieve this
\(\sigma\) must be positive
\(\sigma\) ~ Uniform(0, 10)
Let's plot these priors.
Code
n_lines <- 50

tibble(n = 1:n_lines,
       a = rnorm(n_lines, 0, 10),
       b = runif(n_lines, 0, 1)) %>%
  expand(nesting(n, a, b), height = 130:170) %>%
  mutate(weight = a + b * (height)) %>%
  # plot
  ggplot(aes(x = height, y = weight, group = n)) +
  geom_line(alpha = 1/1.5, linewidth = 1, colour = met.brewer("Hiroshige")[2]) +
  scale_x_continuous(expand = c(0, 0)) +
  coord_cartesian(ylim = c(0, 100),
                  xlim = c(130, 170)) +
  theme_classic()
Woah, ok the slope looks ok, but the intercept is wild.
This is because we set a very high sd value for \(\alpha\)
We can fix this, but for a problem like this, the data will overwhelm the prior.
\(~\)
Back to the owl: brms time
Let's validate our model.
Let's simulate again with our sim_weight() function.
Code
sim_data_100 <- tibble(H = runif(100, min = 130, max = 170)) %>%
  mutate(W = sim_weight(H, b = 0.5, sd = 5))
Fit the model
Code
weight_synthetic_model <- brm(W ~ 1 + H,
                              family = gaussian,
                              data = sim_data_100,
                              prior = c(prior(normal(0, 10), class = Intercept),
                                        prior(uniform(0, 1), class = b, lb = 0, ub = 1),
                                        prior(uniform(0, 10), class = sigma, lb = 0, ub = 10)),
                              chains = 4, cores = 4, iter = 6000, warmup = 2000, seed = 1,
                              file = "fits/weight_synthetic_model")
Get the model summary
Code
weight_synthetic_model
Family: gaussian
Links: mu = identity; sigma = identity
Formula: W ~ 1 + H
Data: sim_data_100 (Number of observations: 100)
Draws: 4 chains, each with iter = 6000; warmup = 2000; thin = 1;
total post-warmup draws = 16000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 12.28 6.51 -0.70 25.06 1.00 15740 11165
H 0.42 0.04 0.34 0.51 1.00 15754 11351
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 4.91 0.35 4.28 5.67 1.00 14738 10608
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Right, so H is close to 0.5 and sigma is close to 5, the values we used in the simulation. These aren't perfect because we have a smallish sample and relatively wild priors.
Now the same kind of model fit to the real Kalahari adult data (weight regressed on height):
Family: gaussian
Links: mu = identity; sigma = identity
Formula: weight ~ 1 + height
Data: Kalahari_adults (Number of observations: 346)
Draws: 4 chains, each with iter = 6000; warmup = 2000; thin = 1;
total post-warmup draws = 16000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -51.66 4.55 -60.54 -42.77 1.00 16472 11740
height 0.63 0.03 0.57 0.68 1.00 16478 11912
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 4.27 0.16 3.97 4.60 1.00 15291 11571
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
\(~\)
OBEY THE LAW
\(~\)
Law 1. Parameters are not independent of one another and cannot always be interpreted independently. They all act simultaneously on the predictions (think about this in the context of the fitted() function).
Instead we can push out posterior predictions from the model and describe/interpret those.
This plot shows 1) the raw data, 2) the predicted relationship between height and weight with 3) 95% uncertainty intervals for the mean and 4) 95% prediction intervals for where data points are predicted to fall within.
predict() reports prediction intervals, which are simulations that are the joint consequence of both the mean and sigma, unlike the results of fitted(), which only reflect the mean.
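A small sketch of that contrast using the synthetic model from above (new_heights is a hypothetical grid of my own):

new_heights <- tibble(H = seq(130, 170, by = 10))

fitted(weight_synthetic_model, newdata = new_heights)   # uncertainty in the mean only
predict(weight_synthetic_model, newdata = new_heights)  # mean plus sigma: prediction intervals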
\(~\)
Lecture 4: Categories and Curves
\(~\)
Categories
Want to stratify by categories in our data. This means fitting a separate regression line for each category.
Let's return to the Kalahari data. Now we'll add in a categorical variable: the sex of the individual.
How are height, sex and weight causally associated?
How are height, sex and weight statistically related?
Let's build a DAG:
First, we know that height causally affects weight. You can change your own weight without changing your height. However, as you get taller there is more of you, thus it is reasonable to expect height to affect weight.
We also know that sex influences height, but it is silly to say that height influences sex.
Third, if there is any influence we expect sex to influence weight not the other way around. Therefore weight may be influenced by both height and sex.
Let's draw this DAG.
Code
dagify(W ~ S + H,
       H ~ S,
       labels = c("W" = "Weight", "H" = "Height", "S" = "Sex")) %>%
  gg_simple_dag()
This is a mediation graph. Note that effects do not stop at one variable, instead they continue on through the path. In that way sex has both a direct and indirect effect on weight. The indirect effect may be through height, which is ‘contaminated’ by sex.
Statistically:
\(H = f_{H}(S)\)
\(W = f_{W}(H, S)\)
Following our workflow, let's simulate a more complex dataset that this time includes separate sexes.
To keep in line with the Kalahari data we’ll code females = 1 and males = 2
Code
sim_HW <- function(S, b, a) {
  N <- length(S)
  H <- if_else(S == 1, 150, 160) + rnorm(N, 0, 5)
  W <- a[S] + b[S] * H + rnorm(N, 0, 5)
  tibble(S, H, W)
}
Give it some data and run sim_HW()
Code
S <- rethinking::rbern(100) + 1

(synthetic_data <- sim_HW(S, b = c(0.5, 0.6), a = c(0, 0)))
# A tibble: 100 × 3
S H W
<dbl> <dbl> <dbl>
1 2 161. 98.8
2 1 156. 81.0
3 1 148. 65.0
4 2 161. 98.2
5 1 148. 80.3
6 1 156. 75.2
7 1 142. 71.9
8 2 155. 94.0
9 2 162. 95.8
10 2 156. 95.1
# … with 90 more rows
Define the questions we’re going to ask:
Different questions lead to a need for different stats models.
Q: Causal effect of H on W?
Q: Causal effect of S on W?
Q: Direct effect of S on W?
Each require different components of the DAG.
Drawing the categorical OWL:
There are several ways to code categorical variables
Dummy or indicator variables
Series of 0 1 variables that stand in for categories
Index variables
Assign an index value to each category
Better for specifying priors
Extend effortlessly to multi-level models
What we will use
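A tiny illustration of the difference between the two coding schemes (my own toy example, not from the lecture):

tibble(sex = c("female", "male", "female")) %>%
  mutate(male_dummy = if_else(sex == "male", 1, 0),    # dummy/indicator coding
         sex_index  = if_else(sex == "female", 1, 2))  # index coding (what we use below)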
\(~\)
Q: What is the causal effect of S on W?
\(~\)
Using index variables
Estimating average weight:
\(W_{i} \sim Normal(\mu_{i}, \sigma)\)
\(\mu_{i} = \alpha_{S[i]}\) where \(S[i]\) is the sex of the i-th person
S = 1 indicates female, S = 2 indicates male
\(\alpha = [\alpha_{1}, \alpha_{2}]\) this means there are two intercepts, one for each sex
Priors
\(\alpha_{j} \sim Normal(60, 10)\)
All this says is that there is more than one value of \(\alpha\) (indicated by the subscript \(j\)) and that we want the prior to be the same for each of those values. The values correspond to the categories.
We’ll let the sample update the prior and tell us if sexual dimorphism exists
Right, ready to model? Not yet! More simulation. This might seem like overkill, but it will help so much to get into this habit.
We’ll construct a female and male sample and look at the average difference. We’ll only change sex.
We’ll find the difference between the sexes in our simulation (remember we coded a stronger effect of height on male weight than on female weight):
Code
# female sample
S <- rep(1, 100)
simF <- sim_HW(S, b = c(0.5, 0.6), a = c(0, 0))

# male sample
S <- rep(2, 100)
simM <- sim_HW(S, b = c(0.5, 0.6), a = c(0, 0))

# effect of Sex (male - female)
mean(simM$W - simF$W)
[1] 19.53385
OK, a difference of ~20 kg.
Now let's test the estimator.
Note that we have specified a 0 intercept. In brms this will result in an output that produces a separate intercept for each categorical variable.
Code
synthetic_kalahari_sex_model <- brm(data = synthetic_data,
                                    W ~ 0 + as.factor(S),
                                    prior = c(prior(normal(60, 10), class = b),
                                              prior(exponential(1), class = sigma)),
                                    iter = 2000, warmup = 1000, chains = 4, cores = 4,
                                    seed = 1,
                                    file = "fits/synthetic_kalahari_sex")

synthetic_kalahari_sex_model
Family: gaussian
Links: mu = identity; sigma = identity
Formula: W ~ 0 + as.factor(S)
Data: synthetic_data (Number of observations: 100)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
as.factorS1 75.69 0.82 74.13 77.31 1.00 4038 2899
as.factorS2 96.13 0.74 94.69 97.63 1.00 4126 2946
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 5.50 0.36 4.86 6.29 1.00 4199 2984
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Again we find a difference between the sexes of ~20 kg. Looks like our model tests what we want to test.
Now let's use the real data.
Code
# make the index variable to match McElreath's dataset
Kalahari_adults <- Kalahari_adults %>%
  mutate(Sex = if_else(male == "1", 2, 1),
         Sex = as.factor(Sex))

kalahari_sex_model <- brm(data = Kalahari_adults,
                          weight ~ 0 + Sex,
                          prior = c(prior(normal(60, 10), class = b),
                                    prior(exponential(1), class = sigma)),
                          iter = 2000, warmup = 1000, chains = 4, cores = 4,
                          seed = 1,
                          file = "fits/_kalahari_sex_model")
Find the contrasts and plot the average differences between women and men
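The contrast code itself isn't shown in these notes; a hedged sketch of how it might look with the kalahari_sex_model from above (the coefficient names b_Sex1 and b_Sex2 follow brms's naming for a 0 + Sex factor formula):

as_draws_df(kalahari_sex_model) %>%
  mutate(weight_contrast = b_Sex2 - b_Sex1) %>%   # male minus female
  ggplot(aes(x = weight_contrast)) +
  geom_density(fill = "grey70") +
  labs(x = "Posterior mean weight contrast (kg), male - female",
       y = "Density") +
  theme_bw()

The next model then also stratifies by standardised height, to get at the direct effect of sex: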
Kalahari_h_S_model <- brm(data = Kalahari_adults,
                          weight ~ 0 + Sex + height_standard,
                          prior = c(# prior(normal(60, 10), class = a),
                                    prior(lognormal(0, 1), class = b, lb = 0),
                                    prior(exponential(1), class = sigma)),
                          iter = 4000, warmup = 2000, chains = 4, cores = 4,
                          seed = 1,
                          file = "fits/Kalahari_h_S_model")
I'm not sure how to make McElreath's plot here - he brilliantly finds the effect of sex at 50 different heights then plots. The best I can do is the overall effect of sex on weight after accounting for the effect of height.
The takeaway however, is that nearly all of the causal effect of sex acts through height.
Code
draws_2 <- as_draws_df(Kalahari_h_S_model) %>%
  mutate(weight_contrast = b_Sex2 - b_Sex1)

Female_dist_2 <- rnorm(1000, draws_2$b_Sex1, draws_2$sigma) %>%
  as_tibble() %>%
  rename(Female = value)

Male_dist_2 <- rnorm(1000, draws_2$b_Sex2, draws_2$sigma) %>%
  as_tibble() %>%
  rename(Male = value)

plot_1 <- cbind(Female_dist_2, Male_dist_2) %>%
  pivot_longer(cols = c("Female", "Male")) %>%
  ggplot(aes(x = value, group = name, colour = name, fill = name)) +
  geom_density(alpha = 0.8) +
  scale_colour_manual(values = c("salmon", "darkcyan")) +
  scale_fill_manual(values = c("salmon", "darkcyan")) +
  xlab("Posterior predicted weights (kgs)\nafter accounting for the effect of height") +
  ylab("Density") +
  theme_bw() +
  theme(panel.grid = element_blank(),
        legend.position = "none")

plot_2 <- cbind(Female_dist_2, Male_dist_2) %>%
  as_tibble() %>%
  mutate(predictive_diff = Male - Female) %>%
  ggplot(aes(x = predictive_diff, colour = "orange", fill = "orange")) +
  geom_density(alpha = 0.8) +
  xlab("Posterior predictive weight difference (kgs)\nafter accounting for the effect of height") +
  ylab("Density") +
  theme_bw() +
  theme(panel.grid = element_blank(),
        legend.position = "none")

plot_1 / plot_2
\(~\)
Curves with lines
\(~\)
Many causal relationships are obviously not linear.
Linear models can fit curves quite easily, but be wary that this is geocentric (not mechanistic).
For example, let's examine the full Kalahari dataset, which includes children.
Note the \(x^{2}\), you can use higher and higher order terms and the model becomes more flexible.
Issues
There is an unhelpful symmetry to polynomials. For example, with a second-order term the model believes the data form a parabola and it's very hard to convince it otherwise.
The uncertainty at the edges of the data can be huge, making extrapolation and prediction very difficult.
There is no local smoothing so any data point can massively change the shape of a curve.
Note our use of the coef argument within our prior statements. Since \(\beta_{1}\) and \(\beta_{2}\) are both parameters of class = b within the brms set-up, we need to use the coef argument in cases when we want their priors to differ.
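The polynomial model itself isn't shown here; a hedged sketch of what it might look like, with the squared term pre-computed so each coefficient can get its own prior via coef (Kalahari_full, height_standard and the prior values are my assumptions, not the original code):

Kalahari_full <- Kalahari_full %>%
  mutate(height_standard2 = height_standard^2)

kalahari_poly_model <- brm(data = Kalahari_full,
                           family = gaussian,
                           weight ~ 1 + height_standard + height_standard2,
                           prior = c(prior(normal(60, 10), class = Intercept),
                                     prior(lognormal(0, 1), class = b, coef = height_standard),
                                     prior(normal(0, 1), class = b, coef = height_standard2),
                                     prior(exponential(1), class = sigma)),
                           iter = 2000, warmup = 1000, chains = 4, cores = 4,
                           seed = 4)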
At the left extreme, the model predicts that weight will increase as height decreases. This is because we have told the model the data is curved.
\(~\)
Splines
\(~\)
A big family of functions designed to do local smoothing.
This means that only the points within a region determine the shape of the function in that region. In polynomials the whole shape is affected by each data point. This results in splines being far more flexible than polynomials.
Once again, they have no mechanistic reality.
What is a spline?
Originally, flexible strips used in drafting for architecture. They create wriggly lines.
\(w\) = the weight parameters - like slopes, they determine the importance of the different basis functions for the predicted mean. These can overlap somewhat.
\(B\) = the basis functions - you make these - they only have positive values in narrow regions of the x-axis. They turn on the weights in isolated regions of the x-axis.
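A small sketch of what basis functions look like, using splines::bs() (the number of basis functions and the height range are my own choices, not the lecture's):

library(splines)

height_seq <- seq(130, 170, length.out = 200)
B <- bs(height_seq, df = 7, degree = 3, intercept = TRUE)  # a matrix of basis functions

as.data.frame(B) %>%
  mutate(height = height_seq) %>%
  pivot_longer(-height, names_to = "basis", values_to = "value") %>%
  ggplot(aes(x = height, y = value, group = basis)) +
  geom_line() +
  labs(y = "basis value") +
  theme_minimal()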
\(~\)
Lecture 5: Elemental Confounds
\(~\)
Correlation in general is not surprising. In large data sets, every pair of variables has a statistically discernible non-zero correlation. But since most correlations do not indicate causal relationships, we need tools for distinguishing mere association from evidence of causation. This is why so much effort is devoted to multiple regression, using more than one predictor variable to simultaneously model an outcome.
Waffle house causes divorce?
Code
data(WaffleDivorce, package = "rethinking")  # this loads WaffleDivorce as a dataframe without loading rethinking

# Standardise as discussed below. We can use the rethinking package
Waffle_data <- WaffleDivorce %>%
  mutate(d = rethinking::standardize(Divorce),
         m = rethinking::standardize(Marriage),
         a = rethinking::standardize(MedianAgeMarriage))

# time to plot
Waffle_data %>%
  ggplot(aes(x = WaffleHouses / Population, y = Divorce)) +
  stat_smooth(method = "lm", fullrange = T, size = 1/2,
              color = "firebrick4", fill = "firebrick", alpha = 1/5) +
  geom_point(size = 3.5, color = "firebrick4", alpha = 1/2) +
  geom_text_repel(data = Waffle_data %>% filter(Loc %in% c("ME", "OK", "AR", "AL", "GA", "SC", "NJ")),
                  aes(label = Loc),
                  size = 3, seed = 1042) +  # this makes it reproducible
  scale_x_continuous("Waffle Houses per million", limits = c(0, 55)) +
  ylab("Divorce rate") +
  theme_bw() +
  theme(panel.grid = element_blank())
There is a statistical association! But surely this isn't causal. We'll get back to this later.
\(~\)
The four elemental confounds
\(~\)
Confounds: some feature of the sample and how we use it that misleads us.
The Fork

\(~\)

X and Y are associated because they share a common cause, Z. However, this association disappears once we account for Z. That means that Y is independent of X once we account for Z.
Code
p1
Note that once we stratify by Z, what remains of X and Y is just their independent noise.
Let’s simulate:
Code
n <- 1000

tibble(Z = rbernoulli(n, 0.5)) %>%
  mutate(Z = if_else(Z == "TRUE", 1, 0),
         X = rbernoulli(n, (1 - Z) * 0.1 + Z * 0.9),
         Y = rbernoulli(n, (1 - Z) * 0.1 + Z * 0.9),
         X = if_else(X == "TRUE", 1, 0),
         Y = if_else(Y == "TRUE", 1, 0)) %>%
  count(X, Y)
# A tibble: 4 × 3
X Y n
<dbl> <dbl> <int>
1 0 0 425
2 0 1 97
3 1 0 77
4 1 1 401
See how this works? Our simulation specifies no relation between X and Y except through Z, but they are super positively correlated.
Before we talk about the exponential prior, let's standardise the data (done above but explained here).
Why standardise?
To standardise, subtract the mean and divide each value by the standard deviation. Variables become Z-scores: values represent SDs from the mean, and 0 is the mean.
This makes computation more efficient.
Priors are easier to choose because we have some understanding about effect size.
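In code, standardising is just this (a trivial sketch of my own; it should match rethinking::standardize()):

standardise <- function(x) (x - mean(x)) / sd(x)

standardise(WaffleDivorce$Divorce) %>% head()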
Let's do some prior simulation, first with weak priors:
\(\alpha = Normal(0, 10)\)
\(\beta_{M} = Normal(0, 10)\)
\(\beta_{A} = Normal(0, 10)\)
\(\sigma = Exponential(1)\)
These priors are so broad they are essentially flat. This is a bad idea, because almost all of the prior probability mass corresponds to implausibly strong positive or negative relationships.
The plot below shows the stupidly steep slopes these priors predict.
Code
# fit a model with bad priors
Waffle_model_bad_priors <- brm(data = Waffle_data,
                               family = gaussian,
                               d ~ 1 + a,
                               prior = c(prior(normal(0, 10), class = Intercept),
                                         prior(normal(0, 10), class = b),
                                         prior(exponential(1), class = sigma)),
                               iter = 2000, warmup = 1000, chains = 4, cores = 4,
                               seed = 5,
                               sample_prior = T,
                               file = "fits/Waffle_model_bad_priors")

set.seed(5)

prior_draws(Waffle_model_bad_priors) %>%
  slice_sample(n = 50) %>%
  rownames_to_column("draw") %>%
  expand(nesting(draw, Intercept, b),
         a = c(-2, 2)) %>%
  mutate(d = Intercept + b * a) %>%
  ggplot(aes(x = a, y = d)) +
  geom_line(aes(group = draw),
            color = "firebrick", alpha = .4) +
  labs(x = "Median age marriage (std)",
       y = "Divorce rate (std)") +
  coord_cartesian(ylim = c(-2, 2)) +
  theme_bw() +
  theme(panel.grid = element_blank())
Better priors
\(\alpha = Normal(0, 0.2)\)
\(\beta_{M} = Normal(0, 0.5)\)
\(\beta_{A} = Normal(0, 0.5)\)
\(\sigma = Exponential(1)\)
Re-sample
Code
Marriage_age_model <- brm(data = Waffle_data,
                          family = gaussian,
                          d ~ 1 + a,
                          prior = c(prior(normal(0, 0.2), class = Intercept),
                                    prior(normal(0, 0.5), class = b),
                                    prior(exponential(1), class = sigma)),
                          iter = 2000, warmup = 1000, chains = 4, cores = 4,
                          seed = 5,
                          sample_prior = T,
                          file = "fits/Marriage_age_model")

Marriage_age_model
Family: gaussian
Links: mu = identity; sigma = identity
Formula: d ~ 1 + a
Data: Waffle_data (Number of observations: 50)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.00 0.10 -0.20 0.20 1.00 3537 2678
a -0.57 0.12 -0.79 -0.34 1.00 3801 2753
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.82 0.09 0.68 1.01 1.00 2735 2675
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Let's have a look at that plot again with our updated priors.
Code
prior_draws(Marriage_age_model) %>%
  slice_sample(n = 50) %>%
  rownames_to_column("draw") %>%
  expand(nesting(draw, Intercept, b),
         a = c(-2, 2)) %>%
  mutate(d = Intercept + b * a) %>%
  ggplot(aes(x = a, y = d)) +
  geom_line(aes(group = draw),
            color = "firebrick", alpha = .4) +
  labs(x = "Median age marriage (std)",
       y = "Divorce rate (std)") +
  coord_cartesian(ylim = c(-2, 2)) +
  theme_bw() +
  theme(panel.grid = element_blank())
Much better!
Let's test for the causal effect of marriage while closing the fork (by also stratifying by age at marriage).
Code
marriage_model <- brm(data = Waffle_data,
                      family = gaussian,
                      d ~ 1 + m + a,
                      prior = c(prior(normal(0, 0.2), class = Intercept),
                                prior(normal(0, 0.5), class = b),
                                prior(exponential(1), class = sigma)),
                      iter = 2000, warmup = 1000, chains = 4, cores = 4,
                      seed = 5,
                      file = "fits/marriage_model")

marriage_model
Family: gaussian
Links: mu = identity; sigma = identity
Formula: d ~ 1 + m + a
Data: Waffle_data (Number of observations: 50)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -0.00 0.10 -0.20 0.20 1.00 3514 2688
m -0.06 0.16 -0.38 0.26 1.00 2519 2660
a -0.61 0.16 -0.93 -0.28 1.00 2626 2584
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.83 0.09 0.68 1.02 1.00 3296 2324
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
OK, the exponential distribution:
Constrained to be positive.
Its only information is the average displacement from 0, hence the single value, also called the rate (the mean is 1/rate).
A rate of 1 means an average of roughly 1 SD from 0 (for standardised variables).
Great for sigma.
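A quick sanity check of that claim (my own one-liner):

# the mean of an Exponential(rate = 1) prior is 1/rate = 1
mean(rexp(1e5, rate = 1))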
Let's look at the coefficients:
Code
mcmc_plot(marriage_model)
There is a very small effect of marriage, but we aren’t sure if this very small effect is positive or negative. Essentially no causal effect on divorce.
Age of marriage has a large negative effect though.
\(~\)
Gold standard: simulating an intervention from the data
Code
post <- as_draws_df(marriage_model)

n <- 4000
A <- sample(Waffle_data$a, size = n, replace = T)

# simulate divorce for M = 0
bind_cols(post %>%
            mutate(m_0 = rnorm(n, b_Intercept + b_m * 0 + b_a * A, sigma)) %>%
            select(m_0),
          # now 1 sd above the mean
          post %>%
            mutate(m_1 = rnorm(n, b_Intercept + b_m * 1 + b_a * A, sigma)) %>%
            select(m_1)) %>%
  mutate(diff = m_1 - m_0) %>%
  ggplot(aes(x = diff)) +
  geom_density(fill = met.brewer("Hokusai2")[1]) +
  labs(x = "Effect of 1 sd increase in M on D") +
  # coord_cartesian(ylim = c(-2, 2)) +
  theme_bw() +
  theme(panel.grid = element_blank())
\(~\)
The Pipe
\(~\)
Structurally different to the fork, but statistically handled similarly.
In this case Z is a mediator variable. That is, the influence of X on Y is transmitted through Z.
Code
p2
Simulate to understand. Look carefully, we don’t code any effect of X on Y or vice-versa.
Code
n <- 1000

tibble(X = rbernoulli(n, 0.5)) %>%
  mutate(X = if_else(X == "TRUE", 1, 0),
         Z = rbernoulli(n, (1 - X) * 0.1 + X * 0.9),
         Y = rbernoulli(n, (1 - Z) * 0.1 + Z * 0.9),
         Z = if_else(Z == "TRUE", 1, 0),
         Y = if_else(Y == "TRUE", 1, 0)) %>%
  count(X, Y)
# A tibble: 4 × 3
X Y n
<dbl> <dbl> <int>
1 0 0 431
2 0 1 90
3 1 0 111
4 1 1 368
Yet there appears to be an effect! This is because X affects Z, which affects Y.
An example: a plant growth experiment
Create the data
Code
# how many plants would you like?
n <- 100

# h0 is the starting height
# treatment is an antifungal
# fungus is fungus presence
# h1 is end of experiment plant height
set.seed(71)

plant_experiment <- tibble(h0        = rnorm(n, mean = 10, sd = 2),
                           treatment = rep(0:1, each = n / 2),
                           fungus    = rbinom(n, size = 1, prob = .5 - treatment * 0.4),
                           h1        = h0 + rnorm(n, mean = 5 - 3 * fungus, sd = 1)) %>%
  mutate(treatment = as.factor(treatment),
         fungus    = as.factor(fungus))
There are 100 plants troubled by fungal growth. Half the plants receive anti-fungal treatments. We measure growth and presence of fungus.
The estimand is the effect of anti-fungal treatment on growth.
The correct statistical analysis here is to ignore fungal status. Stratifying by fungus would block the pipe and remove the effect of the very treatment we want to evaluate. To see why, here is the output of the model that (wrongly) includes fungus:
Family: gaussian
Links: mu = identity; sigma = identity
Formula: h1 ~ 0 + h0 + treatment + fungus
Data: plant_experiment (Number of observations: 100)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
h0 1.34 0.03 1.28 1.41 1.00 1371 1559
treatment0 1.00 0.36 0.28 1.68 1.00 1468 2101
treatment1 1.46 0.37 0.71 2.16 1.00 1481 2258
fungus1 -1.73 0.29 -2.28 -1.15 1.00 2674 2344
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.33 0.11 1.13 1.57 1.00 1784 2131
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
The treatment works through fungal growth, so if we account for fungal growth, of course we will find no effect. So the message is not to stratify by an effect of the treatment you’re interested in.
To prove this, let's fit the more logical model:
Code
fungal_model_causal <- brm(data = plant_experiment,
                           family = gaussian,
                           h1 ~ 0 + h0 + treatment,
                           prior = c(prior(normal(0, 0.5), class = b),
                                     prior(exponential(1), class = sigma)),
                           iter = 2000, warmup = 1000, chains = 4, cores = 4,
                           seed = 5,
                           file = "fits/fungal_model_causal")

fungal_model_causal
Family: gaussian
Links: mu = identity; sigma = identity
Formula: h1 ~ 0 + h0 + treatment
Data: plant_experiment (Number of observations: 100)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
h0 1.34 0.04 1.27 1.41 1.00 1293 1603
treatment0 -0.01 0.37 -0.75 0.69 1.00 1454 1903
treatment1 1.04 0.38 0.30 1.79 1.00 1338 1618
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.72 0.12 1.49 1.98 1.00 2100 2128
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
You should be very wary of including consequences of your treatment other than the desired outcome variable because of the pipe.
\(~\)
The Collider
\(~\)
Code
p3
X and Y are not associated but both influence Z.
If you stratify by Z, then X and Y become associated! How??
If we learn the value of Z, then learning X also tells us something about Y. Still confused?
An example: Selection bias in grants
There are 200 grant applications
Rated on trustworthiness and newsworthiness
These are uncorrelated variables
Load in the data and plot the relationship
Code
set.seed(1914)

n <- 200  # number of grant proposals
p <- 0.1  # proportion to select

grants <-
  # uncorrelated newsworthiness and trustworthiness
  tibble(newsworthiness  = rnorm(n, mean = 0, sd = 1),
         trustworthiness = rnorm(n, mean = 0, sd = 1)) %>%
  # total_score
  mutate(total_score = newsworthiness + trustworthiness) %>%
  # select top 10% of combined scores
  mutate(selected = ifelse(total_score >= quantile(total_score, 1 - p), TRUE, FALSE))
Code
# we'll need this for the annotation
text <- tibble(newsworthiness  = c(2, 1),
               trustworthiness = c(2.25, -2.5),
               selected        = c(TRUE, FALSE),
               label           = c("selected", "rejected"))

grants %>%
  ggplot(aes(x = newsworthiness, y = trustworthiness, color = selected)) +
  geom_point(aes(shape = selected), alpha = 3/4, stroke = 1.4) +
  geom_text(data = text,
            aes(label = label)) +
  geom_smooth(data = . %>% filter(selected == TRUE),
              method = "lm", fullrange = T,
              color = "lightblue", se = F, size = 1/2) +
  scale_color_manual(values = c("black", "lightblue")) +
  scale_shape_manual(values = c(1, 19)) +
  scale_x_continuous(limits = c(-3, 3.9), expand = c(0, 0)) +
  coord_cartesian(ylim = range(grants$trustworthiness)) +
  theme(legend.position = "none") +
  theme_minimal()
After selection there is a strong negative correlation.
OK, so due to selection, the only grants that make it are either high in newsworthiness or high in trustworthiness (makes sense). From a random distribution, few grants will be high in both; that is much rarer than being high in just one of the categories. The result is a subset of grants that are good at one thing but mostly bad at the other, which creates the appearance of a negative correlation within the selected sample. The key is to not subset the data (don't stratify).
Example 2: selection bias in restaurants
Restaurants are successful because they have good food or because they’re in a good location.
Only the bad restaurants in really good locations can survive.
Creates a negative relationship between food quality and location.
Example 3: selection bias in actors
Actors can be successful because they are skilled or because they are good looking.
Really bad actors can survive if they are attractive - they end up being the only less skilled actors that can survive.
Selection creates a negative relationship between looks and skill.
\(~\)
Stats example: endogenous colliders
If you condition on a collider, you create phantom non-causal associations.
Suppose age has no effect on happiness but both affect marital status.
So the collider scenario here is that, among married people, only a very advanced age can overcome unhappiness, and only a very happy disposition can overcome a very young age = a negative relationship.
If the visualisation isn't enough to convince you, let's fit the models.
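The marriage_sim data frame isn't created anywhere in these notes; presumably it comes from McElreath's agent-based happiness simulation in the rethinking package (an assumption on my part):

marriage_sim <- rethinking::sim_happiness(seed = 1977, N_years = 1000)
head(marriage_sim)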
First with the collider
Code
marriage_sim_2 <- marriage_sim %>%
  mutate(mid = factor(married + 1, labels = c("single", "married"))) %>%
  # only include those 18 and above
  filter(age > 17) %>%
  # create a, a standardised variable where 18 = 0 and 65 = 1
  mutate(a = (age - 18) / (65 - 18))

marriage_happiness_model <- brm(data = marriage_sim_2,
                                family = gaussian,
                                happiness ~ 0 + mid + a,
                                prior = c(prior(normal(0, 1), class = b, coef = midmarried),
                                          prior(normal(0, 1), class = b, coef = midsingle),
                                          prior(normal(0, 2), class = b, coef = a),
                                          prior(exponential(1), class = sigma)),
                                iter = 2000, warmup = 1000, chains = 4, cores = 4,
                                seed = 6,
                                file = "fits/marriage_happiness_model")

marriage_happiness_model
Family: gaussian
Links: mu = identity; sigma = identity
Formula: happiness ~ 0 + mid + a
Data: marriage_sim_2 (Number of observations: 960)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
midsingle -0.24 0.06 -0.36 -0.11 1.00 1732 2058
midmarried 1.26 0.08 1.09 1.42 1.00 1758 1943
a -0.75 0.11 -0.96 -0.53 1.00 1508 2084
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.99 0.02 0.95 1.04 1.00 2479 2033
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
As we expected, age appears to be negatively associated with happiness.
Now remove the collider
Code
marriage_happiness_model_causal <- brm(data = marriage_sim_2,
                                       family = gaussian,
                                       happiness ~ 0 + a,
                                       prior = c(prior(normal(0, 2), class = b, coef = a),
                                                 prior(exponential(1), class = sigma)),
                                       iter = 2000, warmup = 1000, chains = 4, cores = 4,
                                       seed = 6,
                                       file = "fits/marriage_happiness_model_causal")

marriage_happiness_model_causal
Family: gaussian
Links: mu = identity; sigma = identity
Formula: happiness ~ 0 + a
Data: marriage_sim_2 (Number of observations: 960)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
a -0.00 0.07 -0.13 0.14 1.00 3281 2571
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.21 0.03 1.16 1.27 1.00 3192 2242
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Age has literally no effect on happiness now, as we specified in the simulation.
\(~\)
The Descendant
\(~\)
Code
d4 <- dagify(Z ~ X + Y,
             D ~ Z,
             coords = tibble(name = c("X", "Y", "Z", "D"),
                             x = c(1, 3, 2, 2),
                             y = c(1, 1, 2, 1.05)))

p4 <- gg_simple_dag(d4) +
  labs(subtitle = "The Descendant")

p4
The variable D is the descendant of the Z variable.
The consequence is that if we condition on (stratify by) D, it has a similar effect to conditioning on Z. While D does not affect X or Y, it is related to Z, so it has a weaker (depending on the strength of the Z -> D association) but tangible effect on the X/Y association.
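A toy simulation of that logic (entirely my own, not from the lecture): D is a noisy readout of the collider Z, so stratifying by D behaves like weakly stratifying by Z.

n <- 1000

descendant_sim <- tibble(X = rnorm(n),
                         Y = rnorm(n),
                         Z = rbinom(n, size = 1, prob = plogis(X + Y)),            # Z is a collider of X and Y
                         D = rbinom(n, size = 1, prob = if_else(Z == 1, 0.9, 0.1)))  # D is a noisy copy of Z

# X and Y are simulated as independent, but within levels of D they become correlated
descendant_sim %>%
  group_by(D) %>%
  summarise(cor_XY = cor(X, Y))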
Think about how common descendants are. Any proxy is a descendant!
E.g. all fitness measures are proxies and therefore descendants of true fitness.
\(~\)
Lecture 6: Do calculus and good and bad controls
\(~\)
Unobserved confounds
\(~\)
Suppose we are interested in the direct effect of grandparents' education on their grandchildren's education.
They will have direct and indirect effects on these kids
Estimand: Direct effect of grandparents G on children C.
But parent education is obviously an issue here.
Code
dag_coords <- tibble(name = c("G", "P", "C", "U"),
                     x = c(1, 2, 2, 3),
                     y = c(2, 2, 1, 1.5))

dagify(P ~ G + U,
       C ~ P + G + U,
       coords = dag_coords) %>%
  gg_simple_dag()
The parental effect is likely confounded with living conditions. However, this might not be the case for grandparents who live in a different area.
Remember what our goal is: estimate the direct causal effect of grandparents.
If the situation was this:
Code
dagify(P ~ G,
       C ~ P + G,
       coords = dag_coords) %>%
  gg_simple_dag()
Then we would just have to stratify by P to block the pipe and find the direct effect of G.
Unfortunately, U exists, which makes P a collider (G -> P <- U). Stratifying by P therefore opens a non-causal path between G and C through U, because U affects both P and C.
We are left with only bad choices. Sometimes this is the way it is.
Why does collider bias happen: continuing with the parental example.
Let’s simulate the data, with G having no causal effect on C.
Code
# how many grandparent-parent-child triads would you like?
n <- 200

b_gp <- 1  # direct effect of G on P
b_gc <- 0  # direct effect of G on C
b_pc <- 1  # direct effect of P on C
b_u  <- 2  # direct effect of U on P and C

# simulate triads
set.seed(1)

d <- tibble(u = 2 * rbinom(n, size = 1, prob = .5) - 1,
            g = rnorm(n, mean = 0, sd = 1)) %>%
  mutate(p = rnorm(n, mean = b_gp * g + b_u * u, sd = 1)) %>%
  mutate(c = rnorm(n, mean = b_pc * p + b_gc * g + b_u * u, sd = 1))
Now we’ll fit the model without U - which closes the pipe but opens the collider.
Code
parent_collider_model <- brm(data = d,
                             family = gaussian,
                             c ~ 0 + Intercept + p + g,
                             prior = c(prior(normal(0, 1), class = b),
                                       prior(exponential(1), class = sigma)),
                             iter = 2000, warmup = 1000, chains = 4, cores = 4,
                             seed = 6,
                             file = "fits/parent_collider_model")

parent_collider_model
Family: gaussian
Links: mu = identity; sigma = identity
Formula: c ~ 0 + Intercept + p + g
Data: d (Number of observations: 200)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -0.12 0.10 -0.31 0.07 1.00 4158 2829
p 1.79 0.05 1.70 1.87 1.00 3550 3015
g -0.84 0.11 -1.04 -0.63 1.00 3927 2951
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.43 0.07 1.30 1.57 1.00 4142 2926
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Lets plot what’s happening
Code
d %>%mutate(centile =ifelse(p >=quantile(p, prob = .45) & p <=quantile(p, prob = .60), "a", "b"),u =factor(u)) %>%ggplot(aes(x = g, y = c)) +geom_point(aes(shape = centile, color = u),size =2.5, stroke =1/4) +stat_smooth(data = . %>%filter(centile =="a"),method ="lm", se = F, size =1/2, color ="black", fullrange = T) +scale_shape_manual(values =c(19, 1)) +scale_color_manual(values =c("black", "lightblue")) +theme(legend.position ="none")
Stratifying by P, as the regression line shows, creates a negative association between G and C even though nothing causal is happening!
Now we’ll fit the model with U
Code
grandparent_total <-update(parent_collider_model,newdata = d,formula = c ~0+ Intercept + p + g + u,seed =6,file ="fits/grandparent_total")grandparent_total
Family: gaussian
Links: mu = identity; sigma = identity
Formula: c ~ Intercept + p + g + u - 1
Data: d (Number of observations: 200)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -0.12 0.07 -0.26 0.03 1.00 3264 2707
p 1.01 0.07 0.88 1.15 1.00 1603 2175
g -0.04 0.10 -0.24 0.15 1.00 2016 2141
u 1.99 0.15 1.70 2.28 1.00 1621 2264
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.04 0.05 0.94 1.14 1.00 2871 2527
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Note the change in the effect of G on C. Remember that we coded it to have 0 effect in our simulation!
In short: there are two ways for parents to attain their education: from G or from U. This is the classic setup for a collider.
\(~\)
DAG thinking
\(~\)
In an experiment we aim to cut all the causes of the treatment through randomisation.
But if we don’t randomise, can we overcome this with statistics? Sometimes.
First some notation:
\(P(Y|do(X)) = P(Y|?)\)
Where P(Y) is the distribution of Y
So the above means the distribution of Y conditional upon do(X) equals the distribution of Y conditional on some set of information.
do(X) means to intervene on X - to set it to some particular value.
We can do this with a DAG!
Code
dag_coords <-tibble(name =c("U", "X", "Y"),x =c(2, 1, 3),y =c(2, 1, 1))dagify(X ~ U, Y ~ X + U,coords = dag_coords) %>%gg_simple_dag()
Even if we don't observe U, \(P(Y|do(X))\) is the distribution of Y, stratified by X and U, averaged over the distribution of U. But how do we get that?
Remember that the causal effect of X on Y is not the coefficient relating X to Y. It is actually the distribution of Y when we change X, averaged over the distributions of the control variables (U in this case). This is a marginal effect.
Do calculus
The back-door criterion
A way to analyse graphs to find sets of variables to stratify by
It can also help us design research.
Formally it is a rule to find a set of variables to stratify (condition) by to yield \(P(Y|do(X))\)
The steps:
Identify all paths connecting the treatment (X) to the outcome (Y).
Once you have the paths, focus on the paths that enter X. These are back-door paths (non-causal).
Find an adjustment set (a set of variables to stratify by) that closes/blocks all back-door paths. The dagitty package can find these sets for us automatically, as sketched below.
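As a quick sketch of how this can be automated (assuming the dagitty package, which ggdag is built on, is installed), adjustmentSets() returns the sets of variables that satisfy the back-door criterion; the DAG here is an illustrative fork, not one from the lecture.
Code
library(dagitty)

# an illustrative DAG: Z is a common cause of X and Y
g <- dagitty("dag{ X -> Y ; Z -> X ; Z -> Y }")

# which variables must we stratify by to estimate the effect of X on Y?
adjustmentSets(g, exposure = "X", outcome = "Y")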
Controls
Control variable: a variable we introduce to an analysis so that a causal estimate of something else is possible.
Things you should not do:
Add everything you’ve measured
Include every variable that isn't collinear with the others - collinearity is not a causal criterion; sometimes both variables genuinely play a role in the causal process
Assume anything measured before treatment is fair game to add - not true, pre-treatment variables can still be bad controls
Taking some examples from Cinelli, Forney and Pearl 2021: A crash course in good and bad controls
Example 1
Code
dag_coords <-tibble(name =c("u", "X", "Z", "v", "Y"),x =c(1, 1, 2, 3, 3),y =c(2, 1, 1.6, 2, 1))dagify(X ~ u, Z ~ u + v, Y ~ v + X,coords = dag_coords) %>%gg_simple_dag()
Where u and v are unobserved variables.
Let’s say that Z denotes friendship status and X and Y the health of two people. u and v are the hobbies of person X and person Y.
If we stratify by Z, that is, include it as a control, then we open a back-door path (it is a collider). This is bad for estimating the causal effect of X on Y.
Note that Z could be a pre-treatment variable (I haven’t been taught this but apparently it’s a thing).
Do calculus time - list the paths:
X -> Y: front door
X <- u -> Z <- v -> Y: backdoor. Closed unless you include Z. A bad control.
Example 2
Code
dag_coords <-tibble(name =c("X", "Z", "u", "Y"),x =c(1, 2, 2.5, 3),y =c(1, 1, 1.2, 1))dagify(Z ~ u + X, Y ~ Z + u,coords = dag_coords) %>%gg_simple_dag()
Analyse the paths:
X -> Z -> Y: front
X -> Z <- u -> Y: still front
Z is a mediator here, so if the goal is to estimate the total effect of X on Y, we should not include Z in the model anyway. There is no back-door path we have to block.
However, it's worse than that, because u - an unobserved variable - creates a fork between Z and Y. Including Z therefore makes the estimate even worse than it usually would be. Watch for this wherever two variables might share a common cause, e.g. happiness and lifespan, or body size and fecundity.
Add sim and plot
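Here is a minimal sketch of that simulation (my own, not from the lecture), using plain lm() fits for speed rather than brms; the variable names match the DAG above.
Code
# X affects Y only through the mediator Z; u is an unobserved common cause of Z and Y
set.seed(2)
n <- 1000
x <- rnorm(n)
u <- rnorm(n)
z <- rnorm(n, mean = x + u)
y <- rnorm(n, mean = z + u)

coef(lm(y ~ x))     # total effect of X is recovered (about 1)
coef(lm(y ~ x + z)) # conditioning on the mediator Z distorts the X coefficient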
Note that Z would turn out to be significant in such a model. This is spurious! Don't do backwards stepwise selection - you need a causal model.
\(~\)
Do not touch the collider
\(~\)
DAG example
Case control bias
Code
dag_coords <-tibble(name =c("X", "Z", "Y"),x =c(1, 2, 2),y =c(1, 2, 1))dagify(Z ~ Y, Y ~ X,coords = dag_coords) %>%gg_simple_dag()
Ok, so this is a simple pipe. But if we want to know the causal effect of X on Y then we run into an issue if we control for Z.
Imagine that X = education, Y = occupation and Z = Income. If we control for income, then this removes variation in Y, leaving less for education to explain - the effect size is underestimated. Now Z is not a cause of Y, but the model has no way of knowing that so it just measures the covariance. We are required to make this distinction.
A common bad control: applying selection on the outcome. This ruins scientific inference.
E.g. if you want to know the effect of education on occupation then you should not stratify by income because it narrows the range of cases you compare at each level.
A related bad control is the precision parasite, where Z is a cause of the treatment only (Z -> X -> Y). Although Z affects X, it does not open a back-door path, because it does not connect to Y except via the front door (through X). Even so, it is not good to condition on Z.
The parasite does not change the mean estimate, but it produces an estimate with higher variance. We lose precision.
Another sim
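A minimal sketch of the precision parasite (again my own simulation, with lm() for speed): Z causes X only, and the true effect of X on Y is 1.
Code
set.seed(3)
n <- 100
z <- rnorm(n)
x <- rnorm(n, mean = z)
y <- rnorm(n, mean = x)

# the X estimate is unbiased either way, but its standard error grows when Z is included
summary(lm(y ~ x))$coefficients["x", ]
summary(lm(y ~ x + z))$coefficients["x", ]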
\(~\)
Bias amplification
A bad version of the precision parasite.
Code
dag_coords <-tibble(name =c("X", "Z", "Y", "u"),x =c(1, 1, 2, 1.5),y =c(1, 2, 1, 1.75))dagify(X ~ Z + u, Y ~ X + u,coords = dag_coords) %>%gg_simple_dag()
Remember u is an unmeasured variable. We can’t get an unbiased causal inference in this scenario.
If we add Z, the world implodes. We get a really biased estimate, with an inflated effect size.
WHY?
Covariation between X and Y requires variation in their causes. Variation is removed by conditioning - we are accounting for this association then finding the variance explained by the remaining variables. So stratifying on X removes variation in X, which gives more weight to be explained by the confound U = stronger false relationship!
Hard to get my head around but wow.
Lets plot to help understanding.
We’ll sim a situation like the above. Z is a binomial variable that causally affects X. X has no effect on Y. However, u affects both X and Y.
If Z doesn’t vary then X can’t vary. Stratifying by this removes variance in X and in that way Y. This means a larger proportion of the variation in X comes from the confounding fork u.
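A minimal sketch of that bias-amplification simulation (my own, following the description above):
Code
set.seed(4)
n <- 1000
z <- rbinom(n, size = 1, prob = 0.5) # causes X only
u <- rnorm(n)                        # unobserved confound of X and Y
x <- rnorm(n, mean = 2 * z + u)
y <- rnorm(n, mean = u)              # X has no causal effect on Y

coef(lm(y ~ x))["x"]     # already biased away from 0 by u
coef(lm(y ~ x + z))["x"] # stratifying by Z amplifies that bias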
\(~\)
The Table 2 fallacy
\(~\)
In many fields the typical Table 2 in a manuscript will contain model coefficients.
The thing is, not all the coefficients are causal effects and there is no information that tells you whether they are.
Remember you have designed your model to identify the causal effect of X on Y. It is dangerous to interpret the effects of other variables that you have not designed the model for. The other variables are there to block back-door paths.
Code
dag_coords <-tibble(name =c("S", "A", "X", "Y"),x =c(1, 1, 2, 3),y =c(3, 1, 2, 2))dagify(S ~ A, X ~ S + A, Y ~ S + A + X,coords = dag_coords) %>%gg_simple_dag()
S = smoking
A = age
X = HIV
Y = stroke
We want to know the effect of X on Y - the effect of HIV on stroke risk.
Things to note:
Age has many effects, including some quite indirect ones
We need to decontaminate X of the effects of smoking and age.
Back-door criterion:
List paths
X -> Y
X <- S -> Y
X <- A -> Y
X <- A -> S -> Y
We need to close paths 2-4.
We can condition by A and S to effectively do this.
So we could fit a model that looks like this:
\[Y = X + S + A\] What if we focus on the effect of S?
Code
dagify(S ~ A, X ~ S + A, Y ~ S + A + X,coords = dag_coords) %>%gg_simple_dag()
S is confounded by A, as there is a back-door path. But we condition on A and X so that back-door path is closed.
BUT
What are we estimating for S? Well we have removed the indirect effect of S that acts through X, as we have closed that pipe. I.e. We are only estimating how smoking directly affects stroke, but not including how smoking acts on stroke through increasing the risk of HIV. This might be small or it might be large, we don’t know. The take home is this coefficient measures something different than that of X.
Now lets focus on Age.
It acts through everything, so we need not include any of the conditional variables to find its total effect on stroke (both direct and indirect). But we have conditioned on S and X. We have closed these pathways. We only get the direct effect, which may be tiny!
Take-home: don’t interpret all coefficients as unconditional effects!
How to report results
Either
Just report causal coefficients
Give an explicit interpretation of each coefficient
\(~\)
Lecture 7: Fitting Over and Under
\(~\)
Ockham’s razor: the simple solution is often the best. This lecture will focus on why this is.
We need to design efficient estimators: the struggle against data. Causal inference doesn't help us much here.
Problems with prediction
To aid our understanding let's use a hominin dataset. We are interested in the effect of body mass on brain volume.
Code
# create the data and print the tibble
(
  hominin_data <- tibble(
    species = c("afarensis", "africanus", "habilis", "boisei", "rudolfensis", "ergaster", "sapiens"),
    brain   = c(438, 452, 612, 521, 752, 871, 1350),
    mass    = c(37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5))
)
# standardise. However, rescale the outcome, brain volume, so that the largest observed value is 1. Why not standardize brain volume as well? Because we want to preserve zero as a reference point: No brain at all. You can’t have negative brain. I don’t think.hominin_data <- hominin_data %>%mutate(mass_std = (mass -mean(mass)) /sd(mass),brain_std = brain /max(brain))# plot the relationship between brain volume and weighthominin_data %>%ggplot(aes(x = mass, y = brain, label = species)) +geom_point(size =5, colour =met.brewer("Cassatt1")[2]) +geom_text_repel(size =3, colour ="black", family ="Courier", seed =438, box.padding =1) +labs(subtitle ="Average brain volume by body\nmass for six hominin species",x ="Body mass (kg)",y ="Brain volume (cc)") +xlim(30, 65) +theme_classic() +theme(text =element_text(family ="Courier"),panel.background =element_rect(fill =alpha(met.brewer("Cassatt1", 8)[4], 1/4)))
What can we do with statistical models:
Find a function that describes these points (fitting)
What function explains these points (causal inference)
What happens when we change a point’s mass (intervention)
What is the next observation from the same process (forecasting or prediction)
We will focus on point 4 in this lecture.
\(~\)
Cross Validation
\(~\)
For simple models, adding parameters improves the fit within sample but may reduce accuracy out of sample
There is a trade off between flexibility and overfitting
We can test how effective our linear regression is at prediction using Leave-one-out-cross-validation
The steps:
Drop one point
Fit line to remaining points
Predict dropped point
Repeat (1) with next point
Score based on the error between actual point and predicted point of the regression
Code
hominin_brain_mass_model <-brm(data = hominin_data, family = gaussian, brain_std ~1+ mass_std,prior =c(prior(normal(0.5, 1), class = Intercept),prior(normal(0, 10), class = b),prior(lognormal(0, 1), class = sigma)),iter =2000, warmup =1000, chains =4, cores =4,seed =7,file ="fits/hominin_brain_mass_model")
Bayesian Cross-Validation
As usual we use distributions in bayes rather than points. The predictive distribution is uncertain, we don’t get a specific point estimate for the prediction. A cross-validation score is a bit harder to get over a distribution, but luckily we can follow the log pointwise predictive density equation. I won’t display it here because we can depend on the software. Just know that it is referred to as \(lppd_{CV}\). Essentially this goes point by point like above, finds the posterior distribution, then averages this distribution over the number of samples (I think).
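As a sketch, the in-sample lppd can be computed by hand from the pointwise log-likelihood matrix that brms returns for the hominin model fitted above, while loo() gives the cross-validated version. (This naive version exponentiates directly; loo uses a numerically stabler log-sum-exp.)
Code
# pointwise log-likelihood: one row per posterior draw, one column per observation
ll <- log_lik(hominin_brain_mass_model)

# in-sample lppd: average the likelihood over draws for each point, then log and sum
lppd <- sum(log(colMeans(exp(ll))))
lppd

# the cross-validated (PSIS-LOO) version
loo(hominin_brain_mass_model)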
Some definitions:
In sample score: summed deviance of points from the full sample regression
Out of sample score: summed deviance of points from each out of sample regression
The in-sample score improves with model complexity, but the out-of-sample score gets worse (larger)! For the 4th-order polynomial model the in-sample fit is insanely good, but the fit varies wildly once we remove a point. The take home is that we can build really complex models that fit the observed data really well but have little to no predictive capacity for new data points. This is the basic trade-off inherent to modelling: overfitting leads to bad predictions.
But what if we have more data?
With more data we find that there is an intermediate model complexity (the 2nd-degree model in this case) that minimises both within- and out-of-sample deviance; that is, it is the best at describing these data. However, this doesn't mean that it explains them; it is just best at pure prediction in the absence of intervention. If we want to predict the consequences of an intervention, we need to understand the causal structure.
From McElreath
The overfit polynomial models fit the data extremely well, but they suffer for this within-sample accuracy by making nonsensical out-of-sample predictions. In contrast, underfitting produces models that are inaccurate both within and out of sample. They learn too little, failing to recover regular features of the sample. (p. 201)
\(~\)
To explore the distinctions between overfitting and underfitting, we’ll need to refit the models above several times after serially dropping one of the rows in the data. You can filter() by row_number() to drop rows in a tidyverse kind of way. For example, we can drop the second row of hominin_data like this.
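The example that sentence refers to would look something like this:
Code
hominin_data %>%
  filter(row_number() != 2)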
“Notice that the straight lines hardly vary, while the curves fly about wildly. This is a general contrast between underfit and overfit models: sensitivity to the exact composition of the sample used to fit the model” (p. 201).
Regularisation
\(~\)
We can design models that are better at prediction in their structure.
We want to identify regular features in the data, this leads to good predictions
We can minimise over fitting with:
Priors:
Sceptical priors have tighter variance, and reduce the flexibility of a model
Often tighter than we think they need to be
These improve predictions
Essentially, tight priors stop the model overreacting to nonsense within the sample
Choosing the width of a prior
For causal inference, use science, then go a touch tighter.
For pure prediction, we can use cross-validation to tune the prior.
Many tasks are a mix of both
Prediction penalty
For N points, cross-validation requires fitting N models. This quickly becomes computationally expensive, when your dataset gets larger.
Good news! We can compute the prediction penalty from a single model fit. There are two ways:
Importance sampling (PSIS)
Information criteria (WAIC or LOO)
\(~\)
Predictive criteria should not be used to choose the causal model structure
Fungus status is actually a better predictor of plant growth than anti-fungal treatment. Makes sense, the treatment is essentially an imperfect proxy for fungal status. For prediction this is better, but it’s terrible for determining the effect of the treatment. See that your preferred model depends entirely upon your aim.
Importance sampling: use the single posterior distribution fit to all N points to approximate each of the N leave-one-out posteriors (each based on N - 1 points).
Key idea: Point with low probability has a strong influence on posterior distribution. Highly probable points on the other hand do not individually affect the posterior distribution much, because there are other points in close proximity conveying very similar information.
Can use pointwise probabilities to re-weight samples from the posterior.
BUT, importance sampling tends to be unreliable.
The best current one: Pareto-smoothed importance sampling
More stable (lower variance)
Useful diagnostics (it will tell you when it’s bad aka has high variance)
Identifies important (high leverage) points (outliers)
\(~\)
Information criteria
\(~\)
Akaike information criteria: AIC
The first form. For flat priors and large samples:
\(AIC = (-2) * lppd + 2k\)
Where -2 is a scaling term, lppd is the log pointwise predictive density and 2k is a penalty term for complexity, where k is the number of parameters.
The widely applicable information criterion (WAIC) is a more complex formula, but it has a similar structure to AIC and does not require flat priors:
\(WAIC = -2\big(lppd - \sum_{i} var_{\Theta} \, log \, p(y_{i}|\Theta)\big)\)
where \(\sum_{i} var_{\Theta} \, log \, p(y_{i}|\Theta)\) is the penalty term now (sometimes called the effective number of parameters).
\(WAIC\) provides a very similar result to PSIS, but without the diagnostics.
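In brms both criteria come from a single fit. A minimal sketch with the hominin model from above (loo_compare() can then rank several fits on either criterion):
Code
hominin_brain_mass_model <- add_criterion(hominin_brain_mass_model, criterion = c("waic", "loo"))
hominin_brain_mass_model$criteria$waic
hominin_brain_mass_model$criteria$loo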
\(~\)
Outliers and robust regression
\(~\)
Some points are more important for the posterior distribution than others.
Outliers are observations in the tails of predictive distributions
If a model has outliers, this indicates the model doesn’t expect enough variation. This can lead to poor, damaging predictions.
Say we want to predict hurricane strength. We really don’t want to disregard extreme hurricanes, as this would be disastrous.
Don't drop outliers; it's the model's fault, not the data's. But how do we deal with them?
It’s the model that’s wrong, not the data
First, quantify influence of each point - we can do this with cross validation
Second, use a mixture model with fatter tails i.e. the student t distribution.
\(~\)
Robust regression
We’re back at the divorce rate example, where Idaho and Maine are outliers. Idaho has the lowest median marriage age and a very low divorce rate, while Maine has a very high divorce rate, but an average age of marriage.
Code
b5.3<-brm(data = Waffle_data, family = gaussian, d ~1+ m + a,prior =c(prior(normal(0, 0.2), class = Intercept),prior(normal(0, 0.5), class = b),prior(exponential(1), class = sigma)),iter =2000, warmup =1000, chains =4, cores =4,seed =5,file ="fits/b05.03")b5.3<-add_criterion(b5.3, criterion ="loo")b5.3<-add_criterion(b5.3, "waic", file ="fits/b05.03")Waffle_data %>%mutate(outlier =if_else(Location %in%c("Idaho", "Maine"), "YES", "NO")) %>%ggplot(aes(x = a, y = d, colour = outlier)) +geom_point(shape =1.5, stroke =2, size =3) +scale_colour_manual(values =c(met.brewer("Cassatt1")[2], met.brewer("Cassatt1")[6])) +geom_text_repel(aes(label = Location),data = Waffle_data %>%filter(Location ==c("Idaho", "Maine")), size =5, colour ="black", family ="Courier", seed =438, box.padding =0.5) +labs(x ="Age at marriage (std)",y ="Divorce rate (std)") +theme_classic() +theme(legend.position ="none")
Using PSIS we can find which data points are outliers
Code
loo(b5.3)
Computed from 4000 by 50 log-likelihood matrix
Estimate SE
elpd_loo -63.8 6.4
p_loo 4.7 1.9
looic 127.7 12.8
------
Monte Carlo SE of elpd_loo is 0.1.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 49 98.0% 730
(0.5, 0.7] (ok) 1 2.0% 187
(0.7, 1] (bad) 0 0.0% <NA>
(1, Inf) (very bad) 0 0.0% <NA>
All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.
Ok so we have one point that is having a large effect on the posterior. Which is it?
Code
loo(b5.3) %>%pareto_k_ids(threshold =0.5)
[1] 13
Point 13, this means row 13 in the tibble. Lets find out which location that corresponds to.
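A quick way to check (a sketch, assuming the Location column used in the plot above):
Code
Waffle_data %>%
  slice(13) %>%
  select(Location)
Following the lecture's advice, a robust (Student-t) version of the model can then be fitted with a fat-tailed likelihood. The model name, the reuse of b5.3's priors, and the bf() trick for fixing \(\nu\) at 2 are my assumptions rather than code from these notes:
Code
b5.3t <- brm(
  data = Waffle_data, family = student,
  bf(d ~ 1 + m + a, nu = 2), # Student-t likelihood with nu fixed at 2
  prior = c(prior(normal(0, 0.2), class = Intercept),
            prior(normal(0, 0.5), class = b),
            prior(exponential(1), class = sigma)),
  iter = 2000, warmup = 1000, chains = 4, cores = 4,
  seed = 5,
  file = "fits/b05.03t")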
\(~\)
Lecture 8: Markov chain Monte Carlo
\(~\)
There is more than one way to compute the posterior:
Quadratic approximation (Laplace approximation; use McElreath's quap function)
Markov chain Monte Carlo (computationally intensive, but more capable than anything else)
\(~\)
King Markov
\(~\)
King Markov rules over an archipelago with 10 islands. Each island has a different population size and the citizens yearn for him to visit. He splits his time between the islands like so:
He flips a coin to choose an island on the left or right of his current location
He then finds the population of the chosen island, or proposal island \(p_{2}\)
He then finds the population of his current island, \(p_{1}\)
He moves to the proposal island, with probability \(\frac{p_{2}}{p_{1}}\)
After a week he repeats from (1)
This procedure ensures visiting each island in proportion to its population, in the long run.
Knowing this, we can simulate and plot his journey:
Code
set.seed(1)

num_weeks <- 1e5
positions <- rep(0, num_weeks)
current   <- 10 # we start at island 10

for (i in 1:num_weeks) {
  # record current position
  positions[i] <- current
  # flip coin to generate proposal. sample() 'flips the coin' so we either get -1 or 1; size specifies the number of coin flips
  proposal <- current + sample(c(-1, 1), size = 1)
  # now make sure he loops around the archipelago
  if (proposal < 1) proposal <- 10
  if (proposal > 10) proposal <- 1
  # does he move? if prob_move is > 1 he always moves to the proposal
  prob_move <- proposal / current
  current <- ifelse(runif(1) < prob_move, proposal, current)
}

# positions is now a numeric vector that has been populated by the simulation
tibble(week = 1:1e5,
       island = positions) %>%
  ggplot(aes(x = week, y = island)) +
  geom_point(shape = 1, color = "salmon") +
  scale_x_continuous(breaks = seq(from = 0, to = 100, by = 20)) +
  scale_y_continuous(breaks = seq(from = 0, to = 10, by = 2)) +
  coord_cartesian(xlim = c(0, 100)) +
  labs(title = "Behold the Metropolis algorithm in action!",
       subtitle = "The dots show the king's path over the first 100 weeks.",
       x = "Week",
       y = "Island visited") +
  theme_bw() +
  theme(panel.grid = element_blank())
The number of weeks spent on each island is shown below, where island 10 has the largest population and island 1 the smallest. Note we have simulated 1e5 weeks to show the trend in the long run.
Code
tibble(week =1:1e5,island = positions) %>%mutate(island =factor(island)) %>%ggplot(aes(x = island)) +geom_bar(fill ="salmon") +scale_y_continuous("number of weeks", expand =expansion(mult =c(0, 0.05))) +labs(title ="Old Metropolis shines in the long run.",subtitle ="Sure enough, the time the king spent on each island\nwas proportional to its population size.") +theme_bw() +theme(panel.grid =element_blank())
This is an example of Markov chain Monte Carlo
This is its simplest form: the metropolis algorithm
Islands = parameter values
Population size = posterior probability (how many ways can it explain the data)
Visit each parameter value in proportion to its posterior probability
We just need to find the posterior probability for two proposals at a time to build the distribution
\(~\)
Markov chains
\(~\)
Chain: a sequence of draws from a distribution
Markov chain: history doesn't matter; where you go next depends only on where you are now, i.e. how the chain moves through parameter space
Monte Carlo: a numerical algorithm that uses randomisation to perform a calculation
- The King Markov example above follows the Metropolis algorithm.
- The Rosenbluth (Metropolis) algorithm was the first used in MCMC, introduced in 1953 (> 47,000 citations)
- We will use the Hamiltonian algorithm
The Rosenbluth algorithm
Moves are proposed, some are rejected and we progressively build a posterior by blindly stumbling around.
Think of parameter space like a bowl, it’s hard to climb the sides to unlikely posterior probabilities
The step size matters. That’s how far each proposal is from the current location.
A large step size is wasteful because many proposals are rejected, while a small step size is also poor as we don’t explore a large parameter space.
Code
# bowl simulation# this requires a lot of coding but we'll get there# first we need to simulate a bivariate distribution with a strong negative correlation# mean vectormu <-c(0, 0)# variance/covariance matrixsd_a1 <-0.22sd_a2 <-0.22rho <--.9Sigma <-matrix(data =c(sd_a1^2, rho * sd_a1 * sd_a2, rho * sd_a1 * sd_a2, sd_a2^2),nrow =2)# sample from the distribution with the `mvtnorm::rmvnorm()` functionset.seed(9)my_samples <- mvtnorm::rmvnorm(n =1000, mean = mu, sigma = Sigma)# We won’t actually be using the values from this simulation. Instead, we can evaluate the density function for this distribution using the mvtnorm::dmvnorm() function. But even before that, we’ll want to create a grid of values for the contour lines in Figure 9.3. Here we do so with a custom function called x_y_grid()# define the functionx_y_grid <-function(x_start =-1.6,x_stop =1.6,x_length =100,y_start =-1.6,y_stop =1.6,y_length =100) { x_domain <-seq(from = x_start, to = x_stop, length.out = x_length) y_domain <-seq(from = y_start, to = y_stop, length.out = y_length) x_y_grid_tibble <- tidyr::expand_grid(a1 = x_domain, a2 = y_domain)return(x_y_grid_tibble)}# simulatecontour_plot_dat <-x_y_grid()# now compute density values for each combination of a1 and a2contour_plot_dat <- contour_plot_dat %>%mutate(d = mvtnorm::dmvnorm(as.matrix(contour_plot_dat), mean = mu, sigma = Sigma))# now we can create our base contour plotcontour_plot <- contour_plot_dat %>%ggplot() +geom_contour(aes(x = a1, y = a2, z = d), size =1/8, color ="#FFB300",breaks =9^(-(10*1:25))) +scale_x_continuous(expand =c(0, 0)) +scale_y_continuous(expand =c(0, 0)) +theme_bw() +theme(panel.grid =element_blank(),panel.background =element_rect(fill ="#FFF9C4"))# Now define the metropolis/rosenbluth algorithmmetropolis <-function(num_proposals =50,step_size =0.1,starting_point =c(-1, 1)) {# Initialize vectors where we will keep track of relevant candidate_x_history <-rep(-Inf, num_proposals) candidate_y_history <-rep(-Inf, num_proposals) did_move_history <-rep(FALSE, num_proposals)# Prepare to begin the algorithm... current_point <- starting_pointfor(i in1:num_proposals) {# "Proposals are generated by adding random Gaussian noise# to each parameter" noise <-rnorm(n =2, mean =0, sd = step_size) candidate_point <- current_point + noise# store coordinates of the proposal point candidate_x_history[i] <- candidate_point[1] candidate_y_history[i] <- candidate_point[2]# evaluate the density of our posterior at the proposal point candidate_prob <- mvtnorm::dmvnorm(candidate_point, mean = mu, sigma = Sigma)# evaluate the density of our posterior at the current point current_prob <- mvtnorm::dmvnorm(current_point, mean = mu, sigma = Sigma)# Decide whether or not we should move to the candidate point acceptance_ratio <- candidate_prob / current_prob should_move <-ifelse(runif(n =1) < acceptance_ratio, TRUE, FALSE)# Keep track of the decision did_move_history[i] <- should_move# Move if necessaryif(should_move) { current_point <- candidate_point } }# once the loop is complete, store the relevant results in a tibble results <- tibble::tibble(candidate_x = candidate_x_history,candidate_y = candidate_y_history,accept = did_move_history )# compute the "acceptance rate" by dividing the total number of "moves"# by the total number of proposals number_of_moves <- results %>% dplyr::pull(accept) %>%sum(.) 
acceptance_rate <- number_of_moves/num_proposalsreturn(list(results = results, acceptance_rate = acceptance_rate))}# run the algorithm with step size = 0.1 to create the data for the left figureset.seed(9)round_1 <-metropolis(num_proposals =50,step_size =0.1,starting_point =c(-1,1))# create the figurep1 <- contour_plot +geom_point(data = round_1$results,aes(x = candidate_x, y = candidate_y, color = accept, shape = accept)) +labs(subtitle =str_c("step size 0.1,\naccept rate ", round_1$acceptance_rate),x ="a1",y ="a2") +scale_shape_manual(values =c(21, 19)) +scale_color_manual(values =c("black", "#8D6E63"))# now run the algorithm with step size 0.25# simulateset.seed(9)round_2 <-metropolis(num_proposals =50,step_size =0.25,starting_point =c(-1,1))# plotp2 <- contour_plot +geom_point(data = round_2$results,mapping =aes(x = candidate_x, y = candidate_y, color = accept, shape = accept)) +scale_shape_manual(values =c(21, 19)) +scale_color_manual(values =c("black", "#8D6E63")) +scale_y_continuous(NULL, breaks =NULL, expand =c(0, 0)) +labs(subtitle =str_c("step size 0.25,\naccept rate ", round_2$acceptance_rate),x ="a1")# combine(p1 + p2) +plot_annotation(title ="Metropolis chains under high correlation") +plot_layout(guides ="collect")
\(~\)
Hamiltonian Monte Carlo
\(~\)
This is like letting a skateboard roll in a bowl and tracking where it stops.
Then from that point we flick the board through the bowl again and repeat.
Once again step size matters
Long steps are susceptible to U-turns which makes sampling inefficient
When posteriors get complicated, algorithms like the Rosenbluth struggle to find proposals that aren’t rejected, meaning they can’t move through parameter space. They get stuck. Hamiltonian Monte Carlo does not share this struggle.
Ok, but our computer needs to know the curvature of the parameter space in order to effectively ‘flick the skateboard’. More formally, we need gradients. How can we program this?
We certainly don’t want to write them for every point in the parameter space
Auto-diff (automatic differentiation) saves the day
How this works is a bit beyond me, but the computer knows how (Stan <3)
\(~\)
Analysing the data: Divorce dataset
\(~\)
Note that McElreath introduces the Wines2012 dataset in his lecture, but I’m lazy so I use a model we’ve already fit.
We have already been using Hamiltonian Monte Carlo to fit our brms models, so nothing here is really new for us. In fact, we can just reload our previously fitted marriage model from Lecture 5.
Family: gaussian
Links: mu = identity; sigma = identity
Formula: d ~ 1 + a
Data: Waffle_data (Number of observations: 50)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.00 0.10 -0.20 0.20 1.00 3537 2678
a -0.57 0.12 -0.79 -0.34 1.00 3801 2753
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.82 0.09 0.68 1.01 1.00 2735 2675
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Note in our summary output we have a number of diagnostics.
\(~\)
Diagnostics
\(~\)
Trace plots
Code
post <- posterior_samples(Marriage_age_model, add_chain = T)

mcmc_trace(post[, c(1:3, 8)], # keep the chain column so mcmc_trace can facet/colour by chain
           facet_args = list(ncol = 1),
           size = .15) +
  labs(title = "My custom trace plots") +
  theme_minimal() +
  theme(panel.grid = element_blank())
These plot a set number of samples from the Markov chain for each parameter
Warm-up:
Early samples where Stan is figuring out the step size and how long each trajectory should be.
Warmup samples are not representative of the target posterior distribution, no matter how long warmup continues. They are not burning in, but rather more like cycling the motor to heat things up and get ready for sampling. When real sampling begins, the samples will be immediately from the target distribution, assuming adaptation was successful.
We don’t use it for estimation, and it is not shown in our plot above.
Good trace plots look stationary. That means they are centered around an area of high probability i.e. the lower part of the half-pipe
It is important to run more than 1 chain to check for convergence.
Convergence: each chain explores the right distribution and every chain explores the same distribution
Start them at different points in parameter space and they should converge on the same posterior
4 chains is generally good
We can also plot the chains with plot
Code
plot(Marriage_age_model)
\(~\)
Trace rank plots (Trank plots)
x axis is sequential samples, like a trace plot
y axis shows rank orders across the chains
Quick visual way to check if one of the chains is consistently above or below all of the others
Ideally the chain on top should be constantly switching
The start and end of each chain should explore the same region, i.e. consistency across samples
Independent chains should explore the same region, i.e. consistency across chains
R-hat: A ratio of variances
As total variance among all chains shrinks to average variance within chains, R-hat approaches 1
All variance should ideally be within chains not between them if they converge on the same posterior distribution
1 is good, but it alone doesn't ensure your sampling is good. We can pull the R-hat values straight from the fit, as sketched below.
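A minimal sketch, using the marriage-age model loaded above:
Code
# R-hat for every parameter in the fit
rhat(Marriage_age_model)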
\(~\)
n_eff
The effective number of samples
Put another way “How long would the chain be, if each sample was independent of the one before it?”
Because samples are autocorrelated, that is those in close sequential order are often exploring a similar parameter space, they convey less information than would a truly independent sampling approach.
The greater this autocorrelation, the fewer effective samples you have.
This doesn’t mean it’s bad, it’s just inefficient and you’ll have to sample for longer.
n_eff will nearly always be smaller than the specified number of samples because of autocorrelation.
brms returns two values: Bulk_ESS and Tail_ESS
Bulk_ESS = n_eff
Tail_ESS = n_eff at the extremes of intervals (see the sketch below)
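A sketch of pulling the ratio of effective samples to total draws for each parameter:
Code
neff_ratio(Marriage_age_model)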
Code
post %>%mcmc_acf(pars =vars(b_Intercept:sigma),lags =5) +theme(panel.grid =element_blank())
This is a healthy autocorrelation plot
\(~\)
Divergent transitions
This is a kind of rejected proposal
If the posterior distribution has a very steep gradient it can lead to many rejected proposals, even for HMC
They aren't bad in isolation; the Rosenbluth algorithm has loads
Lots of divergent transitions mean that the algorithm is exploring the posterior distribution very poorly and that it might be stuck in one spot and missing some of the parameter space
\(~\)
Bad chains
\(~\)
Lets fit an awful model. Give it two data points and stupidly flat priors.
Code
b9.2<-brm(data =list(y =c(-1, 1)), family = gaussian, y ~1,prior =c(prior(normal(0, 1000), class = Intercept),prior(exponential(0.0001), class = sigma)),iter =2000, warmup =1000, chains =3,seed =9,file ="fits/b09.02")b9.2
Family: gaussian
Links: mu = identity; sigma = identity
Formula: y ~ 1
Data: list(y = c(-1, 1)) (Number of observations: 2)
Draws: 3 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 3000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.37 310.88 -691.22 683.09 1.08 472 385
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 515.45 1217.56 13.14 3568.10 1.05 44 78
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
The trace plot
Code
post <-posterior_samples(b9.2, add_chain = T)p1 <- post %>%mcmc_trace(pars =vars(b_Intercept:sigma),size = .25)p2 <- post %>%mcmc_rank_overlay(pars =vars(b_Intercept:sigma))( (p1 / p2) &scale_color_manual(values =c("#7CB342", "#FDD835", "#FB8C00", "#F4511E")) &theme(legend.position ="none")) +plot_annotation(subtitle ="These chains are not healthy")
Very bad.
The problem with this model is the priors. Can we fix it with better priors?
Code
b9.3<-brm(data =list(y =c(-1, 1)), family = gaussian, y ~1,prior =c(prior(normal(1, 10), class = Intercept),prior(exponential(1), class = sigma)),iter =2000, warmup =1000, chains =3,seed =9,file ="fits/b09.03")b9.3
Family: gaussian
Links: mu = identity; sigma = identity
Formula: y ~ 1
Data: list(y = c(-1, 1)) (Number of observations: 2)
Draws: 3 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 3000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.03 1.13 -2.21 2.50 1.00 945 624
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.54 0.84 0.60 3.71 1.00 909 1245
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
The trace plot
Code
post <-posterior_samples(b9.3, add_chain = T)p1 <- post %>%mcmc_trace(pars =vars(b_Intercept:sigma),size = .25)p2 <- post %>%mcmc_rank_overlay(pars =vars(b_Intercept:sigma))( (p1 / p2) &scale_color_manual(values =c("#7CB342", "#FDD835", "#FB8C00", "#F4511E")) &theme(legend.position ="none")) +plot_annotation(subtitle ="Weakly informative priors to the rescue")
Note that the posterior for sigma is skewed towards larger values (a long upper tail). This is normal for a scale parameter.
\(~\)
Lecture 9: Modelling events
\(~\)
Modelling events
Events: discrete, unordered outcomes
Observations are counts
Unknowns are probabilities, or odds (transformed probabilities)
Odds are the probability that the event happens divided by the probability that it does not
Everything interacts always everywhere!
A beast known as ‘log-odds’ - the logarithm of the odds.
\(~\)
UC Berkeley admissions
\(~\)
4526 graduate school applications for 1973 UC Berkeley. The data is stratified by department and gender.
# A tibble: 12 × 5
dept applicant.gender admit reject applications
<fct> <fct> <int> <int> <int>
1 A male 512 313 825
2 A female 89 19 108
3 B male 353 207 560
4 B female 17 8 25
5 C male 120 205 325
6 C female 202 391 593
7 D male 138 279 417
8 D female 131 244 375
9 E male 53 138 191
10 E female 94 299 393
11 F male 22 351 373
12 F female 24 317 341
Right, let’s start with the OWL infrastructure.
Steps 1) and 2) together: find the estimand and describe the scientific model
D = Department, e.g. science, arts, etc. Some departments have highly competitive admission, whereas in others it is easier to achieve admission.
A = Admission
Department is a mediator. This is a very common causal structure.
What do we mean by the causal effect of gender?
How does the referee perceive someone based on gender. This can be blinded, at least to some extent.
But gender also affects where someone applies…
Which of the paths in this DAG are discrimination?
The direct path is status-based or taste-based (refs are biased for or against individuals of particular genders)
Indirect path is structural discrimination e.g. unequal access to higher education by greater funding for a gender biased program
Total: both paths: this is what people experience
Build the model with some simulated data
Code
# generative model, basic mediator scenario
N <- 1000 # number of applications

G <- sample(1:2, size = N, replace = T) # close to even gender distribution, like flipping a coin N times

# gender 1 individuals tend to apply to department 1, gender 2 applies to department 2 more often
D <- rbernoulli(N, if_else(G == 1, 0.3, 0.8)) + 1

# matrix of acceptance rates: rows are departments, columns are genders. Note we code no discrimination
accept_rate <- matrix(c(0.1, 0.3, 0.1, 0.3), nrow = 2)

# simulate acceptance
A <- rethinking::rbern(N, accept_rate[D, G])

table(G, D)
D
G 1 2
1 349 136
2 109 406
The Table illustrates that Gender 1 applies to department 1 more often than does gender 2.
Code
table(G, A)
A
G 0 1
1 401 84
2 385 130
This Table shows that Gender 2 gets accepted more often than Gender 1. This is because it’s easier to get into department 2. This is the mediating path.
We get the same kind of pattern whether the cause is direct discrimination or the indirect path through department! This means that we cannot tell from these tables which causal path is driving the effect. So we must turn to the back-door criterion.
\(~\)
Generalised Linear Models
\(~\)
A linear model is part of a larger family of models called generalised linear models. Generalised linear models have an expected value that is some function of an additive combination of parameters (linear models are this but without the function).
The linear model can take any real value, but if we have a bernoulli data generating process we need to limit these values within a range of 0 and 1.
Distributions:
The relative number of ways to observe data, given assumptions
Distributions are matched to constraints on the observed data e.g. counts can’t be negative
Link functions match distributions (brms will give you these)
The notation:
\(Y_{i} = Bernoulli(p_i)\)
\(f(p_i) = \alpha + \beta_X X_i + \beta_Z Z_i\)
\(Y_i\) is a 0/1 outcome
\(p_i\) is the probability of the event
\(\alpha + \beta_X X_i + \beta_Z Z_i\) is identical to the linear model and can take any real value
\(f()\) maps the probability scale to the linear model scale
\(~\)
Link functions
\(f\) is the link function.
Links parameters of distribution to linear model
Code
tibble(x =seq(from =-1, to =3, by = .01)) %>%mutate(probability = .35+ x * .5) %>%ggplot(aes(x = x, y = probability)) +geom_rect(xmin =-1, xmax =3,ymin =0, ymax =1,fill =met.brewer("Hiroshige")[6]) +geom_hline(yintercept =0:1, linetype =2, color =met.brewer("Hiroshige")[5]) +geom_line(aes(linetype = probability >1, color = probability >1),size =1) +geom_segment(x =1.3, xend =3,y =1, yend =1,size =2/3, color =met.brewer("Hiroshige")[3]) +annotate(geom ="text",x =1.28, y =1.04, hjust =1,label ="This is why we need link functions",color =met.brewer("Hiroshige")[4], size =2.6) +scale_color_manual(values =c(met.brewer("Hiroshige")[3:4])) +scale_y_continuous(breaks =c(0, .5, 1)) +coord_cartesian(xlim =c(0, 2),ylim =c(0, 1.2)) +theme(legend.position ="none",panel.background =element_rect(fill =met.brewer("Hiroshige")[7]),panel.grid =element_blank())
\(f^{-1}\) is the inverse link function. The purpose of this is to undo the effect of the link function.
What the computer is actually doing is this:
\(p_i = f^{-1}(\alpha + \beta_X X_i + \beta_Z Z_i)\)
It undoes the transformation
\(~\)
The logit link
Bernoulli/binomial models often use the logit link.
This is the log-odds of the event happening.
\(logit(p_i) = log \frac{p_i}{1 - p_i}\)
\(\frac{p_i}{1 - p_i}\) is the odds
Why do we take the log of the odds? Because its inverse is the right transform to map the whole real number line onto the probability interval 0-1 without causing problems.
This wording helps explain why we use the inverse link function.
The inverse-logit (logistic) function makes S-shaped curves, because the probabilities must bend and flatten as they approach 0 and 1
Code
# first, we'll make data for the horizontal linesalpha <-0beta <-4lines <-tibble(x =seq(from =-1, to =1, by = .25)) %>%mutate(`log-odds`= alpha + x * beta,probability =exp(alpha + x * beta) / (1+exp(alpha + x * beta)))# now we're ready to make the primary databeta <-2d <-tibble(x =seq(from =-1.5, to =1.5, length.out =50)) %>%mutate(`log-odds`= alpha + x * beta,probability =exp(alpha + x * beta) / (1+exp(alpha + x * beta))) # now we make the individual plotsp1 <- d %>%ggplot(aes(x = x, y =`log-odds`)) +geom_hline(data = lines,aes(yintercept =`log-odds`),color =met.brewer("Hiroshige")[2]) +geom_line(size =1.5, color =met.brewer("Hiroshige")[7]) +coord_cartesian(xlim =c(-1, 1)) +theme(panel.background =element_rect(fill =met.brewer("Hiroshige")[5]),panel.grid =element_blank())p2 <- d %>%ggplot(aes(x = x, y = probability)) +geom_hline(data = lines,aes(yintercept = probability),color =met.brewer("Hiroshige")[2]) +geom_line(size =1.5, color =met.brewer("Hiroshige")[7]) +coord_cartesian(xlim =c(-1, 1)) +theme(panel.background =element_rect(fill =met.brewer("Hiroshige")[5]),panel.grid =element_blank())# finally, we're ready to mash the plots together and behold their nerdy glory(p1 | p2) +plot_annotation(subtitle ="The logit link transforms a linear model (left) into a probability (right).")
How to read the log-odds scale:
logit(p) = 0, p = 0.5 (happens half the time)
logit(p) = 4: the event nearly always happens (probability very close to 1)
logit(p) = -4: the event nearly never happens (probability very close to 0); see the quick check below
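A tiny sketch of those anchors using brms' inverse-logit helper:
Code
inv_logit_scaled(c(-4, 0, 4))
# roughly 0.018, 0.500, 0.982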
\(~\)
Setting priors:
We need to know what happens when we convert the prior distribution to the probability scale
For \(\alpha\), if we set wide priors we make the model assume that the event will either never or always happen
If we set a prior of \(\alpha \sim\) Normal(0, 1) we start to get some skepticism of extreme values, but not much! - See plot below
For \(\beta\): to avoid assuming very strong relationships we need something like \(\beta \sim\) Normal(0, 0.5)
\(~\)
Lets look at how bad even Normal(0, 2) for an intercept prior is:
Code
hist(rnorm(1000, 0, 2))
Code
data <-rnorm(1000, 0, 2) %>% as_tibble %>%mutate(prob_space =inv_logit_scaled(value)) p1 <- data %>%ggplot(aes(x = value)) +geom_histogram() +xlab("log-odds")p2 <- data %>%ggplot(aes(x = prob_space)) +geom_histogram() +xlab("probability")p1 + p2
\(~\)
Analysing UC Berkelely data
\(~\)
Look at the data again
Code
UCB_admissions
# A tibble: 12 × 5
dept applicant.gender admit reject applications
<fct> <fct> <int> <int> <int>
1 A male 512 313 825
2 A female 89 19 108
3 B male 353 207 560
4 B female 17 8 25
5 C male 120 205 325
6 C female 202 391 593
7 D male 138 279 417
8 D female 131 244 375
9 E male 53 138 191
10 E female 94 299 393
11 F male 22 351 373
12 F female 24 317 341
First we should talk about the difference between Bernoulli and Binomial regression.
They are interchangeable from an inference perspective
Bernoulli data is displayed as 0, 1 data, while binomial variables are aggregated counts of a bernoulli process.
Note that the data above is aggregated, so we should specify a binomial distribution in our upcoming model
First let’s get the total effect of gender on admission acceptance
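The fit that produces the summary below is not shown in these notes, so here is a sketch of the call that would reproduce it. The formula, family, data and number of chains come from the output, and the object name from the contrast code further down; the prior and seed are my assumptions (mirroring the later UC Berkeley models).
Code
UCB_admissions_model <- brm(
  data = UCB_admissions, family = binomial,
  admit | trials(applications) ~ 0 + applicant.gender,
  prior = prior(normal(0, 1), class = b), # assumed prior
  iter = 2000, warmup = 1000, cores = 3, chains = 3,
  seed = 11,
  file = "fits/UCB_admissions_model")
UCB_admissions_model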
Family: binomial
Links: mu = logit
Formula: admit | trials(applications) ~ 0 + applicant.gender
Data: UCB_admissions (Number of observations: 12)
Draws: 3 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 3000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
applicant.genderfemale -0.83 0.05 -0.93 -0.73 1.00 2481
applicant.gendermale -0.22 0.04 -0.29 -0.14 1.00 2387
Tail_ESS
applicant.genderfemale 2063
applicant.gendermale 1968
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Output tables suck when we have lots of parameters. Contrasts between treatment levels are far more informative.
Let’s find the difference in admissions between genders
Code
post <- as_draws_df(UCB_admissions_model) %>%
  mutate(f_mean = inv_logit_scaled(b_applicant.genderfemale),
         m_mean = inv_logit_scaled(b_applicant.gendermale),
         diff   = f_mean - m_mean)

# plot
post %>%
  ggplot(aes(x = diff)) +
  geom_vline(xintercept = 0, linetype = 2) +
  stat_halfeye(point_interval = median_qi, .width = .95,
               color = "black",
               fill = met.brewer("Hiroshige")[4]) +
  labs(x = "Mean effect of being female on application acceptance",
       y = NULL) +
  coord_cartesian(xlim = c(-0.25, 0)) +
  theme(axis.ticks.y = element_blank(),
        legend.position = "none") +
  theme_tidybayes()
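The next paragraph describes per-department (direct effect) contrasts, but the corresponding model is not in these notes. One way to sketch it is to give every gender-by-department combination its own intercept and then compute the female-minus-male contrast within each department from the posterior draws, exactly as above. The data wrangling, model name, prior and seed here are my assumptions.
Code
UCB_direct_data <- UCB_admissions %>%
  mutate(gender_dept = interaction(applicant.gender, dept))

UCB_admissions_model_direct <- brm(
  data = UCB_direct_data, family = binomial,
  admit | trials(applications) ~ 0 + gender_dept,
  prior = prior(normal(0, 1), class = b),
  iter = 2000, warmup = 1000, cores = 3, chains = 3,
  seed = 11,
  file = "fits/UCB_admissions_model_direct")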
There is one department where women are advantaged over men, but none where men are advantaged. The roughly 14% penalty found in the total effect has disappeared.
\(~\)
Survival analysis bonus
\(~\)
Survival analysis is a form of count model. Once again the parameters are about rates.
What to do with data that wasn’t counted? i.e. you have some data you were collecting that did not reach the end point e.g. a cat escaped before it was adopted, a fly line was undone by a mistake rather than an extinction event.
We can tackle this with censoring:
Left censoring: don’t know when time started
Right censoring: something cut observation off before the event occurred
Censoring matters because dropping the censored observations biases our estimates, e.g. imagine estimating PhD time to completion but leaving out all the drop-outs
Time to event (out-event) - most cats were adopted
Event either 1 (adopted) or 2 (something else - death, escape, censored - the cat is still not adopted at time of data download)
Cat colour is also known
Data distributions we can use:
Exponential:
From McElreath
“It is a fundamental distribution of distance and duration, kinds of measurements that represent displacement from some point of reference, either in time or space. If the probability of an event is constant in time or across space, then the distribution of events tends towards exponential. The exponential distribution has maximum entropy among all non-negative continuous distributions with the same average displacement.” (p. 314).
The rate of decay is constant e.g. like nuclear decay with half-lives
Log-link
\(f(y) = \lambda e^{-\lambda y}\)
Where y is a non-negative continuous variable and \(\lambda\) is called the rate.
Importantly, the mean of the exponential distribution is the inverse of the rate:
\(E[y] = \frac{1}{\lambda}\)
This is important because it is how brms parameterises exponential models.
How are the two event types distributed, i.e. adoption versus censoring?
Code
cat_data %>%
  mutate(censored = factor(censored)) %>%
  filter(days_to_event < 300) %>%
  ggplot(aes(x = days_to_event, y = censored)) +
  # let's just mark off the 50% intervals
  stat_halfeye(.width = .5, fill = met.brewer("Hokusai2")[2], height = 4) +
  scale_y_discrete(NULL, labels = c("censored == 0", "censored == 1")) +
  coord_cartesian(ylim = c(1.5, 5.1)) +
  theme(axis.ticks.y = element_blank()) +
  theme_tidybayes()
Do note there is a very long right tail that we’ve cut off for the sake of the plot. Anyway, the point of this plot is to show that the distribution for our primary variable, days_to_event, looks very different conditional on whether the data were censored. As McElreath covered in the lecture, we definitely don’t want to lose that information by excluding the censored cases from the analysis.
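The fit behind the summary below (b11.15, the object used in the plotting code further down) is not shown in these notes. A sketch of the call, with the formula, family, data and number of chains taken from the output; the prior and seed are my assumptions.
Code
b11.15 <- brm(
  data = cat_data, family = exponential,
  days_to_event | cens(censored) ~ 0 + black,
  prior = prior(normal(0, 1), class = b), # assumed prior
  iter = 2000, warmup = 1000, chains = 4, cores = 4,
  seed = 11,
  file = "fits/b11.15")
b11.15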
Family: exponential
Links: mu = log
Formula: days_to_event | cens(censored) ~ 0 + black
Data: cat_data (Number of observations: 22356)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
blackblack 4.05 0.03 4.00 4.10 1.00 3122 2128
blackother 3.88 0.01 3.86 3.90 1.00 2925 2466
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Since we modelled \(log \, \mu_{i}\), we need to transform our parameters back onto the days metric with the inverse link, \(\mu_{i} = exp(\alpha_{colour[i]})\); the corresponding exponential rate is \(\lambda_{i} = 1 / \mu_{i}\):
Code
# annotationtext <-tibble(color =c("black", "other"),days =c(40, 34),p =c(.55, .45),label =c("black cats", "other cats"),hjust =c(0, 1))# wranglef <-fixef(b11.15) %>%data.frame() %>%rownames_to_column() %>%mutate(color =str_remove(rowname, "black")) %>%# most important step. We predict over 100 days, assuming a constant rate of declineexpand(nesting(Estimate, Q2.5, Q97.5, color), days =0:100) %>%mutate(m =1-pexp(days, rate =1/exp(Estimate)),ll =1-pexp(days, rate =1/exp(Q2.5)),ul =1-pexp(days, rate =1/exp(Q97.5)))# plot!f %>%ggplot(aes(x = days)) +geom_hline(yintercept = .5, linetype =3, color =met.brewer("Hokusai2")[2]) +geom_ribbon(aes(ymin = ll, ymax = ul, fill = color),alpha =1/2) +geom_line(aes(y = m, color = color)) +geom_text(data = text,aes(y = p, label = label, hjust = hjust, color = color),family ="Times") +scale_fill_manual(values =met.brewer("Hokusai2")[c(4, 1)], breaks =NULL) +scale_color_manual(values =met.brewer("Hokusai2")[c(4, 1)], breaks =NULL) +scale_y_continuous("proportion remaining", breaks =c(0, .5, 1), limits =0:1) +xlab("days to adoption")
Another way to explore this model is to ask: About how many days would it take for half of the cats of a given colour to be adopted? We can do this with help from the qexp() function. For example:
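A sketch of that calculation from the fixed effects above:
Code
# posterior-mean expected days to adoption (1 / rate) for each colour
mu <- exp(fixef(b11.15)[, 1])

# median days to adoption implied by an exponential with that mean
qexp(p = 0.5, rate = 1 / mu)
# roughly 40 days for black cats versus roughly 34 for the others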
The model suggests it takes about six days longer for half of the black cats to be adopted.
\(~\)
Lecture 10: Counts and Confounds
\(~\)
Linear models: expected value is additive (linear) combination of parameters. What you see is what you get.
Generalised linear models: these are like tide prediction models. They have many gears that are difficult to interpret. More formally, the expected value is some function of an additive combination of parameters. If we leave this expected value on the function scale, it means very little to the common man. We must strive to show the ‘tide prediction’.
It is very important to note that uniform changes in the predictor do not lead to uniform changes in prediction. I.e. think of how values change on the log-odds scale.
All predictor variables interact - if one predictor pushes the prediction towards 0 or 1 then the effect of other predictors will be smaller.
\(~\)
UC Berkeley continued - Confounded admissions
\(~\)
Code
dag_coords <-tibble(name =c("G", "D", "A", "u"),x =c(1, 2, 3, 3.2),y =c(1, 2, 1, 2.2))dagify(A ~ G + D + u, D ~ G + u,coords = dag_coords) %>%gg_simple_dag()
What if \(u\), the student's ability, affects both department choice and acceptance rate? We do not have this variable in our dataset, but it may still be present in our scientific model.
Let’s simulate such a situation. Here are some general tips for simulation:
Simulate variables with no parents (they don’t depend on other variables)
Simulate the values that require parents
Code
set.seed(17)

N <- 2000 # number of applicants

# even gender distribution
G <- sample(1:2, size = N, replace = TRUE)

# sample ability, high (1) to average (0)
u <- rethinking::rbern(N, 0.1)

# gender 1 applicants of average ability apply to department 1, while those of
# high ability always apply to department 2; gender 2 mostly applies to department 2
D <- rethinking::rbern(N, if_else(G == 1, u * 1, 0.75)) + 1

# matrices of acceptance rates: rows are departments, columns are genders
accept_rate_u0 <- matrix(c(0.1, 0.1, 0.1, 0.3), nrow = 2) # average ability, with discrimination in department 2
accept_rate_u1 <- matrix(c(0.3, 0.3, 0.5, 0.5), nrow = 2) # high ability

# simulate acceptance
P <- sapply(1:N, function(i)
  if_else(u[i] == 0, accept_rate_u0[D[i], G[i]], accept_rate_u1[D[i], G[i]]))
A <- rethinking::rbern(N, P)

data_sim <- tibble(Accepted = A, Department = as.factor(D), Gender = as.factor(G))
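The fit that produces the summary below is missing from these notes. A sketch of the call: the formula, family, data and number of chains come from the output, while the object name, prior and seed are my assumptions.
Code
UCB_admissions_model_ability_total <- brm(
  data = data_sim, family = bernoulli,
  Accepted ~ 0 + Gender,
  prior = prior(normal(0, 1), class = b),
  iter = 2000, warmup = 1000, cores = 3, chains = 3,
  seed = 11,
  file = "fits/UCB_admissions_model_ability_total")
UCB_admissions_model_ability_total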
Family: bernoulli
Links: mu = logit
Formula: Accepted ~ 0 + Gender
Data: data_sim (Number of observations: 2000)
Draws: 3 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 3000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Gender1 -1.91 0.09 -2.09 -1.74 1.01 2500 1946
Gender2 -1.09 0.07 -1.23 -0.95 1.00 2746 1836
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
This model gets the total effect of gender correct. Gender 1 is disadvantaged overall.
The direct effect of gender
Code
UCB_admissions_model_ability_direct <-brm(data = data_sim, family = bernoulli, Accepted ~1+ Gender * Department,prior =c(prior(normal(0, 1), class = Intercept),prior(normal(0, 1), class = b)),iter =2000, warmup =1000, cores =3, chains =3,seed =11,file ="fits/UCB_admissions_model_ability_direct")UCB_admissions_model_ability_direct
Family: bernoulli
Links: mu = logit
Formula: Accepted ~ 1 + Gender * Department
Data: data_sim (Number of observations: 2000)
Draws: 3 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 3000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -2.12 0.11 -2.33 -1.92 1.00 2287 2096
Gender2 -0.28 0.24 -0.76 0.17 1.00 1786 1653
Department2 1.24 0.23 0.79 1.70 1.00 1881 1772
Gender2:Department2 0.35 0.32 -0.27 0.98 1.00 1529 1668
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Model 2 is confounded. To show this we should look at the contrasts on the probability scale - don’t look at the gears!
Because \(u\) is unobserved, stratifying by department opens a non-causal path (D is a collider between G and u), so we cannot estimate the direct effect of G, or the effect of D, without bias.
What’s happening here is that gender 1 individuals of high ability are willing to take on the risk of discrimination and apply to the discriminatory department. This creates a higher proportion of high-ability G1 applicants than G2 applicants in that department. High ability compensates for the discrimination and hides its effect.
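As a rough illustration of “look at the contrasts, not the gears”, here is one way to compute the gender 1 minus gender 2 contrast on the probability scale within each department from the fitted (confounded) model. This is a sketch added for illustration, not the original notebook’s code:
Code
# sketch: posterior gender contrasts (G1 - G2) on the probability scale,
# within each department, from the confounded direct-effect model
as_draws_df(UCB_admissions_model_ability_direct) %>%
  as_tibble() %>%
  transmute(
    `department 1` = plogis(b_Intercept) -
                     plogis(b_Intercept + b_Gender2),
    `department 2` = plogis(b_Intercept + b_Department2) -
                     plogis(b_Intercept + b_Gender2 + b_Department2 +
                              `b_Gender2:Department2`)
  ) %>%
  pivot_longer(everything(), names_to = "department", values_to = "contrast") %>%
  group_by(department) %>%
  mean_qi(contrast)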
\(~\)
Sensitivity analysis
\(~\)
What are the implications of what we don’t know?
Assume the confound exists, and model its consequences for different strengths/kinds of influence
How strong would the confound have to be to change our conclusions?
You can vary the assumed effect size of the confound and see what effect this has on the posterior
This lets you estimate an effect while controlling for an assumed confound - you just need to specify a plausible range of effect sizes for it
\(~\)
Fit a model with \(u\) added, constraining the effect of \(u\) to be positive with a prior, i.e. high ability can only increase the probability of acceptance (a rough simulation-based stand-in is sketched below).
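A cruder, simulation-based stand-in (not the latent-\(u\) model itself): re-simulate the data under different confound strengths and refit the direct-effect model each time. Everything below reuses objects from the simulation code above; the strengths themselves are made up:
Code
# rough sensitivity sketch: vary how much high ability (u = 1) boosts
# acceptance, regenerate the outcome, and refit the direct-effect model
confound_strength <- c(0, 0.2, 0.4)  # hypothetical boosts for u = 1

sensitivity_fits <- map(confound_strength, function(s) {
  accept_rate_u1_s <- accept_rate_u0 + s  # high ability adds s to every cell
  P_s <- sapply(1:N, function(i)
    if_else(u[i] == 0,
            accept_rate_u0[D[i], G[i]],
            accept_rate_u1_s[D[i], G[i]]))
  sim <- tibble(Accepted   = rethinking::rbern(N, P_s),
                Department = as.factor(D),
                Gender     = as.factor(G))
  # refit the same model on the new data and inspect how the contrasts move
  update(UCB_admissions_model_ability_direct, newdata = sim, cores = 3)
})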
\(~\)
Oceanic tools
\(~\)
Population size and contact with other populations increase innovation.
The more tools a society already has, the more likely tools are to be lost.
DAG time!
Code
dag_coords <- tibble(
  name = c("C", "P", "L", "T"),
  x    = c(1, 1, 2, 2),
  y    = c(2, 1, 2, 1)
)

dagify(
  C ~ L + P,
  P ~ L,
  T ~ C + P + L,
  coords = dag_coords
) %>%
  gg_simple_dag()
Our goal is to estimate the effect of population size on tools.
Let’s identify all of the paths:
P -> T - the front-door path: the direct effect we want to estimate
P -> C -> T - C is a competing cause of tool count, and a mediator of part of P’s effect
P -> C <- L -> T - C is a collider on this path, so it stays closed unless we stratify by C
P <- L -> T - L creates a fork; this is a back-door path we need to close (see the dagitty check after this list)
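As a quick sanity check on these paths, dagitty (which ggdag builds on) can list the adjustment sets for the total effect of P on T. This sketch is added here for illustration and assumes the dagitty package is installed:
Code
# rebuild the DAG and ask dagitty which variables must be adjusted for
# to estimate the total effect of P on T
tools_dag <- dagify(
  C ~ L + P,
  P ~ L,
  T ~ C + P + L,
  exposure = "P",
  outcome  = "T"
)

dagitty::adjustmentSets(tools_dag)
# should return { L }: close the back-door fork through L,
# and do NOT condition on the mediator C for the total effect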
\(~\)
Modelling tools - the Poisson
\(~\)
Tool count is not binomial, because there is no fixed maximum number of trials
We can instead use the Poisson distribution
The Poisson arises as the limit of a binomial with a very high maximum and a very low probability of each success
For example, in these data there are many possible technologies, but very few are realised in any one place
The conventional link function for the Poisson distribution is the log link
This link function enforces positivity, as needed for a count
Note that this creates exponential scaling, as the model written out below makes explicit
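For concreteness, this is just the standard Poisson GLM with a log link (generic notation, nothing specific to these notes):
\(T_i \sim \text{Poisson}(\lambda_i)\)
\(\log(\lambda_i) = \alpha + \beta x_i \quad \Rightarrow \quad \lambda_i = e^{\alpha + \beta x_i}\)
so a one-unit change in \(x\) multiplies the expected count by \(e^{\beta}\) rather than adding to it.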
Priors
If we continue our convention of using a Normal prior on the parameters, we should recognize those will be log-Normal distributed on the outcome scale. Why? Because we’re modelling the rate with the log link.
Let’s plot an \(\alpha\) prior of Normal(0, 10) and one of Normal(3, 0.5)
Code
tibble(x = c(3, 22),
       y = c(0.055, 0.04),
       meanlog = c(0, 3),
       sdlog = c(10, 0.5)) %>%
  expand(nesting(x, y, meanlog, sdlog),
         number = seq(from = 0, to = 100, length.out = 200)) %>%
  mutate(density = dlnorm(number, meanlog, sdlog),
         group = str_c("alpha%~%Normal(", meanlog, ", ", sdlog, ")")) %>%
  ggplot(aes(fill = group, color = group)) +
  geom_area(aes(x = number, y = density),
            alpha = 3/4, size = 0, position = "identity") +
  geom_text(data = . %>% group_by(group) %>% slice(1),
            aes(x = x, y = y, label = group),
            family = "Times", parse = T, hjust = 0) +
  scale_fill_manual(values = met.brewer("Hokusai3")[1:2]) +
  scale_color_manual(values = met.brewer("Hokusai3")[1:2]) +
  scale_y_continuous(NULL, breaks = NULL) +
  xlab("mean number of tools") +
  theme_tidybayes() +
  theme(legend.position = "none")
Note the very large right tail for Normal(0, 10)
The Normal(3, 0.5) prior always expects at least some tools, which is more sensible for human societies
How about the \(\beta\) prior?
Plot \(\beta\) ~ Normal(0, 10) and \(\beta\) ~ Normal(0, 0.2), each combined with \(\alpha\) ~ Normal(3, 0.5), and look at the implied prior relationships between (standardised) log population and total tools.
Code
set.seed(11)

# how many lines would you like?
n <- 100

# simulate and wrangle
tibble(i = 1:n,
       a = rnorm(n, mean = 3, sd = 0.5)) %>%
  mutate(`beta%~%Normal(0*', '*10)`  = rnorm(n, mean = 0, sd = 10),
         `beta%~%Normal(0*', '*0.2)` = rnorm(n, mean = 0, sd = 0.2)) %>%
  pivot_longer(contains("beta"),
               values_to = "b",
               names_to = "prior") %>%
  expand(nesting(i, a, b, prior),
         x = seq(from = -2, to = 2, length.out = 100)) %>%
  # plot
  ggplot(aes(x = x, y = exp(a + b * x), group = i)) +
  geom_line(size = 1/4, alpha = 2/3,
            color = met.brewer("Hokusai3")[4]) +
  labs(x = "log population (std)",
       y = "total tools") +
  coord_cartesian(ylim = c(0, 100)) +
  facet_wrap(~ prior, labeller = label_parsed) +
  theme_tidybayes()
The Normal(0, 10) prior (right panel) is truly awful - counts shoot into space! The Normal(0, 0.2) prior is reasonable though.
Now that we’ve settled on some sensible priors, let’s fit the models
Code
# fit intercept model
Oceanic_tools_1 <- brm(
  total_tools ~ 1,
  data = Oceanic_tools, family = poisson,
  prior = c(prior(normal(3, 0.5), class = Intercept)),
  iter = 4000, warmup = 2000, cores = 4, chains = 4,
  file = "fits/Oceanic_tools_1"
)

# fit interaction model
Oceanic_tools_2 <- brm(
  total_tools ~ 1 + contact_id * log_pop_std,
  data = Oceanic_tools, family = poisson,
  prior = c(prior(normal(3, 0.5), class = Intercept),
            prior(normal(0, 0.2), class = b)),
  iter = 4000, warmup = 2000, cores = 4, chains = 4,
  file = "fits/Oceanic_tools_2"
)
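One quick way to visualise the posterior predictions from the interaction model is brms’s built-in conditional_effects(); this is a sketch, not the original plotting code:
Code
# posterior predictions on the count scale, split by contact rate
conditional_effects(Oceanic_tools_2, effects = "log_pop_std:contact_id")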
Hawaii in particular has a huge effect on the fit. Note also that high-contact islands are not observed past a certain population size, yet the model keeps predicting there, with huge uncertainty. Another quirk is that the model predicts a non-zero number of tools for islands with zero people, which is nonsense.
Two immediate ways to improve model
Use a model robust to outliers: for this distribution type we can use the gamma-Poisson (negative binomial)
Use a more principled scientific model
We can model this differently:
We can model the change per unit time.
\(\Delta T = \alpha P^\beta - \gamma T\)
where
\(\Delta T\) is the expected change in the number of tools,
\(\alpha\) is the innovation rate,
\(\beta\) establishes diminishing returns to population size,
and \(\gamma T\), the rate of tool loss, is then subtracted.
This is not a regression model.
If we want the expected number of tools at a given population size, we set \(\Delta T = 0\) (the equilibrium) and solve for \(T\).
\(T = \frac{\alpha_C P^{\beta_C}}{\gamma}\)
Substitute this in for \(\lambda\) in our Poisson model! The subscript \(C\) allows the innovation parameters to differ by contact rate.
Let’s fit the innovation/loss model.
All parameters are rates, so they must be positive.
Two options here (both appear in the sketch below):
use exp() inside the model formula
use a prior (or lower bound) that constrains the parameter to be positive
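A sketch of how this scientific model could be fitted as a non-linear brms model. The priors, the raw `population` column name, and the file name are assumptions rather than the notes’ own code:
Code
# sketch: innovation/loss model as a non-linear Poisson model with identity link.
# exp(a) keeps the innovation rate positive (option 1); lb = 0 on b and g keeps
# them positive via the prior/bounds (option 2).
# assumes Oceanic_tools has a raw `population` column and a contact_id factor
Oceanic_tools_3 <- brm(
  bf(total_tools ~ exp(a) * population^b / g,
     a + b ~ 0 + contact_id,   # innovation parameters vary by contact rate
     g ~ 1,                    # single loss rate
     nl = TRUE),
  data = Oceanic_tools,
  family = poisson(link = "identity"),
  prior = c(prior(normal(1, 1), nlpar = a),
            prior(exponential(1), nlpar = b, lb = 0),
            prior(exponential(1), nlpar = g, lb = 0)),
  iter = 4000, warmup = 2000, cores = 4, chains = 4,
  file = "fits/Oceanic_tools_3"   # assumed file name
)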