## Introduction

In this post, we study a seemingly easy question; what is a 95% prediction interval in simple linear regression? We assume that our data comes from \[ y_i = 1 + 2x_i + \epsilon_i \] where \(\epsilon_i\) are iid normal with mean 0 and standard deviation 3.

## R Code for Prediction

We can build a linear model and prediction intervals using the following R code:

```
set.seed(4172018)
x <- runif(30, 0, 10)
y <- 1 + 2 * x + rnorm(30, 0, 3)
mod <- lm(y~x)
predict(mod, newdata = data.frame(x = 5), interval = "pre")
```

```
## fit lwr upr
## 1 12.03934 6.313055 17.76562
```

We see that we have an estimate of 12 (true value: \(1 + 2 \times 5\)), a lower prediction bound of 6.3 and an upper of 17.8. What percentage of future observations will fall in this interval?

```
sim_dat <- replicate(10000, {
new_point <- 1 + 2 * 5 + rnorm(1, 0, 3)
new_point > 6.313055 && new_point < 17.76562
})
mean(sim_dat)
```

`## [1] 0.9281`

We see that 9281 out of 10,000 future data points fall in this 95% prediction interval. Since we know the true way of generating data, we can compute the exact probability that future data will fall in this interval via:

`pnorm(17.76562 , 11, 3) - pnorm(6.313055, 11, 3)`

`## [1] 0.9288329`

What’s going on? How does constructing a 95% prediction interval lead to an interval that collects 92.88% of future data?

## Sampling Distribution

To understand this better, let’s estimate (via simulation) the sampling distribution of the probability that a prediction interval will contain future values.

```
sampling_dist <- replicate(1000, {
x <- runif(30, 0, 10)
y <- 1 + 2 * x + rnorm(30, 0, 3)
mod <- lm(y~x)
pre <- predict(mod, newdata = data.frame(x = 5), interval = "pre")
pnorm(pre[3], 11, 3) - pnorm(pre[2], 11, 3)
})
hist(sampling_dist)
```

`min(sampling_dist) #Smallest percentage of future values in prediction interval`

`## [1] 0.7742589`

`mean(sampling_dist) #Expected percentage of future values in prediction interval`

`## [1] 0.9506352`

`mean(sampling_dist < 0.9) #percentage of times that prediction interval predicts less than 90% of future values`

`## [1] 0.084`

`density(sampling_dist)$x[which.max(density(sampling_dist)$y)] #Most likely percentage of future values contained in prediction interval`

`## [1] 0.9673389`

The histogram of attained probabilities that a prediction interval will contain future data explains what is going on. The probabilities have *expected value* of 0.95, but the distribution is skew. The most likely outcome seems to be that the prediction interval contain about 97% of future data, but there is about a 8% chance that it will contain less than 90% of future data. The minimum amount of future data contained (in this sample size of 1000) is 77%!

## Sample Size

I have heard people say that the benefit of having more samples is that the prediction intervals get narrower. That is true, but also true is that a 95% prediction interval is also much more likely to contain about 95% of future values! Let’s see what happens when we have 300 and 3000 data points, rather than 30:

## Table of Standard Deviations

Finally, we estimate the standard deviation of the sampling distribution of the probability that a future value will be in a prediction interval for various sample sizes.

```
sample_sizes <- c(10, 20, 50, 100, 200, 500, 1000, 2000, 5000)
sds <- sapply(sample_sizes, function(n) {
sd(replicate(1000, {
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, 0, 3)
mod <- lm(y~x)
pre <- predict(mod, newdata = data.frame(x = 5), interval = "pre")
pnorm(pre[3], 11, 3) - pnorm(pre[2], 11, 3)
}))
})
knitr::kable(data.frame(`Sample Size` = sample_sizes, `Standard Dev` = sds))
```

Sample.Size | Standard.Dev |
---|---|

10 | 0.0705585 |

20 | 0.0386866 |

50 | 0.0242093 |

100 | 0.0162205 |

200 | 0.0117437 |

500 | 0.0071819 |

1000 | 0.0050827 |

2000 | 0.0037369 |

5000 | 0.0022930 |