Simple Linear Regression - Prediction

Research Objective

Research Question: What is the average student height for students whose mother is 64 inches tall?

How would you figure this out?

Prediction in Regression

Research Question: What is the average student height for students whose mother is 64 inches tall?

Answer: Use the best fit regression line to tell you the answer.

\(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 35.653 + 0.503\times 64 = 67.845\)
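The arithmetic above can be checked with a few lines of Python. This is only a sketch: the coefficients are the rounded values reported in these notes, and the function name `predict_height` is made up for illustration.

```python
# Fitted coefficients as reported (rounded) in the notes.
beta0_hat = 35.653  # intercept, in inches
beta1_hat = 0.503   # slope: change in student height per inch of mother height


def predict_height(mother_height):
    """Return the fitted line's value at the given mother height (inches)."""
    return beta0_hat + beta1_hat * mother_height


print(round(predict_height(64), 3))  # 67.845
```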

Practice 6.4 Question 1

Is our prediction of \(\hat{y}\) = 67.845 of the average height for students whose mother is 64 inches a sample statistic or population parameter?

  1. Sample statistic
  2. Population Parameter

Practice 6.4 Question 1 Answer

Is our prediction \(\hat{y}\) = 67.845 of the average height for students whose mother is 64 inches a sample statistic or population parameter?

  1. Sample statistic
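  2. Population Parameter

Answer: Sample statistic. The value 67.845 is computed from the line fitted to our sample, so it is an estimate of the population average.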

Let’s build a confidence interval for the population parameter!

Confidence Intervals for Averages

Using similar principles as we have used in the past to build confidence intervals: \[ \hat{y} \pm t^\star \hat{\sigma}\sqrt{\frac{1}{n}+\frac{(x-\bar{x})^2}{\sum_{i=1}^n(x_i - \bar{x})^2}} \] is a confidence interval for the average value of \(y\) given an \(x\) (the population average student height for 64-inch-tall mothers), where the value of \(t^\star\) is determined by the confidence level.

For our analysis, this comes out to be (67.662, 68.054) for a 95% interval.

Notes:

  1. Don’t worry about the formula (computer will calculate this for you).
  2. Interpretation: We are 95% confident that the average height of all students whose mothers are 64 inches tall is between 67.662 and 68.054 inches.
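For anyone curious what the computer is doing, here is a minimal sketch of the confidence interval for an average. The data are made up (they are not the class data set), the function names are hypothetical, and \(t^\star\) is passed in by hand (2.776 is the 95% value for \(n - 2 = 4\) degrees of freedom).

```python
import math


def fit_line(x, y):
    """Least-squares fit: returns intercept, slope, sigma-hat, x-bar, and Sxx."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(sse / (n - 2))  # residual standard deviation
    return b0, b1, sigma_hat, x_bar, sxx


def confidence_interval_for_average(x, y, x_new, t_star):
    """Interval for the AVERAGE of y at x_new (not for one individual)."""
    b0, b1, sigma_hat, x_bar, sxx = fit_line(x, y)
    n = len(x)
    y_hat = b0 + b1 * x_new
    margin = t_star * sigma_hat * math.sqrt(1 / n + (x_new - x_bar) ** 2 / sxx)
    return y_hat - margin, y_hat + margin


# Made-up example data: mother and student heights in inches.
mothers = [60, 62, 63, 64, 66, 68]
students = [65, 67, 66, 68, 69, 71]
print(confidence_interval_for_average(mothers, students, x_new=64, t_star=2.776))
```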

Prediction in Regression

Research Question: Shaylee’s mom is 64 inches tall, what will her height be?

Thought Questions:

  1. Is this the same question as above? If not, what is the difference?

    • It’s not the same. One is asking about an average while the other is asking about a specific person.
    • The “average” is the line while specific people are the “dots”.

Prediction in Regression

Research Question: Shaylee’s mom is 64 inches tall, what will her height be?

Thought Questions:

  1. Should our point prediction (1 number prediction) be the same or different?

    • The point prediction should be the same because “dots” could either fall above or below the line. In this case, we still think Shaylee’s height will be 67.845.

Prediction in Regression

Research Question: Shaylee’s mom is 64 inches tall, what will her height be?

  1. Should our interval for the prediction be the same or different? Why or why not?

    • It should be wider because heights vary a lot from person to person.

Prediction Intervals for Individuals

Using similar principles as we have used in the past to build confidence intervals: \[ \hat{y} \pm t^\star \hat{\sigma}\sqrt{1 + \frac{1}{n}+\frac{(x-\bar{x})^2}{\sum_{i=1}^n(x_i - \bar{x})^2}} \] is a prediction interval for the value of \(y\) given an \(x\) (for example, Shaylee’s height if her mom is 64 inches tall), where the value of \(t^\star\) is determined by the confidence level.

For our analysis, this comes out to be (60.449, 75.268) for a 95% interval.

Notes:

  1. Don’t worry about the formula (computer will calculate this for you).
  2. Interpretation is similar: We are 95% confident that Shaylee’s height, given her mom is 64 inches tall, is between 60.449 and 75.268 inches.
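As a sketch of what the computer calculates, the prediction interval can be built directly from regression summary statistics. Everything passed in below is a hypothetical placeholder except the point prediction 67.845; the function name `prediction_interval` and the argument `sxx` (standing for \(\sum_{i=1}^n(x_i - \bar{x})^2\)) are made up for illustration.

```python
import math


def prediction_interval(y_hat, sigma_hat, n, x_new, x_bar, sxx, t_star):
    """Interval for ONE new y at x_new, built from regression summaries."""
    # The leading 1 accounts for person-to-person variability around the line.
    spread = 1 + 1 / n + (x_new - x_bar) ** 2 / sxx
    margin = t_star * sigma_hat * math.sqrt(spread)
    return y_hat - margin, y_hat + margin


# Hypothetical summary values (NOT the class results), apart from y_hat.
low, high = prediction_interval(
    y_hat=67.845, sigma_hat=3.0, n=50, x_new=64, x_bar=63.5, sxx=400.0,
    t_star=2.01,  # roughly the 95% value for 48 degrees of freedom
)
print(low, high)
```

Because of the extra 1 under the square root, the margin is always at least \(t^\star \hat{\sigma}\), no matter how large the sample is.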

Prediction vs Confidence Intervals

Confidence interval for prediction: An interval estimate for the average of \(y\) given an \(x\).

Prediction interval for prediction: An interval estimate for the value of a single \(y\) given an \(x\).

Prediction intervals are ALWAYS wider than confidence intervals. Why?

  • There is more variability from student to student than with the average heights for students.
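The same point can be read off the two margins of error. For the same \(x\) and confidence level (and assuming \(\hat{\sigma} > 0\)): \[ \underbrace{t^\star \hat{\sigma}\sqrt{1 + \frac{1}{n}+\frac{(x-\bar{x})^2}{\sum_{i=1}^n(x_i - \bar{x})^2}}}_{\text{prediction interval margin}} \;>\; \underbrace{t^\star \hat{\sigma}\sqrt{\frac{1}{n}+\frac{(x-\bar{x})^2}{\sum_{i=1}^n(x_i - \bar{x})^2}}}_{\text{confidence interval margin}} \] The only difference is the extra 1 under the square root, which represents the variability of an individual around the line. Collecting more data shrinks the \(\frac{1}{n}\) term but not the 1.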

Using the Analysis Tool

All previous steps in the tool are the same as covered in previous lecture notes.

Practice 6.4 Question 2

Luis thinks he wants to move to Panama City, Florida, for the beaches (30.1588 north latitude). What is the UPPER bound of a 90% interval for the mortality rate for this state?

Practice 6.4 Question 2 Answer

Luis thinks he wants to move to Panama City, Florida, for the beaches (30.1588 north latitude). What is the UPPER bound of a 90% interval for the mortality rate for this state?

Nuances of Predictions

Research Question: Lucy’s mom is 82 inches tall, what will her height be?

Answer:

  • Don’t do the prediction because it’s outside of the data range! This is referred to as extrapolation.

Nuances of Predictions

  1. Extrapolation - trying to predict outside of the range of the data.
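A quick way to guard against extrapolation is to compare the new \(x\) to the smallest and largest \(x\) in the data before predicting. The sketch below uses made-up mother heights and a hypothetical helper name, `is_extrapolation`.

```python
def is_extrapolation(x_new, x_observed):
    """True when x_new lies outside the range of the observed x values."""
    return x_new < min(x_observed) or x_new > max(x_observed)


# Made-up mother heights (inches); the real data range may differ.
mother_heights = [58, 60, 62, 64, 66, 68, 70]

print(is_extrapolation(82, mother_heights))  # True: do not predict here
print(is_extrapolation(64, mother_heights))  # False: inside the data range
```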

Nuances of Predictions

  1. How do we know if our predictions are any good? For example, how do we know if our prediction for Shaylee’s height was good or bad?

    • Issue: To evaluate how well we do at predicting, we essentially need to know the true answer of the thing we are predicting for.
    • Solution: Cross-validation

Principles of K-Fold CV

  • Purpose: Assess how well your model does at predicting
  • General Idea: Fit your model to part of your data then see how well your model predicts the remainder of your data

Nuances of Cross Validation

  1. Randomly split the data into folds \(\rightarrow\) every run of cross-validation will give slightly different results
  2. Lots of performance metrics but most common is root mean square error \[ \text{RMSE} = \sqrt{\frac{1}{n_{\text{validation}}}\sum_{i=1}^{n_{\text{validation}}}(y_i - \hat{y}_i)^2} \] where \(y_i\) is an observation in the validation set and \(\hat{y}_i\) is the corresponding prediction.
  3. The intuitive interpretation of RMSE is the average error across our predictions.
  4. What constitutes a “small” RMSE is relative to the problem.
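The idea of K-fold cross-validation with RMSE can be sketched as follows. This is not the course tool's code: the function names, the made-up data, the `seed` argument, and the choice to average the per-fold RMSE values are all assumptions made for illustration.

```python
import math
import random


def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def k_fold_rmse(x, y, k, seed=0):
    """Average validation RMSE over k randomly assigned folds."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)        # random split into folds
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    fold_rmses = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        # Fit on everything except the held-out fold.
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        # Score predictions on the held-out (validation) fold only.
        sq_errors = [(y[i] - (b0 + b1 * x[i])) ** 2 for i in fold]
        fold_rmses.append(math.sqrt(sum(sq_errors) / len(sq_errors)))
    return sum(fold_rmses) / k


# Made-up example data: mother and student heights in inches.
mothers = [60, 62, 63, 64, 66, 68]
students = [65, 67, 66, 68, 69, 71]
print(k_fold_rmse(mothers, students, k=3))
```

Changing `seed` changes which observations land in which fold, which is why repeated runs of cross-validation give slightly different RMSE values.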

Practice 6.4 Question 3

For the melanoma example, suppose we got an RMSE of 19. Which of the following is the correct interpretation of this value?

  1. The difference between our predicted mortality and the actual mortality for Florida is 19.
  2. The worst prediction we made in cross-validation was 19 away from the actual mortality.
  3. The average difference between our predictions and the actual mortality rates is 19.
  4. The average residual in our regression analysis was 19.

Practice 6.4 Question 3 Answer

For the melanoma example, suppose we got an RMSE of 19. Which of the following is the correct interpretation of this value?

  1. The difference between our predicted mortality and the actual mortality for Florida is 19.
  2. The worst prediction we made in cross-validation was 19 away from the actual mortality.
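  3. The average difference between our predictions and the actual mortality rates is 19.
  4. The average residual in our regression analysis was 19.

Answer: Option 3. RMSE is interpreted as the average error across the predictions made in cross-validation.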

Additional Prediction Practice

Measuring possum head size can be difficult. However, measuring total possum length is easier. What is the relationship between possum length and head size? Use a simple linear regression model (and the course app) to answer the following questions:

  1. Sydney found a huge 96 cm possum. What is your predicted head length for this possum?

    • 95% prediction interval is (92.431, 102.986).
  2. Sydney found a huge 96 cm possum. What is the average head length for possums of this size?

    • 95% confidence interval is (96.545, 98.872).
  3. Sydney found a baby 70 cm possum. What is your predicted head length for this possum?

    • EXTRAPOLATION!
  4. Is your model good or bad at predicting possum head sizes?

    • The RMSE of a 104-fold CV is 2.0132492.

Key Terminology

  • Confidence Intervals for Averages
  • Extrapolation
  • Prediction Intervals for Individuals
  • Cross validation
  • Root mean square error (RMSE)