Quadratic Regressions (Are we straight?)

Although (multiple) linear regression models are extremely useful, a straight line is not the only biological relationship that can exist between two variables. A linear regression (linear in the relationship between the variables, as distinct from linear in the parameters) implies that for each unit increase in X (the independent variable) Y changes by an amount equal to the regression coefficient (b_i). However, many biological relationships are not completely linear, and often a curvilinear, quadratic relationship exists, with an intermediate optimum (which may be a maximum or a minimum, depending upon the relationship).

For example, if we look at corn yield per hectare and its relationship with the amount of fertiliser used, we will likely find that, initially, the corn yield increases as we use more fertiliser. However, we know that this increase in yield with increasing fertiliser cannot continue ad infinitum. The corn yield will probably reach a plateau, where increasing fertiliser use does not cause any increase in yield, and may even cause a decline. This type of relationship is curvilinear, perhaps adequately described by a quadratic relationship (perhaps not!). If a quadratic relationship is a reasonable representation then there will be an intermediate optimum (a maximum).

Another example, this time closer to home: if we look at the mortality rate of newborn babies and its relationship with birthweight, we see a curvilinear relationship. Babies with a very low birthweight have a high probability (risk) of death, babies with an intermediate (average) birthweight have a low probability of death, and babies with a high birthweight again have a higher probability of death. Thus, a quadratic relationship between risk of death and birthweight seems to exist, with an intermediate optimum birthweight at which the risk of death is minimized.

How do we handle this quadratic relationship in our model and analysis? Well, it's not too difficult! We can include a term for the square (quadratic) of the independent variable as an additional regression covariate:

$$ Y_i = \mu + b_1 X_i + b_2 X_i^2 + e_i $$

This will give us linear and quadratic regressions of Y on X.

We could take our data, square each observation of X, enter the squares as a new column (variable) and proceed just as for a multiple regression problem. However, we might make careless arithmetic mistakes, and it would take more time; let's let the computer do the work, that is what it is there for!

Consider the following experiment: a group of 50 cows was fed diets with various levels of feed intake (50 to 140 lbs of haylage) and various energy densities (0.8 to 1.6). The milk yield for the complete lactation was measured (in kg). The data are:


Cow   Feed Intake (lb)   Energy Density   Milk Yield (kg)
1 50 0.8 5731.05
2 50 1.0 4607.40
3 50 1.2 5169.25
4 50 1.4 6345.16
5 50 1.6 6477.83
6 60 0.8 4970.22
7 60 1.0 5263.30
8 60 1.2 5414.44
9 60 1.4 7102.82
10 60 1.6 6670.46
11 70 0.8 6371.27
12 70 1.0 5594.80
13 70 1.2 6033.55
14 70 1.4 7248.72
15 70 1.6 7288.52
16 80 0.8 5499.63
17 80 1.0 6644.66
18 80 1.2 6880.00
19 80 1.4 7542.48
20 80 1.6 7916.68
21 90 0.8 6758.12
22 90 1.0 7547.07
23 90 1.2 7855.26
24 90 1.4 7879.89
25 90 1.6 7938.86
26 100 0.8 6371.87
27 100 1.0 6996.44
28 100 1.2 7095.97
29 100 1.4 8360.18
30 100 1.6 8206.27
31 110 0.8 6750.66
32 110 1.0 7567.50
33 110 1.2 8222.51
34 110 1.4 8336.00
35 110 1.6 8967.15
36 120 0.8 6575.70
37 120 1.0 8261.29
38 120 1.2 7488.05
39 120 1.4 9299.34
40 120 1.6 8629.58
41 130 0.8 7165.49
42 130 1.0 7047.87
43 130 1.2 7764.65
44 130 1.4 8740.82
45 130 1.6 9101.40
46 140 0.8 7608.81
47 140 1.0 7843.19
48 140 1.2 8400.67
49 140 1.4 9421.99
50 140 1.6 9010.69


We could use the following SAS code to read the data in and fit a multiple regression model with ed, fi and fi squared:


data quad1;
input  cow  fi  ed   yield;
cards;
1  50    0.8    5731.05
2  50    1.0    4607.40
3  50    1.2    5169.25
4  50    1.4    6345.16
 .
 .
 .

48  140    1.2    8400.67
49  140    1.4    9421.99
50  140    1.6    9010.69
;


proc glm data=quad1;
  /* the response is named yield in the INPUT statement above */
  model yield = ed fi fi*fi;
run;



Note how we have included the term fi*fi, which is fi squared (fi^2)!
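For comparison, here is a minimal sketch of the "new column" approach, with the DATA step doing the squaring rather than us. The data set name quad2 and the variable name fi2 are illustrative assumptions and do not appear in the original notes; the fit should be the same as the one obtained with fi*fi.

```sas
/* Sketch only: build the squared covariate explicitly.            */
/* quad2 and fi2 are hypothetical names chosen for this example.   */
data quad2;
  set quad1;        /* start from the data set read in above       */
  fi2 = fi * fi;    /* quadratic term for feed intake              */
run;

proc glm data=quad2;
  model yield = ed fi fi2;   /* same model, squared term as a column */
run;
quit;
```

An explicit column like this is also what one would normally supply to procedures whose MODEL statement takes only variable names, such as PROC REG.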



We obtain the following SAS output:



 
The SAS System

The GLM Procedure

Number of observations 50

 


 
Dependent Variable: Yield

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3       61129561.86    20376520.62     103.21    <.0001
Error              46        9081842.78      197431.36
Corrected Total    49       70211404.64

R-Square    Coeff Var    Root MSE    Yield Mean
0.870650     6.137434    444.3325      7239.711

Source    DF      Type I SS    Mean Square    F Value    Pr > F
ed         1    20896893.40    20896893.40     105.84    <.0001
fi         1    38518744.03    38518744.03     195.10    <.0001
fi*fi      1     1713924.43     1713924.43       8.68    0.0050

Source    DF    Type III SS    Mean Square    F Value    Pr > F
ed         1    20896893.40    20896893.40     105.84    <.0001
fi         1     4481068.57     4481068.57      22.70    <.0001
fi*fi      1     1713924.43     1713924.43       8.68    0.0050

Parameter       Estimate    Standard Error    t Value    Pr > |t|
Intercept    -495.414245       788.0807136      -0.63      0.5327
ed           2285.656000       222.1662467      10.29      <.0001
fi             78.969322        16.5758454       4.76      <.0001
fi*fi          -0.254797         0.0864781      -2.95      0.0050


What can we see from this analysis? We see that the Model over and above the Mean, R(ed, fi, fi*fi | µ), accounts for a statistically significant amount of the variation (F-ratio = 103.21). We can also see that the Marginal effect of Energy Density, R(ed | µ, fi, fi*fi), is statistically significant (F-ratio = 105.84), as is the Marginal effect of fi*fi (the quadratic effect of Feed Intake), R(fi*fi | µ, ed, fi), with F-ratio = 8.68. We shall not test the statistical significance of the linear regression component for Feed Intake: if the quadratic effect is significant then we are going to keep the linear regression effect in the model anyway, so testing its statistical significance is a nonsense.
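As a quick check on where these F-ratios come from, each one is the Mean Square for the effect divided by the Error Mean Square (197431.36) from the output above. For an effect with 1 degree of freedom, the marginal F-ratio is also the square of the t value in the parameter table. The arithmetic below uses only figures from the SAS output, rounded; the subscripts are just labels chosen here.

```latex
\begin{align*}
F_{\text{model}} &= \frac{20376520.62}{197431.36} \approx 103.21 \\[4pt]
F_{fi \times fi} &= \frac{1713924.43}{197431.36} \approx 8.68 \\[4pt]
t_{fi \times fi} &= \frac{-0.254797}{0.0864781} \approx -2.946,
  \qquad t_{fi \times fi}^{2} \approx 8.68 \\[4pt]
R^2 &= \frac{61129561.86}{70211404.64} \approx 0.8707
\end{align*}
```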

What is the optimum feed intake? Well, let us look at the prediction equation that we have obtained.

$$ \hat{Y}_i = -495.41 + 2285.656\, ed_i + 78.969\, fi_i - 0.2548\, fi_i^2 $$

We can differentiate this with respect to feed intake, equate to zero and solve. Obvious, is it not? It almost takes us back to high school, solving for maxima and minima. Bet you never thought that you'd ever have any use for the calculus that you learnt! What do we get?
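The same steps can be written once in general form. This is a sketch using the b_1 and b_2 of the quadratic model given earlier, with X standing for feed intake. Any other covariates in the model (here energy density) enter additively, so they drop out of the derivative and do not change the location of the turning point.

```latex
\begin{align*}
\hat{Y} &= \hat{\mu} + b_1 X + b_2 X^2 \\[4pt]
\frac{d\hat{Y}}{dX} &= b_1 + 2 b_2 X = 0
  \quad\Longrightarrow\quad X_{opt} = -\frac{b_1}{2 b_2} \\[4pt]
\frac{d^2\hat{Y}}{dX^2} &= 2 b_2
\end{align*}
```

If b_2 is negative the second derivative is negative and the turning point is a maximum; if b_2 is positive it is a minimum.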

$$ \frac{\partial Y}{\partial fi} = 78.969322 - 2 \times 0.254797 \times fi $$

$$ 78.969322 - 2 \times 0.254797 \times fi = 0 $$

$$ fi_{opt} = 154.965 $$

Note that the estimated optimum (approximately 155 lb) actually lies outside the range of our data (50 to 140 lb); hence we have a curve which is approaching a maximum, but our data do not in fact encompass the maximum. Since extrapolating outside the data range is somewhat speculative, we should be quite cautious about these results. We would probably want to repeat the experiment, feeding increased amounts of feed to check out the prediction. It would be most desirable to have feed intakes (X values) spanning the area of the optimum, so that we are not extrapolating.
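If we would rather let SAS do this arithmetic too, a small DATA step can compute the turning point and flag whether it falls inside the range of feed intakes actually fed. This is only a sketch: the coefficients are typed in by hand from the output above, the range 50 to 140 is taken from the description of the experiment, and all variable names are illustrative assumptions.

```sas
/* Sketch only: turning point of the fitted quadratic in feed intake. */
/* Results are written to the SAS log; no data set is created.        */
data _null_;
  b1 = 78.969322;      /* linear coefficient for fi (from the output)    */
  b2 = -0.254797;      /* quadratic coefficient for fi (from the output) */
  fi_min = 50;         /* smallest feed intake in the experiment         */
  fi_max = 140;        /* largest feed intake in the experiment          */

  fi_opt = -b1 / (2 * b2);   /* solve b1 + 2*b2*fi = 0 */
  put "Estimated optimum feed intake: " fi_opt 8.3;

  /* sign of the quadratic coefficient tells us the type of optimum */
  if b2 < 0 then put "b2 is negative, so the turning point is a maximum.";
  else if b2 > 0 then put "b2 is positive, so the turning point is a minimum.";

  /* warn when the optimum is an extrapolation beyond the data */
  if fi_opt < fi_min or fi_opt > fi_max then
    put "The optimum lies outside the observed feed intakes: extrapolation.";
run;
```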


R.I. Cue ©
Department of Animal Science, McGill University
last updated : 2010 April 28