Although (multiple) linear regression models are extremely useful, a straight line is not the only possible relationship between two biological variables. A linear regression (here "linear" refers to the relationship between the variables, not to linearity in the parameters) implies that for each unit increase in X (the independent variable), Y increases by an amount equal to the regression coefficient ($b_1$). However, many biological relationships are not completely linear; often a curvilinear, quadratic relationship exists, with an intermediate optimum (which may be a maximum or a minimum, depending upon the relationship).

For example, if we look at corn yield per hectare and its relationship with the amount of fertiliser used, then we will likely find that, initially, as we use more fertiliser the corn yield increases. However, we know that this increase in yield with increasing fertiliser cannot continue ad infinitum. The corn yield will probably reach a plateau, where increasing fertiliser use does not cause any further increase in yield, and may even cause a decline. This is a curvilinear relationship, perhaps adequately described by a quadratic relationship (perhaps not!). If a quadratic relationship is a reasonable representation, then there will be an intermediate optimum (a maximum).

Another example, this time closer to home: if we look at the mortality rate of newborn babies and its relationship with birthweight, we see a curvilinear relationship. Babies with a very low birthweight have a high probability (risk) of death, babies with an intermediate (average) birthweight have a low probability of death, and babies with a high birthweight again have a higher risk (probability) of death. Thus a quadratic relationship between risk of death and birthweight seems to exist, with an intermediate (minimum) optimum birthweight at which the risk of death is minimized.
How do we handle this quadratic relationship in our model and analysis? Well, it's not too difficult! We can include a term for the square (quadratic) of the independent variable as an additional regression covariate:
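$$ Y_i = b_0 + b_1 X_{1i} + b_2 X_{1i}^2 + e_i $$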
This will give us linear and quadratic regressions of Y on X.
We could take our data, square each observation of $X_1$ by hand, and enter the squares as a new column (variable), then proceed just as for a multiple regression problem. However, we might make careless arithmetic mistakes, and it would take more time; let's let the computer do the work, that is what they are there for!
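Even so, it is worth seeing what that route looks like. Here is a minimal sketch (assuming the data set quad1 that we read in below; the names quad2 and fisq are arbitrary), using PROC REG, which, unlike PROC GLM, does not accept terms like fi*fi directly in the MODEL statement:

```sas
/* Sketch of the manual route: compute the square as a new column in a
   DATA step, then fit it as an ordinary covariate with PROC REG.
   Assumes the data set quad1 read in below; quad2 and fisq are
   arbitrary names. */
data quad2;
   set quad1;
   fisq = fi*fi;   /* the computer does the squaring for us */
run;

proc reg data=quad2;
   model yield = ed fi fisq;
run;
```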
Consider the following experiment: a group of 50 cows was fed diets with various levels of feed intake (50 to 140 lbs of haylage) and various energy densities (0.8 to 1.6). The milk yield for the complete lactation was measured (in kg). The data are:
Cow | Feed Intake (lbs) | Energy Density | Milk Yield (kg) |
---|---|---|---|
1 | 50 | 0.8 | 5731.05 |
2 | 50 | 1.0 | 4607.40 |
3 | 50 | 1.2 | 5169.25 |
4 | 50 | 1.4 | 6345.16 |
5 | 50 | 1.6 | 6477.83 |
6 | 60 | 0.8 | 4970.22 |
7 | 60 | 1.0 | 5263.30 |
8 | 60 | 1.2 | 5414.44 |
9 | 60 | 1.4 | 7102.82 |
10 | 60 | 1.6 | 6670.46 |
11 | 70 | 0.8 | 6371.27 |
12 | 70 | 1.0 | 5594.80 |
13 | 70 | 1.2 | 6033.55 |
14 | 70 | 1.4 | 7248.72 |
15 | 70 | 1.6 | 7288.52 |
16 | 80 | 0.8 | 5499.63 |
17 | 80 | 1.0 | 6644.66 |
18 | 80 | 1.2 | 6880.00 |
19 | 80 | 1.4 | 7542.48 |
20 | 80 | 1.6 | 7916.68 |
21 | 90 | 0.8 | 6758.12 |
22 | 90 | 1.0 | 7547.07 |
23 | 90 | 1.2 | 7855.26 |
24 | 90 | 1.4 | 7879.89 |
25 | 90 | 1.6 | 7938.86 |
26 | 100 | 0.8 | 6371.87 |
27 | 100 | 1.0 | 6996.44 |
28 | 100 | 1.2 | 7095.97 |
29 | 100 | 1.4 | 8360.18 |
30 | 100 | 1.6 | 8206.27 |
31 | 110 | 0.8 | 6750.66 |
32 | 110 | 1.0 | 7567.50 |
33 | 110 | 1.2 | 8222.51 |
34 | 110 | 1.4 | 8336.00 |
35 | 110 | 1.6 | 8967.15 |
36 | 120 | 0.8 | 6575.70 |
37 | 120 | 1.0 | 8261.29 |
38 | 120 | 1.2 | 7488.05 |
39 | 120 | 1.4 | 9299.34 |
40 | 120 | 1.6 | 8629.58 |
41 | 130 | 0.8 | 7165.49 |
42 | 130 | 1.0 | 7047.87 |
43 | 130 | 1.2 | 7764.65 |
44 | 130 | 1.4 | 8740.82 |
45 | 130 | 1.6 | 9101.40 |
46 | 140 | 0.8 | 7608.81 |
47 | 140 | 1.0 | 7843.19 |
48 | 140 | 1.2 | 8400.67 |
49 | 140 | 1.4 | 9421.99 |
50 | 140 | 1.6 | 9010.69 |
We could use the following SAS code to read the data in and fit a multiple regression model with ed, fi and fi²:
```sas
data quad1;
   input cow fi ed yield;
   cards;
 1  50 0.8 5731.05
 2  50 1.0 4607.40
 3  50 1.2 5169.25
 4  50 1.4 6345.16
 . . .
48 140 1.2 8400.67
49 140 1.4 9421.99
50 140 1.6 9010.69
;

proc glm data=quad1;
   model yield = ed fi fi*fi;
run;
```
Note how we have included the term fi*fi, which is fi²! PROC GLM lets us write polynomial terms like this directly in the MODEL statement, so we do not even have to create the squared variable ourselves.
We obtain the following SAS output:
The GLM Procedure

Number of observations: 50

Dependent Variable: Yield

Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 3 | 61129561.86 | 20376520.62 | 103.21 | <.0001 |
Error | 46 | 9081842.78 | 197431.36 | | |
Corrected Total | 49 | 70211404.64 | | | |

R-Square | Coeff Var | Root MSE | Yield Mean |
---|---|---|---|
0.870650 | 6.137434 | 444.3325 | 7239.711 |

Source | DF | Type I SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
ed | 1 | 20896893.40 | 20896893.40 | 105.84 | <.0001 |
fi | 1 | 38518744.03 | 38518744.03 | 195.10 | <.0001 |
fi*fi | 1 | 1713924.43 | 1713924.43 | 8.68 | 0.0050 |

Source | DF | Type III SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
ed | 1 | 20896893.40 | 20896893.40 | 105.84 | <.0001 |
fi | 1 | 4481068.57 | 4481068.57 | 22.70 | <.0001 |
fi*fi | 1 | 1713924.43 | 1713924.43 | 8.68 | 0.0050 |

Parameter | Estimate | Standard Error | t Value | Pr > \|t\| |
---|---|---|---|---|
Intercept | -495.414245 | 788.0807136 | -0.63 | 0.5327 |
ed | 2285.656000 | 222.1662467 | 10.29 | <.0001 |
fi | 78.969322 | 16.5758454 | 4.76 | <.0001 |
fi*fi | -0.254797 | 0.0864781 | -2.95 | 0.0050 |
What can we see from this analysis? Well, we see that the Model over and above the Mean, R(ed, fi, fi*fi | µ), accounts for a statistically significant amount of the variation (F-ratio = 103.21). We can also see that the Marginal effect of Energy Density, R(ed | µ, fi, fi*fi), is statistically significant (F-ratio = 105.84), as is the Marginal effect of fi*fi (the quadratic effect of Feed Intake), R(fi*fi | µ, ed, fi), with F-ratio = 8.68. We shall not test the statistical significance of the linear regression component for Feed Intake, since if the quadratic effect is significant then we are going to include the linear regression effect in the model anyway! Hence testing its statistical significance is a nonsense.
What is the optimum feed intake? Well, let us look at the prediction equation that we have obtained:
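$$ \hat{Y} = -495.414 + 2285.656\,ed + 78.9693\,fi - 0.254797\,fi^2 $$

(the coefficients being the parameter estimates from the output above)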
We can differentiate this with respect to feed intake, equate to zero, and solve. Obvious, is it not? It almost takes us back to high school, solving for maxima and minima. Bet you never thought that you'd ever have any use for the calculus that you learnt!
What do we get?
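$$ \frac{\partial \hat{Y}}{\partial fi} = 78.9693 - 2(0.254797)\,fi = 0 \quad\Longrightarrow\quad fi_{opt} = \frac{78.9693}{0.509594} \approx 155 $$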
Note that the estimated optimum (~155 lbs) actually lies outside the range of our data; we have a curve which is approaching a maximum, but our data do not in fact encompass that maximum. Since extrapolating outside the data range is somewhat speculative, we should be quite cautious about these results. We would probably want to repeat the experiment, feeding increased amounts of feed, to check the prediction. It would be most desirable to have feed intakes (X values) spanning the area of the optimum, so that we are not extrapolating.
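To see just how far beyond the data that optimum lies, we could trace the fitted curve past the observed range. A minimal sketch (holding energy density fixed at 1.2, an arbitrary choice, and plugging in the parameter estimates from the output above):

```sas
/* Trace the fitted quadratic out to fi = 180 to see where the
   estimated maximum (~155) sits relative to the data (fi <= 140).
   Energy density is held fixed at 1.2 (an arbitrary choice). */
data curve;
   do fi = 50 to 180 by 5;
      ed   = 1.2;
      yhat = -495.414245 + 2285.656*ed
             + 78.969322*fi - 0.254797*fi*fi;   /* estimates from PROC GLM */
      output;
   end;
run;

proc sgplot data=curve;
   series x=fi y=yhat;                          /* fitted curve */
   refline 140 / axis=x label="largest observed fi";
run;
```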