Quadratic Regressions

Quadratic Regressions (Are we straight?)

Although (multiple) linear regression models are extremely useful, a straight line is not the only form that a biological relationship between two variables can take. A linear regression (linear in the relationship between the variables, not merely linear in the parameters) implies that as the value of X (the independent variable) increases by one unit, Y changes by an amount equal to the regression coefficient (b_i). However, many biological relationships are not completely linear, and often a curvilinear, quadratic relationship can exist, with an intermediate optimum (which may be a maximum or a minimum depending upon the relationship).

For example, if we look at the corn yield per hectare and its relationship with the amount of fertiliser used, then we will likely find that, initially, as we use more fertiliser the corn yield will increase. However, we know that this increase in yield with increasing fertiliser cannot continue ad infinitum. The corn yield will probably reach a plateau, where increasing fertiliser use does not cause any increase in yield, and may even cause a decline. This type of relationship is a curvilinear relationship, perhaps adequately described by a quadratic relationship (perhaps not!). If a quadratic relationship is a reasonable representation then there will be an intermediate optimum (a maximum).

Another example, this time closer to home. If we look at the mortality rate of newborn babies and its relationship with birthweight, we see a curvilinear relationship: babies with a very low birthweight have a high probability (risk) of death, babies with an intermediate (average) birthweight have a low probability of death, and babies with a high birthweight again have a higher risk of death. Thus a quadratic relationship between risk of death and birthweight seems to exist, with an intermediate optimum birthweight at which the risk of death is minimized.

How do we handle this quadratic relationship in our model and analysis? Well, it's not too difficult! We can include a term for the square (quadratic) of the independent variable as an additional regression covariate:

Y_i = µ + b₁ X_i + b₂ X_i² + e_i

This will give us linear and quadratic regressions of Y on X.

We could take our data, square each observation of X by hand, enter the squares as a new column (variable) and proceed just as for a multiple regression problem. However, we might make careless arithmetic mistakes, and it will take more time; let's let the computer do the work, that is what computers are there for!
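As a minimal sketch of letting the computer build the squared covariate, here is the idea in Python rather than SAS. The variable names (`feed_intake`, `feed_intake_sq`, `covariates`) are illustrative assumptions, not part of the original analysis; the feed intake levels are the ten levels used in the experiment described in these notes.

```python
# Feed intake levels (lb of haylage) used in the experiment.
feed_intake = [50, 60, 70, 80, 90, 100, 110, 120, 130, 140]

# The quadratic covariate is just the square of each observation.
feed_intake_sq = [x * x for x in feed_intake]

# Pair each X with its square, ready to be used as two regression columns.
covariates = list(zip(feed_intake, feed_intake_sq))

for x, x_sq in covariates:
    print(x, x_sq)
```

The squared column is then treated as just another covariate in a multiple regression.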

Consider the following experiment: a group of 50 cows were fed diets with various levels of feed intake (50 to 140 lb of haylage) and various energy densities (0.8 to 1.6). The milk yield for the complete lactation was measured (in kg). The data are:

Cow	Feed Intake (lb)	Energy Density	Milk Yield (kg)
1	50	0.8	5731.05
2	50	1.0	4607.40
3	50	1.2	5169.25
4	50	1.4	6345.16
5	50	1.6	6477.83
6	60	0.8	4970.22
7	60	1.0	5263.30
8	60	1.2	5414.44
9	60	1.4	7102.82
10	60	1.6	6670.46
11	70	0.8	6371.27
12	70	1.0	5594.80
13	70	1.2	6033.55
14	70	1.4	7248.72
15	70	1.6	7288.52
16	80	0.8	5499.63
17	80	1.0	6644.66
18	80	1.2	6880.00
19	80	1.4	7542.48
20	80	1.6	7916.68
21	90	0.8	6758.12
22	90	1.0	7547.07
23	90	1.2	7855.26
24	90	1.4	7879.89
25	90	1.6	7938.86
26	100	0.8	6371.87
27	100	1.0	6996.44
28	100	1.2	7095.97
29	100	1.4	8360.18
30	100	1.6	8206.27
31	110	0.8	6750.66
32	110	1.0	7567.50
33	110	1.2	8222.51
34	110	1.4	8336.00
35	110	1.6	8967.15
36	120	0.8	6575.70
37	120	1.0	8261.29
38	120	1.2	7488.05
39	120	1.4	9299.34
40	120	1.6	8629.58
41	130	0.8	7165.49
42	130	1.0	7047.87
43	130	1.2	7764.65
44	130	1.4	8740.82
45	130	1.6	9101.40
46	140	0.8	7608.81
47	140	1.0	7843.19
48	140	1.2	8400.67
49	140	1.4	9421.99
50	140	1.6	9010.69

We could use the following SAS code to read the data in and fit a multiple regression model of yield on ed, fi and fi²:


data quad1;
input  cow  fi  ed   yield;
cards;
1  50    0.8    5731.05
2  50    1.0    4607.40
3  50    1.2    5169.25
4  50    1.4    6345.16
 .
 .
 .

48  140    1.2    8400.67
49  140    1.4    9421.99
50  140    1.6    9010.69
;


proc glm data=quad1;
model yield = ed fi fi*fi;
run;

Note how we have included the term fi*fi which is fi²!


We obtain the following SAS output:

The SAS System

The GLM Procedure

Number of observations	50

Dependent Variable: Yield

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	3	61129561.86	20376520.62	103.21	<.0001
Error	46	9081842.78	197431.36
Corrected Total	49	70211404.64

R-Square	Coeff Var	Root MSE	Yield Mean
0.870650	6.137434	444.3325	7239.711

Source	DF	Type I SS	Mean Square	F Value	Pr > F
ed	1	20896893.40	20896893.40	105.84	<.0001
fi	1	38518744.03	38518744.03	195.10	<.0001
fi*fi	1	1713924.43	1713924.43	8.68	0.0050

Source	DF	Type III SS	Mean Square	F Value	Pr > F
ed	1	20896893.40	20896893.40	105.84	<.0001
fi	1	4481068.57	4481068.57	22.70	<.0001
fi*fi	1	1713924.43	1713924.43	8.68	0.0050

Parameter	Estimate	Standard Error	t Value	Pr > |t|
Intercept	-495.414245	788.0807136	-0.63	0.5327
ed	2285.656000	222.1662467	10.29	<.0001
fi	78.969322	16.5758454	4.76	<.0001
fi*fi	-0.254797	0.0864781	-2.95	0.0050
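The parameter estimates above can be cross-checked outside SAS. The following is a sketch only, in plain Python, that fits the same model by ordinary least squares through the normal equations (X'X)b = X'y. It is not how PROC GLM computes its solution internally; the function names (`design_rows`, `solve`, `fit`) are hypothetical helpers introduced for this illustration. The yields are the 50 observations from the data table, in cow order.

```python
# Milk yields (kg) in cow order: one line per feed intake level (50..140 lb),
# five energy densities (0.8..1.6) within each level.
YIELDS = [
    5731.05, 4607.40, 5169.25, 6345.16, 6477.83,
    4970.22, 5263.30, 5414.44, 7102.82, 6670.46,
    6371.27, 5594.80, 6033.55, 7248.72, 7288.52,
    5499.63, 6644.66, 6880.00, 7542.48, 7916.68,
    6758.12, 7547.07, 7855.26, 7879.89, 7938.86,
    6371.87, 6996.44, 7095.97, 8360.18, 8206.27,
    6750.66, 7567.50, 8222.51, 8336.00, 8967.15,
    6575.70, 8261.29, 7488.05, 9299.34, 8629.58,
    7165.49, 7047.87, 7764.65, 8740.82, 9101.40,
    7608.81, 7843.19, 8400.67, 9421.99, 9010.69,
]
FEED = range(50, 150, 10)
DENSITY = [0.8, 1.0, 1.2, 1.4, 1.6]


def design_rows():
    """One row per cow: intercept, ed, fi and fi squared."""
    return [[1.0, ed, float(fi), float(fi * fi)]
            for fi in FEED for ed in DENSITY]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit():
    """Least-squares estimates from the normal equations (X'X) b = X'y."""
    rows = design_rows()
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
           for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, YIELDS)) for i in range(p)]
    return solve(xtx, xty)


intercept, b_ed, b_fi, b_fi2 = fit()
print(intercept, b_ed, b_fi, b_fi2)
```

With the data as tabulated, the four values should agree with the Estimate column of the SAS output to within rounding.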

What can we see from this analysis? The Model over and above the Mean, R(ed, fi, fi*fi | µ), accounts for a statistically significant amount of the variation (F-ratio = 103.21). The Marginal effect of Energy Density, R(ed | µ, fi, fi*fi), is statistically significant (F-ratio = 105.84), as is the Marginal effect of fi*fi, R(fi*fi | µ, ed, fi), which is the quadratic effect of Feed Intake (F-ratio = 8.68). We shall not test the statistical significance of the linear regression component for Feed Intake: if the quadratic effect is significant then we are going to keep the linear regression effect in the model regardless, so testing its statistical significance is a nonsense.
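The F-ratios quoted here follow directly from the sums of squares in the SAS output. As a small sketch (variable names are illustrative assumptions), the arithmetic can be reproduced in Python: each F-ratio is a mean square divided by the error mean square, R-Square is the model sum of squares over the corrected total, and for a single-degree-of-freedom term the marginal F-ratio equals the square of its t value.

```python
# Sums of squares and degrees of freedom copied from the SAS output.
ss_model, df_model = 61129561.86, 3
ss_error, df_error = 9081842.78, 46
ss_total = 70211404.64
ss_fi2 = 1713924.43          # Type III SS for fi*fi (1 df)

ms_model = ss_model / df_model
ms_error = ss_error / df_error

f_model = ms_model / ms_error    # overall model F-ratio
f_fi2 = ss_fi2 / ms_error        # marginal F-ratio for fi*fi
r_square = ss_model / ss_total   # proportion of variation explained

# t value for fi*fi: estimate divided by its standard error.
t_fi2 = -0.254797 / 0.0864781

print(round(f_model, 2), round(f_fi2, 2), round(r_square, 6))
print(round(t_fi2, 2), round(t_fi2 ** 2, 2))
```

Squaring the t value for fi*fi gives the same number as its Type III F-ratio, which is why the two tests report the same probability (0.0050).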

What is the optimum feed intake? Well, let us look at the prediction equation that we have obtained.

Y_i = -495.41 + 2285.656*ed_i + 78.969*fi_i - 0.2548*fi_i²

We can differentiate this with respect to feed intake, equate to zero and solve. Obvious, is it not? It almost takes us back to high school, solving for maximums and minimums. Bet you never thought that you'd ever have any use for the calculus that you learnt! What do we get?

∂Y/∂fi = 78.969322 - 2 * 0.254797 * fi

78.969322 - 2 * 0.254797 * fi = 0

fi_opt = 154.965

Note that the estimated optimum (~155 lb) actually lies outside the range of our data (50 to 140 lb); hence we have a curve which is approaching a maximum, but our data do not in fact encompass the maximum. Since extrapolating outside the data range is somewhat speculative, we should be quite cautious about these results. We would probably want to repeat the experiment, feeding increased amounts of feed to check out the prediction. It would be most desirable to have feed intakes (X values) spanning the area of the optimum, so that we are not extrapolating.
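The calculation of the optimum, and the check on whether it falls inside the observed range, can be sketched in a few lines of Python. The variable names are illustrative assumptions; the coefficients are the SAS estimates for fi and fi*fi, and the range limits are the lowest and highest feed intakes in the experiment.

```python
# Estimated linear and quadratic coefficients for feed intake.
b_fi = 78.969322
b_fi2 = -0.254797

# Setting the derivative b_fi + 2 * b_fi2 * fi to zero gives the turning point.
fi_opt = -b_fi / (2 * b_fi2)

# A negative quadratic coefficient means the turning point is a maximum.
is_maximum = b_fi2 < 0

# Range of feed intakes actually observed in the experiment (lb).
FI_MIN, FI_MAX = 50, 140
is_extrapolation = not (FI_MIN <= fi_opt <= FI_MAX)

print(round(fi_opt, 3), is_maximum, is_extrapolation)
```

Because energy density enters the model only as an additive term, the estimated optimum feed intake is the same at every energy density.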