Classification or Regression, that is the question

This is a first attempt at an R version of the SAS analysis comparing a classification effects model versus a simpler regression model, using either orthogonal polynomials or an over-parameterised model.

Data file with recorded observations

R code to read in the data and carry out the analyses

As with the analyses done using SAS, we have an experiment where 15 cows were randomly assigned to 5 different diets (basal diet + 20, 30, 40, 50 or 60 units of a protein supplement). We can analyse these data as a straightforward CRD. We can also ask the question "Are the treatment/diet effects quantitative in nature?", since the treatments correspond to actual amounts of a protein supplement. For this we could use a multiple regression model, regressing on the amount of protein supplement offered. Which model is best, most parsimonious, and/or appropriate? This section is about just that question; trying to determine (statistically) which is the more appropriate model to use. The classification model is more complex; it requires more parameters than the simpler regression model (which has fewer parameters, i.e. is more parsimonious). Is the extra complexity worth it, or is the simpler model adequate?

Classification model

We can fit a classification model. In this case, since the only factor is the diet treatments, it is a CRD (Completely Randomized Design), but in more complicated models with more than one factor it would be a model where Diet was a classification effect, in R terms a 'factor'. Below is the code to fit the CRD, to provide an ANOVA and to produce the least squares means for the treatment effects (trt = Diet supplement).

# lsmeans() comes from the lsmeans package
library(lsmeans)

# declare dmi (a numeric variable) also as a factor, trt
ds$trt <- as.factor(ds$dmi)

# model 1, fixed effects, dmi (trt) as a classification effect
lm1 <- lm(my ~ trt, data = ds)
lm1
print(lm1, digits = 7)
summary(lm1)
anova(lm1)

# least squares means and pairwise comparisons of treatments
lsmeans(lm1, pairwise ~ trt)
lsmeans(lm1, pairwise ~ trt, adjust = "Scheffe")

We first specify that dmi is to be considered as a factor (classification) effect and assign it to the variable trt. We do this so that we have, in fact, two variables: dmi as the numeric quantity of protein supplement, and trt as the classification factor. This will facilitate the subsequent fitting of an over-parameterized model.

From the Analysis of Variance we see that there is a statistically significant effect of the dietary treatments (F value of 26.08, on 4 numerator degrees of freedom [since we have 5 treatments] and 10 denominator degrees of freedom). Is there a linear trend, or a quadratic trend, or are the treatment effects simply without a 'trend'? The lsmeans command provides us the information below.

 trt   lsmean       SE 
 20  14.43333 1.572754 
 30  20.83333 1.572754 
 40  27.33333 1.572754 
 50  32.70000 1.572754 
 60  33.16667 1.572754 

We see that increasing amounts of protein supplement seem to be associated with increasing milk yield. However, is it a linear trend? Is it a quadratic relationship?

IFF the intervals between the various levels are all equal, which they are in this example [the protein supplement increments by 10 units from one treatment level to the next], then we can use the method of orthogonal polynomial contrasts to partition the Sums of Squares for treatments into that which we can attribute to a Linear trend, that which we can attribute to a Quadratic trend, that which we can attribute to a Cubic trend and that which we can attribute to a Quartic trend. The four orthogonal components together are equivalent to the classification effect (which we recall has 4 d.f.).

I can comprehend a Linear trend; I can comprehend a Quadratic trend. The combined effect of the Cubic and Quartic, over and above the Linear and Quadratic trends brings us to the equivalent Classification model. Thus, if we compute the effect of the Cubic and Quartic, over and above the Linear and Quadratic, we can determine whether there is a statistically significant improvement in the goodness of fit of the model by going from Linear and Quadratic regressions, to the more complicated Classification model. If the effect (of the Cubic and Quartic together) is significant, then it means that the more complex Classification model is worth it; if the Cubic and Quartic are (together) not statistically significant, then the Classification model is not significantly better and we should drop the Classification model, and examine the Linear and Quadratic regressions.
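In symbols, the test just described is an extra-sum-of-squares F-test (a sketch of the idea; the subscripts simply name the orthogonal polynomial components):

\[ F = \frac{(SS_{\text{Cubic}} + SS_{\text{Quartic}})/2}{MS_{\text{Residual}}} \]

with 2 numerator degrees of freedom, and where the Residual Mean Square (and its degrees of freedom) come from the Classification model.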

Any statistical text will provide tables of the coefficients for various simple cases of orthogonal polynomial contrasts; typically up to about 5 or 6 degrees of freedom, see Steel, Torrie and Dickey.

For the case of 5 treatments, hence 4 d.f. we have

            Treatments
           T1   T2   T3   T4   T5
 Cubic     -1    2    0   -2    1
 Quartic    1   -4    6   -4    1
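
These tabulated coefficients can be cross-checked in R (a sketch; contr.poly() generates orthogonal polynomial contrasts for equally spaced levels, and its columns are simply re-scaled, unit-length versions of the integer values above):

# orthogonal polynomial contrasts for 5 equally spaced treatment levels;
# columns .L, .Q, .C and ^4 correspond to the Linear, Quadratic, Cubic
# and Quartic contrasts, each scaled to unit length
contr.poly(5)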

We can use these coefficients to construct our F-test, using the glht function from the multcomp package.

My recommendation is to always start by estimating the various fitted values and then construct the various contrasts amongst linear functions of these fitted values; it will, in the long run, save mistakes. In my opinion it is all too easy to make mistakes in deducing the contrast coefficients, and R (and glht) will gladly allow one to compute a completely erroneous set of contrasts and hence a garbage F-test, all without any indication that the results are meaningless.

The 5 fitted values, corresponding to the 5 treatments, can be estimated using K' matrices thus (see my notes for Statistical Methods II).


 mu + t1 = (1  0 0 0 0)
 mu + t2 = (1  1 0 0 0)
 mu + t3 = (1  0 1 0 0)
 mu + t4 = (1  0 0 1 0)
 mu + t5 = (1  0 0 0 1)

We can use the esticon function, from the package doBy, to estimate these fitted values, which correspond to the least squares means. See the section of the R notes on a CRD for more details about using the esticon function.
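
For example, a minimal sketch (Kfit is simply the name used here for the 5 rows written out above):

# esticon() comes from the doBy package
library(doBy)

# K' matrix: each row estimates one fitted value (mu + trt effect)
Kfit <- rbind(c(1, 0, 0, 0, 0),
              c(1, 1, 0, 0, 0),
              c(1, 0, 1, 0, 0),
              c(1, 0, 0, 1, 0),
              c(1, 0, 0, 0, 1))

# the estimates should match the least squares means shown earlier
esticon(lm1, Kfit)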

From these 5 K' matrices above, let us see what we get when we make use of the tabulated orthogonal polynomial contrast coefficients.

So, starting with the Cubic K matrix, take the coefficient for Treatment 1 ( -1 ) and multiply by the line for the fitted value for treatment 1:

 -1 * (1  0 0 0 0) =  (-1  0 0 0 0)

then take the coefficient for Treatment 2 ( 2 ) and multiply by the line for the fitted value for treatment 2:

 2 * (1  1 0 0 0) = (2  2 0 0 0)

then take the coefficient for Treatment 3 ( 0 ) and multiply by the line for the fitted value for treatment 3:

 0 * (1  0 1 0 0) = (0  0 0 0 0)

then take the coefficient for Treatment 4 ( -2 ) and multiply by the line for the fitted value for treatment 4:

 -2 * (1  0 0 1 0) = (-2  0 0 -2 0)

then take the coefficient for Treatment 5 ( 1 ) and multiply by the line for the fitted value for treatment 5:

 1 * (1  0 0 0 1) = (1  0 0 0 1)

Now add these together:

   (-1  0 0 0 0)
   (2  2 0 0 0)
   (0  0 0 0 0)
   (-2  0 0 -2 0)
   (1  0 0 0 1)

 = (0  2 0 -2 1)

this is the K' construct for the coefficients for the Cubic orthogonal polynomial under our (R) equation arrangement.

We can do likewise for the Quartic effect; we obtain

  (0  -4 6 -4 1)

this is the K' construct for the coefficients for the Quartic orthogonal polynomial under our (R) equation arrangement. Our R code could thus be:


Cubic   <- c(0,  2, 0, -2, 1)
Quartic <- c(0,  -4, 6, -4, 1)

# combine together as a 2-row K' matrix
#
K <- rbind(Cubic,Quartic)
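
As a cross-check on the hand calculation above, the same two rows can be obtained by matrix multiplication (a sketch; Kfit is the 5-row fitted-value matrix written out earlier):

# multiply the tabulated coefficients by the K' matrix of fitted values;
# the result should reproduce the Cubic and Quartic rows of K above
Kfit <- rbind(c(1, 0, 0, 0, 0),
              c(1, 1, 0, 0, 0),
              c(1, 0, 1, 0, 0),
              c(1, 0, 0, 1, 0),
              c(1, 0, 0, 0, 1))
rbind(Cubic   = c(-1,  2, 0, -2, 1) %*% Kfit,
      Quartic = c( 1, -4, 6, -4, 1) %*% Kfit)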

We can then use the glht function from the multcomp package:

# glht() and Ftest() come from the multcomp package
library(multcomp)

kb <- glht(lm1, linfct = K)

summary(kb, test = Ftest())

The results from this are:


	 General Linear Hypotheses

Linear Hypotheses:
             Estimate
Cubic == 0     -5.000
Quartic == 0   -2.533

Global Test:
       F DF1 DF2 Pr(>F)
1 0.5239   2  10 0.6076

We see that the F-test for the Marginal combined effect of the Cubic and Quartic effects, over and above any Linear and Quadratic components, gives a calculated F-value of 0.5239, which, with 2 d.f. for the numerator and 10 d.f. for the denominator, gives a probability of 0.6076. This means that there is a 60% probability of obtaining an F-value as large as, or larger than, the calculated 0.52 when the Null Hypothesis is correct. We can therefore conclude that there is no significant improvement in the goodness of fit of the model from including the Cubic and Quartic over and above the Linear and Quadratic effects; this means that the Classification model is not statistically significantly better than the regression model. Thus we can drop the classification approach and re-run our analyses with Linear and Quadratic regressions, at which point we shall then examine whether the quadratic effect is significant, or not.

So, we can fit the below model:


# model 2: linear and quadratic regressions on dmi (kept as a numeric variable)
lm2 <- lm(my ~ dmi + I(dmi^2), data = ds)
anova(lm2)
summary(lm2)

Note how we use the variable dmi, which was left as a numeric value, whereas trt was declared as a factor effect.

Analysis of Variance Table

Response: my
          Df Sum Sq Mean Sq  F value    Pr(>F)    
dmi        1 730.13  730.13 106.8726 2.498e-07 ***
I(dmi^2)   1  36.21   36.21   5.3008   0.04003 *  
Residuals 12  81.98    6.83                       
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Call:
lm(formula = my ~ dmi + I(dmi^2), data = ds)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5886 -2.3267 -0.0886  1.8876  4.2448 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) -7.040000   5.998405  -1.174  0.26331   
dmi          1.236190   0.326161   3.790  0.00258 **
I(dmi^2)    -0.009286   0.004033  -2.302  0.04003 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.614 on 12 degrees of freedom
Multiple R-squared:  0.9034,	Adjusted R-squared:  0.8873 
F-statistic: 56.09 on 2 and 12 DF,  p-value: 8.145e-07

We see that the Marginal effect of the quadratic term in dmi [shown as I(dmi^2)] is statistically significant. We shall therefore retain the Quadratic effect of dmi, and thus also keep the Linear effect in the model.
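
As an additional cross-check (a sketch, assuming lm1 from the classification analysis above is still in the workspace), the same marginal test of the Cubic and Quartic effects can be obtained by comparing the two nested models directly:

# the quadratic regression (lm2) is nested within the classification
# model (lm1); the extra-sum-of-squares F-test has 2 and 10 d.f. and
# should reproduce the glht result above (F = 0.5239, p = 0.6076)
anova(lm2, lm1)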

NOTE: the use of orthogonal contrasts is appropriate ONLY when the intervals between the [implied] levels of the treatment effect are ALL equal; if the differences are not equal then using orthogonal contrasts will give incorrect results. Thus, in the case where the differences between the levels are not all equal, we should use an over-parameterised model to examine whether the classification model is a significantly better fit over and above a regression model:


# now fit as an over-parameterised model: linear and quadratic regressions
# on dmi, plus trt as a classification factor
lm2 <- lm(my ~ dmi + I(dmi^2) + trt, data = ds)
anova(lm2)

Here we have fitted dmi as linear and quadratic regressions, AND trt as a classification factor. We make use of the fact that the effect of the treatments (trt, as a [classification] factor) is fitted AFTER the linear and quadratic regressions, and hence the Type I Sums of Squares for trt are those over and above the Linear and Quadratic regressions. Indeed, when we examine the resulting ANOVA

> lm2 <- lm(my ~ dmi + I(dmi^2) + trt,data=ds)
> anova(lm2)
Analysis of Variance Table

Response: my
          Df Sum Sq Mean Sq F value    Pr(>F)    
dmi        1 730.13  730.13 98.3919 1.712e-06 ***
I(dmi^2)   1  36.21   36.21  4.8802   0.05164 .  
trt        2   7.78    3.89  0.5239   0.60762    
Residuals 10  74.21    7.42                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>

we see that trt now has only 2 degrees of freedom, which makes sense since we have already used up 2 of the degrees of freedom with our Linear and Quadratic regressions. We can see that the [Marginal] effect of treatments, over and above the Linear and Quadratic regressions, is Not Statistically Significant. We shall not examine the regression effects from this over-parameterised model; instead, we conclude that we can drop the classification effect, re-run the model with the Linear and Quadratic effects and THEN examine the significance of the Quadratic effect.


R.I. Cue ©
Department of Animal Science, McGill University
last updated : 2017 July 29th