Quantitative/Qualitative Data: Classification or Regression ?

Treating Quantitative Data as Regressions or Classification

Many factors are quantitative in nature and cause us to incorporate them in a model as regression effects. For example, the data presented earlier relating leaf burn in tobacco plants to the Nitrogen, Chloride and Phosphorus content. Nitrogen%, Chloride% and Phosphorus% were considered as the independent variables to explain variation in leaf burn. They were considered as Quantitative/Covariate Regression variables. We were looking at the effect of each unit increase (or decrease) in N% on leaf burn; and similarly for the other variables. Nitrogen%, Cl% and P% are quantitative; 5% is half of 10%, which is half of 20%, etc.

Many factors are qualitative in nature and cause us to incorporate them in a model as classification effects (qualitative variables with classification 'levels'). Classification, or qualitative, variables cannot be considered as quantities. For example, if we are looking at the growth rate of male and female pigs, we can call them M and F, or 1 and 2 respectively (or 2 and 1 also). The 'levels' of sex (M and F) are qualitative in nature, there is no 'quantity' of sex associated with being a Male or a Female. The labels M and F are 'qualities' and serve only to identify or differentiate which animals are of one sex or the other.

The SAS analyses of the data presented in the class notes, with 5 diets and 5 rats per Diet is shown here, Rats!.

Model fitting diet as 5 classes (4 degrees of freedom), a regular classification model, CRD.

The GLM Procedure

Class Level Information
Class	Levels	Values
diet	5	1 2 3 4 5

Number of observations	25

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	4	1985.674400	496.418600	148.53	<.0001
Error	20	66.844000	3.342200
Corrected Total	24	2052.518400

We can see here the ANOVA table produced from the data and SAS PROC GLM code given in the course notes. Note the effect of Diet is statistically significant, F-ratio = 148.53. This is the effect of Diets (5 'levels', 4 degrees of freedom) over and above the Mean, R(Diet|µ)

R-Square	Coeff Var	Root MSE	wt Mean
0.967433	4.479927	1.828168	40.80800

Source	DF	Type I SS	Mean Square	F Value	Pr > F
diet	4	1985.674400	496.418600	148.53	<.0001

Source	DF	Type III SS	Mean Square	F Value	Pr > F
diet	4	1985.674400	496.418600	148.53	<.0001

Parameter	Estimate	Standard Error	t Value	Pr > \|t\|
mu+d1	23.9600000	0.81758180	29.31	<.0001
mu+d2	39.5800000	0.81758180	48.41	<.0001
mu+d3	45.8400000	0.81758180	56.07	<.0001
mu+d4	48.1600000	0.81758180	58.91	<.0001
mu+d5	46.5000000	0.81758180	56.88	<.0001

The GLM Procedure

Least Squares Means

Adjustment for Multiple Comparisons: Scheffe

diet	wt LSMEAN	Standard Error	Pr > \|t\|	LSMEAN Number
1	23.9600000	0.8175818	<.0001	1
2	39.5800000	0.8175818	<.0001	2
3	45.8400000	0.8175818	<.0001	3
4	48.1600000	0.8175818	<.0001	4
5	46.5000000	0.8175818	<.0001	5

Least Squares Means for effect diet Pr > \|t\| for H0: LSMean(i)=LSMean(j) Dependent Variable: wt
i/j	1	2	3	4	5
1		<.0001	<.0001	<.0001	<.0001
2	<.0001		0.0008	<.0001	0.0003
3	<.0001	0.0008		0.4275	0.9872
4	<.0001	<.0001	0.4275		0.7253
5	<.0001	0.0003	0.9872	0.7253

Model fitting a 4^th order polynomial (linear, quadratic, cubic and quartic regressions), providing EXACTLy the same fit, SSR_m, etc.

The GLM Procedure

Number of observations	25

Dependent Variable: wt

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	4	1985.674400	496.418600	148.53	<.0001
Error	20	66.844000	3.342200
Corrected Total	24	2052.518400

R-Square	Coeff Var	Root MSE	wt Mean
0.967433	4.479927	1.828168	40.80800

Source	DF	Type I SS	Mean Square	F Value	Pr > F
diet	1	1439.697800	1439.697800	430.76	<.0001
*dietdiet**	1	529.375000	529.375000	158.39	<.0001
*dietdietdiet*	1	14.472200	14.472200	4.33	0.0505
*dietdietdietdiet**	1	2.129400	2.129400	0.64	0.4341

Source	DF	Type III SS	Mean Square	F Value	Pr > F
diet	1	15.33436985	15.33436985	4.59	0.0447
*dietdiet**	1	5.36020200	5.36020200	1.60	0.2199
*dietdietdiet*	1	2.87482134	2.87482134	0.86	0.3648
*dietdietdietdiet**	1	2.12940000	2.12940000	0.64	0.4341

Parameter	Estimate	Standard Error	t Value	Pr > \|t\|
Intercept	-11.90000000	12.95293171	-0.92	0.3692
diet	50.97166667	23.79644517	2.14	0.0447
*dietdiet**	-18.06250000	14.26275753	-1.27	0.2199
*dietdietdiet*	3.17833333	3.42696943	0.93	0.3648
*dietdietdietdiet**	-0.22750000	0.28501584	-0.80	0.4341

Here we see the data analysed using a 4^th order polynomial, diet, diet², diet³ and diet⁴, and note the we obtain EXACTLY the same F-ratio for the effects. This is because R(diet ,diet², diet³ and diet⁴ |µ) is an equivalence for the model with diets as 5 classes.

Molde fitting BOTH regressions (linear and quadratic) and classification, with the objective of seeing the marginal improvement by the classes over and above the regressions. An 'Over-paramereized Model'.

The GLM Procedure

Class Level Information
Class	Levels	Values
diet	5	1 2 3 4 5

Number of observations	25

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	4	1985.674400	496.418600	148.53	<.0001
Error	20	66.844000	3.342200
Corrected Total	24	2052.518400

R-Square	Coeff Var	Root MSE	wt Mean
0.967433	4.479927	1.828168	40.80800

Source	DF	Type I SS	Mean Square	F Value	Pr > F
d	1	1439.697800	1439.697800	430.76	<.0001
*dd**	1	529.375000	529.375000	158.39	<.0001
diet	2	16.601600	8.300800	2.48	0.1088

Source	DF	Type III SS	Mean Square	F Value	Pr > F
d	0	0.00000000	.	.	.
*dd**	0	0.00000000	.	.	.
diet	2	16.60160000	8.30080000	2.48	0.1088

Here we see another model fitted, this one may look a little strange, but in fact it is a combination of the regression and classification. It allows us to test whether the classification provides a statistically significant improvement in the fit of the model ovef and above the linear and quadratic regressions. What we have done (check the PROC GLM statements, is to include the effect d and d², which correspond to linear and quadratic effects of diet since they were not declared as CLASS variables, BEFORE Diet, which was declared as a class variable. This means that when we come to compute the marginal effect of Diet over and above the other effects in the model (the advantage of the Type III Sums of Squares), we are looking for R(Diet|µ, d, d²). But, and note that this is the 'tricky' bit, Diet is equal to d, d², d³ and d⁴, the 4th order polynomial. Thus this is effectively equivalent to R(d, d², d³, d⁴ | µ,d, d²). Now we have d and d² being fitted twice! Thus when we come to fit Diet, as 5 classes/4 degrees of freedom, we have already fitted d and d², and hence there is nothing new to fit for them. Therefore GLM realises that this is happening (because the rank is 2 d.f. less than one would otherwise think) and only provides us with 2 extra d.f. for Diet, othe and above what we have already effectively fitted; together with the Marginal Sums of Squares. Thus we get a direct and immediate test of whether the 5 classes was better than the linear and quadratic regressions. In this case we can see the the F-ratio for Diet over and above the linear and quadratic regressions was 2.48, which is NOT Statistically Significant. Thus the 5 classes DOES NOT provide a better fit, and hence we can simplify our model to be:

Y_i = b₀ + b₁Diet + b₂ Diet² + e_i

Consider the case when we have data where the independent variable (X) can be considered as a quantitative variable and /or as a classification variable with 'levels'.

For example suppose we have several plots to which we apply various amounts of Nitrogen fertiliser and then record the yield of maize. Let us suppose that we are going to examine the effect of applying Nitrogen at the rate of : 10kg, 20kg, 30kg, 40kg and 50kg.

Random assignment of plots (experimental units) to treatments
N=10, Y=20.5	N=30, Y=43.1	N=50, Y=46.4
N=40, Y=45.0	N=10, Y=21.6	N=20, Y=36.4
N=50, Y=48.4	N=10, Y=22.9	N=30, Y=44.4
N=20, Y=38.5	N=50, Y=47.3	N=30, Y=44.2
N=20, Y=38.0	N=40, Y=45.5	N=40, Y=46.1

We could consider Nitrogen to be a quantitative variable, to be fitted as a Regression covariate(s), or as a Classification factor with 5 'levels' or 'classes'. Run both these analyses and compute, and plot, the fitted values for each.

Possible models :

One-way, CRD

Analysis of Covariance

Comparison of both models

Combine both models

Another example

One-way, Completely Randomised Design

We could analyse this data as a one-way Analysis of Variance from a Completely Randomised Design. There would be the factor 'Nitrogen' with 5 levels. A suitable model would be :

Y_ij = µ + N_i + e_ij

where i = 1 to 5 levels of N

The SAS code would therefore be :

data crdreg1;
input n wt;
cards;
10  20.5
10  21.6
10  22.9
20  36.4
20  38.5
20  38.0
30  43.1
30  44.4
30  44.2
40  45.0
40  45.5
40  46.1
50  46.4
50  47.3
50  48.4
;
proc glm data=crdreg1;
classes n;
model wt = n;
lsmeans n/stderr pdiff;
run;

Compute the estimate of the mean for each Nitrogen level (the Least Squares Mean, i.e. µ + Nitrogen₁, µ + Nitrogen₂, etc).

SAS estimate statements
estimate 'n = 10' intercept 1 n 1 0 0 0 0;
estimate 'n = 20' intercept 1 n 0 1 0 0 0;
etc..

We could also use the concept of contrasts to partition the Sums of Squares for the treatment (Nitrogen) into Linear, Quadratic, Cubic and Quartic components (see Steel, Torrie and Dickey, Ch15.7). This approach is reasonable and appropriate IF AND ONLY IF (IFF) the 'spacing' between the treatments (or the increment from one treatment level to the next) is the same. In this case, with 3 experimental units for all treatments we have a balanced design and the increment from one treatment (level) of Nitrogen to the next is the same (increments of 10kg N). The SAS contrasts are:

contrast 'Cubic and Quartic' n -1 2 0 -2 1,
                             n 1 -4 6 -4 1;

Analysis of Covariance

We could analyse this data as an Analysis of Covariance or Regression experiment with Nitrogen as regressions. Nitrogen could be fitted as linear, quadratic, cubic and quartic regressions. Thus a suitable model would be :

Y_jk = µ + blk_j + b₁ N_k + b₂ N_k² + b₃ N_k³ + b₄ N_k⁴ + e_jk

The SAS code would therefore be :

data crdreg1;
input n wt;
cards;
10  20.5
10  21.6
10  22.9
20  36.4
20  38.5
20  38.0
30  43.1
30  44.4
30  44.2
40  45.0
40  45.5
40  46.1
50  36.4
50  37.3
50  38.4
;
proc glm data=crdreg1;
model wt = n n*n n*n*n n*n*n*n/solution;
estimate 'n, linear ' n 1;
estimate 'n, quadratic ' n*n 1;
estimate 'n, cubic ' n*n*n 1;
estimate 'n, quartic ' n*n*n*n 1;
run;

Compute the estimate of the mean for each amount of Nitrogen (10 kg Nitrogen, 20 kg Nitrogen, etc). You should find that you have estimated exactly the same fitted value as previously when there were 5 classes.

SAS estimate statements
estimate 'N=10' intercept 1 n 10 n*n 100 n*n*n 1000 n*n*n*n 10000;
estimate 'N=20' intercept 1 n 20 n*n 400 n*n*n 8000 n*n*n*n 160000;
etc..

Comparison of both models

SAS code for both models

We can compare the two models. We find that the Sums of Squares for the model are the same in both cases. If we compute the Contrast Sums of Squares for the joint test of all the regression parameters (4), with 4 d.f. we find that they are the same as the Sums of Squares for Nitrogen from the CRD design one-way ANOVA where Nitrogen had 5 levels, and hence 4 d.f. Think about this in the context of a graph where we have 2 points; with 2 points we need a straight line to provide a perfect fit to the the two points, i.e. 1 degree of freedom. If we had 3 points we know that a quadratic (linear+quadratic evidently!) will provide a perfect fit through three points, i.e. 2 degrees of freedom. With 4 points (levels) we will have 3 d.f., and a 3^rd order polynomial (linear + quadratic + cubic) will be exactly equivalent. Likewise, with 5 classes (4 d.f.) a 4^th order polynomial (linear + quadratic + cubic + quartic) is equivalent.

proc glm data=crdreg1;
model wt = n n*n n*n*n n*n*n*n;
contrast 'Regressions' n  1,
                       n*n 1,
                       n*n*n 1,
                       n*n*n*n 1;

We have 4 d.f., whether we have Nitrogen as 5 classes, with 4 d.f., or Nitrogen as a continuous quantitative trait with linear, quadratic, cubic and quartic regressions. The Sums of Squares for 'Regressions' (above) are equal to the Sums of Squares for Nitrogen (as 5 classes). So, fit a contrast for the Marginal Sums of Squares for the Cubic and Quartic regressions over and above the Linear and Quadratic :

contrast 'N3 and N4' n3 1,
                     n4 1;

These Sums of Squares are the Marginal Sums of Squares for adding both Cubic and Quartic regressions, over and above Linear and Quadratic. Now analyse this same data using a Regression model, but rather than fitting Linear, Quadratic, Cubic and Quartic regressions, fit only Linear and Quadratic. Compare the Residual Sums of Squares and d.f.e.. Also plot the fitted values for both cases (5 classes, and Linear and Quadratic regressions, graph.

Compute the difference between the Residual Sums of Squares for the model with only n and n*n, and the Residual Sums of Squares for the model with Nitrogen (as 5 classes). Do the same for the Residual degrees of freedom. What do you notice about these Sums of Squares and degrees of freedom?

Yes! They are the same as those for the joint contrast of the Cubic and Quartic regressions! Thus we can use this technique of fitting the same quantitative trait as classes or regressions to compare whether a simple Linear regression (or Linear + Quadratic regressions) provides a suitable fit vs. whether (5) classes provides a statistically significantly better fit. Linear regression (or Linear + Quadratic regressions) is a more parsimonious model (i.e. simpler), but depending upon the fitted values the class approach may, or may not, yield an improvement.

Analyses of Variance

Model 1, fitting Nitrogen as 5 classes
Source d.f. Sums of Squares Mean Squares F-ratio Pr

SSR_m 4 1316.297 329.07 370.30 0.0001

Residual 10 8.887 0.889

Model 1, fitting Nitrogen as 5 classes
Source	d.f.	Sums of Squares	Mean Squares	F-ratio	Pr
SSR_m	4	1316.297	329.07	370.30	0.0001
Residual	10	8.887	0.889

Model 2, fitting Nitrogen as 4 regressions
Source d.f. Sums of Squares Mean Squares F-ratio Pr

SSR_m 4 1316.297 329.07 370.30 0.0001

N, Type I 1 1054.947 1054.947 1187.11 0.0001

N², Type I 1 231.945 231.945 261.00 0.0001

N³, Type I 1 29.403 29.403 33.09 0.0002

N⁴, Type I 1 0.002 0.002 0.00 0.96

Residual 10 8.887 0.889

Model 2, fitting Nitrogen as 4 regressions
Source	d.f.	Sums of Squares	Mean Squares	F-ratio	Pr
SSR_m	4	1316.297	329.07	370.30	0.0001
N, Type I	1	1054.947	1054.947	1187.11	0.0001
N², Type I	1	231.945	231.945	261.00	0.0001
N³, Type I	1	29.403	29.403	33.09	0.0002
N⁴, Type I	1	0.002	0.002	0.00	0.96
Residual	10	8.887	0.889

Contrast Sums of Squares, d.f. and Mean Squares
Source d.f. Sums of Squares Mean Squares

Regressions 4 1316.297 329.07

N³ and N⁴ 2 29.405 14.703

Contrast Sums of Squares, d.f. and Mean Squares
Source	d.f.	Sums of Squares	Mean Squares
Regressions	4	1316.297	329.07
N³ and N⁴	2	29.405	14.703

Model 3, fitting Nitrogen as Linear and Quadratic regressions
Source d.f. Sums of Squares Mean Squares

SSR_m 2 1286.892

N, Type I 1 1054.947 1054.947

N², Type I 1 231.945 231.945

Residual 12 38.292 3.191

Model 3, fitting Nitrogen as Linear and Quadratic regressions
Source	d.f.	Sums of Squares	Mean Squares
SSR_m	2	1286.892
N, Type I	1	1054.947	1054.947
N², Type I	1	231.945	231.945
Residual	12	38.292	3.191

Comparison of Models 1 and 3
Source d.f. Sums of Squares

Model 3 dfe 12 38.292

Model 1 dfe 10 8.887

3 - 1 2 29.405

Comparison of Models 1 and 3
Source	d.f.	Sums of Squares
Model 3 dfe	12	38.292
Model 1 dfe	10	8.887
3 - 1	2	29.405

Thus we have obtained the degrees of freedom and Sums of Squares for the marginal improvement in going from a model with linear and quadratic regressions to a model with 5 classes (4 d.f.); using either Model 2, or Model 1 and Model 3 and computing the difference in d.f.e. and S.S.E. We can see that there is a statistically significant effect of the combined effect of the cubic and quartic regressions, or equivalently, that there is still a statistically significant improvement in going from a model with only linear and quadratic regressions to one with 5 classes (and hence 4 d.f.). Thus we conclude that Model 1 is the most appropriate. If the marginal effect of the cubic and quartic regressions had not been significant we would have concluded that there was no improvement in going from linear and quadratic regressions to 5 classes and that hence it would have been simpler, and more parsimonious, to use a model with linear and quadratic regressions.

These concepts can be extended to see whether simply a linear regression effect would have been adequate.

Combining Regressions AND Classifications

The approach detailed above is the classical orthogonal polynomial method detailed in Steel, Torrie and Dickey, and inumerable other statistical textbooks. However, as mentioned above, it has the limitation that if the intervals are not equal then it becomes difficult to compute what the coefficients should be. In addition, if we have a factor with several/many levels then the order of the polynomial can rapidly get out of hand, and in addition it will be very easy for us to accidently miss out one of the regressions. A technique that is easily used is to exploit the fact that SAS PROC GLM (and PROC MIXED) allow us to 'overfit' a model, or over-parameterise, by fitting both regressions and classes of the same effect. Obviously we cannot simply fit the variable n as both a classification variable and a regression covariate, because if we specify in PROC GLM that n is a class variable then it cannot simultaneously be a regression covariate. What we can do is create a second variable with exactly the same value as n and then use that as a classification variable and n as the regression. Thus, suppose that we write the model as:

Y_ij = µ + b₁*N_i + b₂*N_i² + grp_i + e_ij

Where N_i is the linear regression covariate and b₁ is the linear regression coefficient; and N_i² is the quadratic regression covariate and b₂ is the quadratic regression coefficient. grp_i is the effect of group (Nitrogen treated as 5 classes, 4 d.f.) over and above the effect of the Linear and Quadratic regression effects.

We have to create the second variable in the SAS data step, thus the SAS code would therefore be :

data crdreg1;
input n wt;
grp = n;
cards;
10  20.5
10  21.6
10  22.9
20  36.4
20  38.5
20  38.0
30  43.1
30  44.4
30  44.2
40  45.0
40  45.5
40  46.1
50  36.4
50  37.3
50  38.4
;
proc glm data=crdreg1;
classes grp;
model wt = n n*n grp;
run;

Another example

Here is another data set to analyse to compare and contrast with the first example. Repeat the analyses, calculations, estimation of fitted values and plotting of points. A group of 15 dairy cows were fed various amounts of feed and their milk yield recorded:

Data     Independent   Dependent
          variable     variable
Cow       Kg feed      Amount of milk
1          20            11.1
2          20            16.6
3          20            15.6
4          30            18.1
5          30            22.8
6          30            21.6
7          40            25.2
8          40            30.1
9          40            26.7
10         50            28.6
11         50            33.7
12         50            35.8
13         60            34.5
14         60            31.4
15         60            33.6

After comparing analyses from this data what are your conclusions?

Additional references

Steel, Torrie and Dickey. Ch. 15.6 and 15.7