Multiple Regression I, R version

Multiple Regression I

See Steel, Torrie and Dickey, Chapter 14. Also review Chapters 10, 11, 12 and 13.

This is a first attempt at an R version of the multiple regression example.

This first problem in multiple regression is taken from Steel, Torrie and Dickey, Chapter 14, Table 14.1. Nitrogen %, Chlorine % and Potassium % were measured in soil samples taken from various plots of tobacco plants. For each plot the leaf burn was measured. The natural logarithm of the leaf burn was used as the dependent variable. The object of the study was to see if the above mentioned factors (N%, Cl% and K%) affected leaf burn.

Alternatively, if one wishes to consider an example using animals let us imagine the same basic data, but supposing that we had 30 cows and we were conducting an experiment to look at the effects of Energy density of the diet and Acid Base Balance (Chlorine % and Potassium %) on milk yield. The cows are randomly assigned and fed different diets each with various Energy densities and differing amounts of Chlorine % and Potassium % and the milk yield for each cow is recorded.

Note that this data is the same as that described earlier in the 'Introduction to the SAS Language'.

Data from Steel, Torrie and Dickey
Nitrogen % Chlorine % Potassium % Log leaf burn

Energy Density Chlorine % Potassium % Milk Yield

3.05 1.45 5.67 0.34

4.22 1.35 4.86 0.11

3.34 0.26 4.19 0.38

3.77 0.23 4.42 0.68

3.52 1.10 3.17 0.18

3.54 0.76 2.76 0.0

3.74 1.59 3.81 0.08

3.78 0.39 3.23 0.11

2.92 0.39 5.44 1.53

3.10 0.64 6.16 0.77

2.86 0.82 5.48 1.17

2.78 0.64 4.62 1.01

2.22 0.85 4.49 0.89

2.67 0.90 5.59 1.40

3.12 0.92 5.86 1.05

3.03 0.97 6.60 1.15

2.45 0.18 4.51 1.49

4.12 0.62 5.31 0.51

4.61 0.51 5.16 0.18

3.94 0.45 4.45 0.34

4.12 1.79 6.17 0.36

2.93 0.25 3.38 0.89

2.66 0.31 3.51 0.91

3.17 0.20 3.08 0.92

2.79 0.24 3.98 1.35

2.61 0.20 3.64 1.33

3.74 2.27 6.50 0.23

3.13 1.48 4.28 0.26

3.49 0.25 4.71 0.73

2.94 2.22 4.58 0.23

Data from Steel, Torrie and Dickey
Nitrogen %	Chlorine %	Potassium %	Log leaf burn
Energy Density	Chlorine %	Potassium %	Milk Yield
3.05	1.45	5.67	0.34
4.22	1.35	4.86	0.11
3.34	0.26	4.19	0.38
3.77	0.23	4.42	0.68
3.52	1.10	3.17	0.18
3.54	0.76	2.76	0.0
3.74	1.59	3.81	0.08
3.78	0.39	3.23	0.11
2.92	0.39	5.44	1.53
3.10	0.64	6.16	0.77
2.86	0.82	5.48	1.17
2.78	0.64	4.62	1.01
2.22	0.85	4.49	0.89
2.67	0.90	5.59	1.40
3.12	0.92	5.86	1.05
3.03	0.97	6.60	1.15
2.45	0.18	4.51	1.49
4.12	0.62	5.31	0.51
4.61	0.51	5.16	0.18
3.94	0.45	4.45	0.34
4.12	1.79	6.17	0.36
2.93	0.25	3.38	0.89
2.66	0.31	3.51	0.91
3.17	0.20	3.08	0.92
2.79	0.24	3.98	1.35
2.61	0.20	3.64	1.33
3.74	2.27	6.50	0.23
3.13	1.48	4.28	0.26
3.49	0.25	4.71	0.73
2.94	2.22	4.58	0.23

Thus a suitable linear model involving the 3 regression effects would be :

model Y_i = b₀ + b₁ X_i1 + b₂ X_i2 + b₃ X_i3 + animal_i + e_i

We have only 1 measurement/animal. Therefore we cannot seperate the effect animal_i (the random effect of animal i) from e_i (the random error associated with measurement i). Thus the effects animal_i and e_i are confounded one with another and will be lumped together as one combined term, e_i, where e_i = animal_i + random error_i. Thus we will re-write the model as

model Y_i = b₀ + b₁ X_i1 + b₂ X_i2 + b₃ X_i3 + animal_i + e_i

data, suitable to read into R

R statements

Here are my initial R statements to read in the data and then display a summaryof the various variables.

FilePath <- "full path to your data directory/"
FileName <- "regress1.txt"

filen <- paste(FilePath,FileName,sep="")

ds<-read.table(filen,header=TRUE)
summary(ds)

I put the path to the data directory in one variable (FilePath) and then concatenate it (using the paste function) together with the actual filename (FileName); that way if I move the whole directory to another location it is very easy to modify. Also, IF I want to write out some of the results to a file in that same directory, then I have the path already specified. In this example, the data file has the variable names as the first row (hence the header=TRUE option on the read.table command). The function summary(ds) provides some simple summary statistics for each variable, such as min, max (for continuous variables) and frequency counts for class (R factors) variables.

Here is the R code for a basic linear model (as per SAS proc glm).

# model 1, all fixed effects
lm1 <- lm(y~x1+x2+x3,data=ds)
lm1
print(lm1,digits=7)
summary(lm1)
anova(lm1)

Unlike SAS, where in glm we define whether a variable is a class (factor) variable, or [otherwise] considered as a continuous [numeric] variable, in R we have to have already defined whether a variable is a factor [using the as.factor() function], or numeric [using the as.numeric() function). In this example we have defined x1, x2 and x3 as numeric (ie regression variables/covariates).

lm1 <- lm(y~x1+x2+x3,data=ds)

This function lm() fits a general linear model, with dependent variable y, and regression variables x1, x2 and x3 as explanatory regression variables. These all come from the data-frame ds. The default error distribution is Normal, alternatives are: binomial and Poisson. The lm function fits a fixed effects linear model; although additional random effects can be fitted these are very much in the classical balance ANOVA schema. Unbalanced mixed models are better analysed using the lmer function from the lmerTest package.

The results are stored in the lm1 object. Typing lm1 is equivalent to using the print(lm1) function (as shown in the subsequent command line. Using the print function allows us to print either all of the components of the lm1 object, or only specific sub-elements thereof. The print function also allows us to control the number of significant digits displayed from the results (useful if we need to obtain more than the default). The anova() command will produce an Analysis of Variance table. It should be borne in mind that the results are a sequential, TypeI sums of Squares model and not a TypeIII Marginal Effects model!

The Sums of Squares for the last term in the specified model (x3) are the Marginal Sums of Squares. This if we want a TypeIII Sums of Square, Marginal model, we should make use of the car package and the Anova function to obtain our ANOVA table.

	Sum Sq	Df	F value	Pr(>F)
x1	2.66	1	58.35	0.0000
x2	1.65	1	36.23	0.0000
x3	1.20	1	26.44	0.0000
Residuals	1.18	26

To obtain various estimable values we can use the esticon package, and theh xtable (to produce the below table, using HMTL output code).

## use esticon to estimate things, to show how to use
# this estimates b1
lambda1 <- c(0,1,0,0)
esticon(lm1,lambda1)
print(xtable(esticon(lm1,lambda1)),type="HTML")

beta0 Estimate Std.Error t.value DF Pr(>|t|) Lower Upper

1 0.00 -0.53 0.07 -7.64 26.00 0.00 -0.67 -0.39