How do we decide whether to include an independent variable in our model, and how do we decide whether to exclude one?
If we have developed, or hypothesised, a model and we find that each of the factors is statistically significant, then we will keep them in our model; there is no problem there!
Consider an experiment in which we think that two factors, X1 and X2, are likely to be important and to have a real effect on our dependent variable (Y). We have a fairly large sample and obtain (estimate ± standard error):
b1 = 0.5 ± 0.05 => tcalc = 0.5/0.05 = 10
and b2 = 0.4 ± 0.02 => tcalc = 0.4/0.02 = 20
These are both statistically significant, so we retain them in the model; there is no problem.
However, if a smaller sample had been used, the standard errors we obtained would likely have been larger. Sampling variances are inversely proportional to the sample size, so a sample only 1/k as large gives a sampling variance k times as large and standard errors larger by a factor of the square root of k. So if we had a sample size only 1/4 as large, then the sampling variance would be 4 times as large and the standard errors would be twice as large. N.B. This comes from basic, introductory statistics.
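The scaling can be written out as a short derivation. This is only a sketch: it assumes that the residual variance and the spread of the X values stay the same when the sample shrinks, and the symbols n (original sample size) and k (the factor by which the sample is reduced) are introduced here for illustration.

```latex
\[
\operatorname{Var}(b) \propto \frac{1}{n}
\quad\Longrightarrow\quad
SE(b) \propto \frac{1}{\sqrt{n}}
\quad\Longrightarrow\quad
\frac{SE_{n/k}(b)}{SE_{n}(b)} = \sqrt{k}
\]
\[
k = 4:\qquad SE_{n/4}(b_1) = \sqrt{4}\times 0.05 = 0.1,
\qquad SE_{n/4}(b_2) = \sqrt{4}\times 0.02 = 0.04
\]
```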
Thus suppose that we had obtained:
b1 = 0.5 ± 0.1 => tcalc = 0.5/0.1 = 5
and b2 = 0.4 ± 0.04 => tcalc = 0.4/0.04 = 10
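The arithmetic above can be repeated for other sample sizes with a few lines of Python. This is a minimal sketch, assuming that the estimates themselves stay the same and only the standard errors change; the function name rescale_t and the reduction factors 25 and 100 are hypothetical and chosen only for illustration.

```python
import math

def rescale_t(b, se, shrink):
    """Return (new_se, new_t) when the sample size is divided by `shrink`.

    Assumes the sampling variance is inversely proportional to the
    sample size, so the standard error grows by sqrt(shrink).
    """
    new_se = se * math.sqrt(shrink)
    return new_se, b / new_se

# Estimates and standard errors from the large-sample example in the text.
coefficients = [("b1", 0.5, 0.05), ("b2", 0.4, 0.02)]

for name, b, se in coefficients:
    for shrink in (1, 4, 25, 100):  # 25 and 100 are illustrative extras
        new_se, t = rescale_t(b, se, shrink)
        print(f"{name}: sample size n/{shrink:<3d} se = {new_se:.3f}  tcalc = {t:.1f}")
```

With a sample 1/4 as large, both t values are still large, but they keep falling as the sample shrinks: at 1/25 of the original sample size tcalc for b1 is down to 2, and at 1/100 it is only 1, even though the estimate is unchanged.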
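Both coefficients are still significant here, but with a still smaller sample b1 might not be. Now, if b1 is not statistically significant (because of the smaller sample size) and we accept H0 (that factor X1 has no effect) and drop X1 from the model, we may seriously bias the estimates for the other factors, particularly when X1 is correlated with them!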
Even if we cannot reject H0, that does not prove that there is no relation. So if we have good reason to believe that X1 (b1) has an effect, then we should be reluctant to eliminate it.
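The bias caused by dropping a factor that really has an effect can be seen in a small simulation. This is a sketch under stated assumptions, not part of the original example: the data are simulated, the true coefficients are set to 0.5 and 0.4 to match the values above, X1 and X2 are made correlated (correlation about 0.8), and the helper names ols_slopes_2 and ols_slope_1 are hypothetical.

```python
import random

def ols_slopes_2(x1, x2, y):
    """OLS slopes for y = a + b1*x1 + b2*x2, from the centered normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def ols_slope_1(x, y):
    """OLS slope for y = a + b*x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(1)   # fixed seed so the run is repeatable
n = 5000                 # assumed sample size for the simulation

# X2 is standard normal; X1 is built to be correlated with X2.
x2 = [rng.gauss(0, 1) for _ in range(n)]
x1 = [0.8 * v + 0.6 * rng.gauss(0, 1) for v in x2]
# True model: b1 = 0.5, b2 = 0.4, plus random error.
y = [0.5 * a + 0.4 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

b1_full, b2_full = ols_slopes_2(x1, x2, y)   # both factors in the model
b2_reduced = ols_slope_1(x2, y)              # X1 wrongly eliminated

print(f"full model:    b1 = {b1_full:.3f}, b2 = {b2_full:.3f}")
print(f"reduced model: b2 = {b2_reduced:.3f}")
```

In this setup the estimate of b2 from the reduced model is expected to be near 0.4 + 0.5 × 0.8 = 0.8 rather than 0.4, because X2 picks up the part of the X1 effect that it is correlated with. If X1 and X2 were uncorrelated, dropping X1 would not bias b2 in this way.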
