Analysis of Variance

1  Two-Way ANOVA

To express the idea of an interaction in the R modeling language, we need to introduce two new operators. The colon (:) indicates an interaction between two or more variables in a model formula. The asterisk (*) indicates all main effects and interactions among the variables that it joins. So, for example, the term A*B would expand to the three terms A, B, and A:B. As an example of a two-way ANOVA, consider a study to determine the effects of physical activity on obesity. Subjects were rated for their physical activity on a three-point scale with 1=not very active, 2=somewhat active, and 3=very active. In addition, the race (either 1 or 2) of each participant was recorded, along with their Body Mass Index (BMI). We want to answer the following three questions:
  1. Were the means for BMI the same for the two races?
  2. Were the means for BMI the same for the three activity levels?
  3. Is the effect of activity level different depending on race or, equivalently, is the effect of race different depending on activity level?
The first two questions can be answered by looking at the race and activity main effects, while the third question describes the race by activity interaction. The data can be found at http://www.stat.berkeley.edu/classes/s133/data/activity.csv. Here are the R statements to run the ANOVA:
> activity = read.csv('activity.csv')
> activity$race = factor(activity$race)
> activity$activity = factor(activity$activity)
> activity.aov = aov(bmi~race*activity,data=activity)
> summary(activity.aov)
                Df Sum Sq Mean Sq  F value  Pr(>F)    
race             1   3552    3552 102.5894 < 2e-16 ***
activity         2   2672    1336  38.5803 < 2e-16 ***
race:activity    2    301     151   4.3508 0.01303 *  
Residuals     1865  64574      35                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
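As noted above, the * operator is shorthand for the main effects plus their interaction, so the formula could also have been written out in full. A quick way to convince yourself of this (a sketch, assuming the activity data frame has been prepared as above) is to fit the expanded formula and compare the two tables:

> # Sketch: the expanded formula should reproduce the bmi~race*activity fit above
> activity.aov2 = aov(bmi ~ race + activity + race:activity, data=activity)
> summary(activity.aov2)   # expect the same ANOVA table as summary(activity.aov)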
Notice that there are two degrees of freedom for activity - this means two parameters will be estimated in order to explain activity's effect on bmi. Unlike linear regression, where a single slope parameter would force the relationship to be linear, using two parameters (to account for the three levels of activity) lets each activity level have its own mean, providing more flexibility than would be possible with linear regression. To see if the analysis was reasonable, we can look at the default plots:
> plot(activity.aov)
The graphs appear below:

[Figure: default diagnostic plots for activity.aov]

There seems to be some deviation from normality when looking at the Normal Q-Q plot (recall that, if the residuals did follow a normal distribution, we would see a straight line). When this situation arises, analyzing the logarithm of the dependent variable often helps. Here are the same results for the analysis of log(bmi):
> activity1.aov = aov(log(bmi)~race*activity,data=activity)
> summary(activity1.aov)
                Df Sum Sq Mean Sq  F value    Pr(>F)    
race             1  4.588   4.588 100.3741 < 2.2e-16 ***
activity         2  3.251   1.625  35.5596  6.98e-16 ***
race:activity    2  0.317   0.158   3.4625   0.03155 *  
Residuals     1865 85.240   0.046                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> plot(activity1.aov)
[Figure: diagnostic plots for the log(bmi) model]

The Q-Q plot looks better, so this model is probably more appropriate. We can see that both main effects as well as the interaction are significant. To see what's happening with the main effects, we can use aggregate:
> aggregate(log(activity$bmi),activity['race'],mean)
  race        x
1    1 3.122940
2    2 3.222024
> aggregate(log(activity$bmi),activity['activity'],mean)
  activity        x
1        1 3.242682
2        2 3.189810
3        3 3.109518
Race 2 has higher values of BMI than race 1, and BMI decreases as the level of activity increases. To study the interaction, we could use aggregate, passing both race and activity as the second argument:
> aggregate(log(activity$bmi),activity[c('race','activity')],mean)
  race activity        x
1    1        1 3.161119
2    2        1 3.298576
3    1        2 3.140970
4    2        2 3.230651
5    1        3 3.084426
6    2        3 3.143478
The arrangement of the output from tapply may be more helpful:
> tapply(log(activity$bmi),activity[c('race','activity')],mean)
    activity
race        1        2        3
   1 3.161119 3.140970 3.084426
   2 3.298576 3.230651 3.143478
It's usually difficult to judge relationships like this from a table. One useful tool in this case is an interaction plot. An interaction plot has one point for each combination of the factors defined by an interaction. The x-axis represents the levels of one of the factors, the y-axis represents the mean of the dependent variable, and a separate line is drawn for each level of the factor not represented on the x-axis. While it wouldn't be too hard to produce such a plot with basic commands in R (a sketch of that approach appears at the end of this section), the process is automated by the interaction.plot function. The first argument to this function is the factor to appear on the x-axis; the second is the factor which will define the multiple lines being drawn; and the third argument is the dependent variable. By default, interaction.plot uses the mean for its display, but you can provide a function of your own choosing through the fun= argument. For the activity data, we can produce an interaction plot with the following code:
> with(activity,interaction.plot(activity,race,log(bmi)))
Here's the plot:

[Figure: interaction plot of log(bmi) by activity and race]

It can be seen that the interaction is due to the fact that the slope of the line for race 2 is steeper than the line for race 1.
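As mentioned above, a display like this can also be built from basic commands. Here is a minimal sketch for the activity data: it computes the cell means with tapply and then draws one line per race with matplot:

> # Sketch: a hand-rolled interaction plot for the activity data
> means = with(activity, tapply(log(bmi), list(activity, race), mean))
> # rows of means are the three activity levels, columns are the two races
> matplot(means, type='b', pch=1:2, lty=1:2, xlab='activity', ylab='mean of log(bmi)')
> legend('topright', legend=colnames(means), pch=1:2, lty=1:2, title='race')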

2  Another Example

This example has to do with iron retention in mice. Two different treatments, each at three different levels, were fed to mice. The treatments were tagged with radioactive iron, so that the percentage of iron retained could be measured after a fixed period of time. The data is presented in a table as follows:
---------------------------------------------------------
            Fe2+                        Fe3+
---------------------------------------------------------
  high   medium    low        high   medium    low
---------------------------------------------------------
  0.71     2.20   2.25        2.20     4.04    2.71
  1.66     2.93   3.93        2.69     4.16    5.43
  2.01     3.08   5.08        3.54     4.42    6.38
  2.16     3.49   5.82        3.75     4.93    6.38
  2.42     4.11   5.84        3.83     5.49    8.32
  2.42     4.95   6.89        4.08     5.77    9.04
  2.56     5.16   8.50        4.27     5.86    9.56
  2.60     5.54   8.56        4.53     6.28   10.01
  3.31     5.68   9.44        5.32     6.97   10.08
  3.64     6.25  10.52        6.18     7.06   10.62
  3.74     7.25  13.46        6.22     7.78   13.80
  3.74     7.90  13.57        6.33     9.23   15.99
  4.39     8.85  14.76        6.97     9.34   17.90
  4.50    11.96  16.41        6.97     9.91   18.25
  5.07    15.54  16.96        7.52    13.46   19.32
  5.26    15.89  17.56        8.36    18.40   19.87
  8.15    18.30  22.82       11.65    23.89   21.60
  8.24    18.59  29.13       12.45    26.39   22.25
---------------------------------------------------------
Thus, before we can perform analysis on the data, it needs to be rearranged. To do this, we can use the reshape function. Since there are two different sets of variables that represent the change in the factors of the experiment, we first read in the data (skipping the header), and create two groups of variables in our call to reshape:
> iron0 = read.table('iron.txt',skip=5,nrows=18)
> names(iron0) = c('Fe2high','Fe2medium','Fe2low','Fe3high','Fe3medium','Fe3low')
> iron1 = reshape(iron0,varying=list(1:3,4:6),direction='long')
> head(iron1)
    time Fe2high Fe3high id
1.1    1    0.71    2.20  1
2.1    1    1.66    2.69  2
3.1    1    2.01    3.54  3
4.1    1    2.16    3.75  4
5.1    1    2.42    3.83  5
6.1    1    2.42    4.08  6
After examining the data, it can be seen that the high, medium, and low values have been translated into the values 1, 2, and 3 in the variable time. The id variable is created to help us see which line each observation came from; it is not relevant in this case, since the table was just used to present the data, and the values in the table don't represent repeated measures on the same experimental unit. Next, we eliminate the id column, rename the column named "time", and further reshape the data to represent the two treatments:
> iron1$id = NULL
> names(iron1)[1] = 'level'
> iron = reshape(iron1,varying=list(2:3),direction='long')
> head(iron)
    level time Fe2high id
1.1     1    1    0.71  1
2.1     1    1    1.66  2
3.1     1    1    2.01  3
4.1     1    1    2.16  4
5.1     1    1    2.42  5
6.1     1    1    2.42  6
All that's left is to remove the id column and to rename time and Fe2high:
> iron$id = NULL
> names(iron)[2:3] = c('treatment','retention')
> head(iron)
    level treatment retention
1.1     1         1      0.71
2.1     1         1      1.66
3.1     1         1      2.01
4.1     1         1      2.16
5.1     1         1      2.42
6.1     1         1      2.42
Once the data has been reshaped, it's essential to make sure that the independent variables are correctly stored as factors:
> sapply(iron,class)
    level treatment retention
"integer" "integer" "numeric"
Since treatment and level are not factors, we must convert them:
> iron$treatment = factor(iron$treatment,labels=c('Fe2+','Fe3+'))
> iron$level = factor(iron$level,labels=c('high','medium','low'))
> head(iron)
    level treatment retention
1.1  high      Fe2+      0.71
2.1  high      Fe2+      1.66
3.1  high      Fe2+      2.01
4.1  high      Fe2+      2.16
5.1  high      Fe2+      2.42
6.1  high      Fe2+      2.42
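Before fitting the model, it can be worth a quick check that the reshaping produced a balanced design: the original table had 18 rows in each of its six columns, so each level/treatment combination should contain 18 observations (108 rows in all). A minimal sketch, using the iron data frame built above:

> # Sketch: confirm 18 observations in each of the 6 level/treatment cells
> table(iron$level, iron$treatment)
> dim(iron)   # should show 108 rows and 3 columns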
Now we can perform the ANOVA:
> iron.aov = aov(retention~level*treatment,data=iron)
> summary(iron.aov)
                 Df  Sum Sq Mean Sq F value    Pr(>F)    
level             2  983.62  491.81 17.0732 4.021e-07 ***
treatment         1   62.26   62.26  2.1613    0.1446    
level:treatment   2    8.29    4.15  0.1439    0.8661    
Residuals       102 2938.20   28.81                      
Before proceeding further, we should examine the ANOVA plots to see if the data meets the assumptions of ANOVA:

[Figure: diagnostic plots for iron.aov]

Both the normal Q-Q plot and the scale-location plot indicate problems similar to the previous example, and a log transformation is once again suggested. This is not unusual when data is measured as percentages or ratios.
> ironl.aov = aov(log(retention)~level*treatment,data=iron)
> par(mfrow=c(2,2))
> plot(ironl.aov)
[Figure: diagnostic plots for ironl.aov]

The plots look much better, so we'll continue with the analysis of the log of retention.
> summary(ironl.aov)
                 Df Sum Sq Mean Sq F value   Pr(>F)    
level             2 15.588   7.794 22.5241 7.91e-09 ***
treatment         1  2.074   2.074  5.9931  0.01607 *  
level:treatment   2  0.810   0.405  1.1708  0.31426    
Residuals       102 35.296   0.346                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since there were only two levels of treatment, the significant treatment effect means the two treatments were different. We can use the TukeyHSD function to see which of the three levels differed from each other:
> TukeyHSD(ironl.aov,'level')
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = log(retention) ~ level * treatment, data = iron)

$level
                  diff        lwr       upr     p adj
medium-high  0.5751084 0.24533774 0.9048791 0.0002042
low-high     0.9211588 0.59138806 1.2509295 0.0000000
low-medium   0.3460503 0.01627962 0.6758210 0.0373939
It appears that the high level had much lower retention values than the other two levels:
> aggregate(log(iron$retention),iron['level'],mean)
   level        x
1   high 1.420526
2 medium 1.995635
3    low 2.341685
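Because these means are on the log scale, it can help to exponentiate them to express the differences on the original percentage scale; the back-transformed values are geometric rather than arithmetic means. A minimal sketch, reusing the aggregate call above:

> # Sketch: back-transform the log-scale means to geometric mean retention percentages
> logmeans = aggregate(log(iron$retention), iron['level'], mean)
> data.frame(level=logmeans$level, geom.mean=exp(logmeans$x))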
Although there was no significant interaction, an interaction plot can still be useful in visualizing what happens in an experiment:
> interaction.plot(iron$level,iron$treatment,log(iron$retention))
[Figure: interaction plot of log(retention) by level and treatment]

3  More Complex Models

When working with the wine data frame, we've separated the categorical variable (Cultivar) from the continuous variables for pedagogical reasons, but the aov function can accommodate both in the same model. Let's add the Cultivar variable to the regression model we've previously worked with:
> wine.new = lm(Alcohol~Cultivar+Malic.acid+Alkalinity.ash+Proanthocyanins+Color.intensity+OD.Ratio+Proline,data=wine)
> summary(wine.new)

Call:
lm(formula = Alcohol ~ Cultivar + Malic.acid + Alkalinity.ash +
    Proanthocyanins + Color.intensity + OD.Ratio + Proline, data = wine)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.13591 -0.31737 -0.02623  0.33229  1.65633 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     12.9158487  0.4711149  27.415  < 2e-16 ***
Cultivar2       -0.9957910  0.1776136  -5.607 8.26e-08 ***
Cultivar3       -0.6714047  0.2396380  -2.802  0.00568 ** 
Malic.acid       0.0559472  0.0410860   1.362  0.17510    
Alkalinity.ash  -0.0133598  0.0134499  -0.993  0.32198    
Proanthocyanins -0.0561493  0.0817366  -0.687  0.49305    
Color.intensity  0.1135452  0.0270097   4.204 4.24e-05 ***
OD.Ratio         0.0494695  0.0987946   0.501  0.61721    
Proline          0.0002391  0.0002282   1.048  0.29629    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4886 on 169 degrees of freedom
Multiple R-Squared: 0.6541,     Adjusted R-squared: 0.6377 
F-statistic: 39.95 on 8 and 169 DF,  p-value: < 2.2e-16 
One problem with the summary display for models like this is that it's treating our factor variable (Cultivar) as two separate variables. While that is the way it is fit in the model, it's usually more informative to combine the effects of the two variables as a single effect. The anova command will produce a more traditional ANOVA table:
> anova(wine.new)
Analysis of Variance Table

Response: Alcohol
                 Df Sum Sq Mean Sq  F value    Pr(>F)    
Cultivar          2 70.795  35.397 148.2546 < 2.2e-16 ***
Malic.acid        1  0.013   0.013   0.0552    0.8146    
Alkalinity.ash    1  0.229   0.229   0.9577    0.3292    
Proanthocyanins   1  0.224   0.224   0.9384    0.3341    
Color.intensity   1  4.750   4.750  19.8942 1.488e-05 ***
OD.Ratio          1  0.031   0.031   0.1284    0.7206    
Proline           1  0.262   0.262   1.0976    0.2963    
Residuals       169 40.351   0.239                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary display contained other useful information, so you shouldn't hesitate to look at both. Comparing these results to our previous regression, we can see that only one variable (Color.intensity) is still significant, and that the effect of Cultivar is very significant. For this data set, it means that while we can use the chemical composition to help predict the Alcohol content of the wines, knowing the Cultivar will be more effective. Let's look at a reduced model that uses only Cultivar and Color.intensity to see how it compares with the model containing the extra variables:
> wine.new1 = lm(Alcohol~Cultivar+Color.intensity,data=wine)
> summary(wine.new1)

Call:
lm(formula = Alcohol ~ Cultivar + Color.intensity, data = wine)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.12074 -0.32721 -0.04133  0.34799  1.54962 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     13.14845    0.14871  88.417  < 2e-16 ***
Cultivar2       -1.20265    0.10431 -11.530  < 2e-16 ***
Cultivar3       -0.79248    0.10495  -7.551 2.33e-12 ***
Color.intensity  0.10786    0.02434   4.432 1.65e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4866 on 174 degrees of freedom
Multiple R-Squared: 0.6468,     Adjusted R-squared: 0.6407 
F-statistic: 106.2 on 3 and 174 DF,  p-value: < 2.2e-16 
The adjusted R-squared for this model is better than that of the previous one, indicating that removing those extra variables didn't seem to cause any problems. To formally test whether there is a difference between the two models, we can use the anova function. When passed a single model object, anova prints an ANOVA table, but when it's passed two model objects, it performs a test to compare the two models:
> anova(wine.new,wine.new1)
Analysis of Variance Table

Model 1: Alcohol ~ Cultivar + Malic.acid + Alkalinity.ash + Proanthocyanins +
    Color.intensity + OD.Ratio + Proline
Model 2: Alcohol ~ Cultivar + Color.intensity
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    169 40.351                           
2    174 41.207 -5    -0.856 0.7174 0.6112
The test indicates that there's no significant difference between the two models. When all the independent variables in our model were categorical (the race/activity example), the interactions between the categorical variables were one of the most interesting parts of the analysis. What does an interaction between a categorical variable and a continuous variable represent? Such an interaction can tell us if the slope of the continuous variable is different for the different levels of the categorical variable. In the current model, we can test to see if the slopes are different by adding the term Cultivar:Color.intensity to the model:
> wine.new2 = lm(Alcohol~Cultivar+Color.intensity+Cultivar:Color.intensity,data=wine)
> anova(wine.new2)
Analysis of Variance Table

Response: Alcohol
                          Df Sum Sq Mean Sq  F value    Pr(>F)    
Cultivar                   2 70.795  35.397 149.6001 < 2.2e-16 ***
Color.intensity            1  4.652   4.652  19.6613 1.644e-05 ***
Cultivar:Color.intensity   2  0.509   0.255   1.0766     0.343    
Residuals                172 40.698   0.237                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There doesn't seem to be a significant interaction.
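Even though the interaction isn't significant here, it can help to see what "different slopes" would look like graphically. The following is a minimal sketch, assuming the wine data frame with Cultivar stored as a factor: it plots Alcohol against Color.intensity and adds a separately fitted line for each Cultivar; whether those slopes differ is what the Cultivar:Color.intensity term tests.

> # Sketch: scatterplot of Alcohol vs Color.intensity with one fitted line per Cultivar
> plot(wine$Color.intensity, wine$Alcohol, col=as.numeric(wine$Cultivar), xlab='Color.intensity', ylab='Alcohol')
> abline(lm(Alcohol~Color.intensity, data=subset(wine, Cultivar == levels(wine$Cultivar)[1])), col=1)
> abline(lm(Alcohol~Color.intensity, data=subset(wine, Cultivar == levels(wine$Cultivar)[2])), col=2)
> abline(lm(Alcohol~Color.intensity, data=subset(wine, Cultivar == levels(wine$Cultivar)[3])), col=3)
> legend('topleft', legend=paste('Cultivar', levels(wine$Cultivar)), col=1:3, lty=1)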