Description of numeric improvement in Analysis ToolPak ANOVA tools in Excel 2003 (829215)



The information in this article applies to:

  • Microsoft Office Excel 2003

SUMMARY

This article describes numeric improvements in each of the three Analysis ToolPak ANOVA tools and illustrates inaccurate results in Excel 2002 and earlier in extreme situations.

MORE INFORMATION

Many functions require the calculation of the sum of squared deviations about a mean. To do so accurately, Excel 2003 uses a two-pass procedure that finds the mean on the first pass, and then calculates squared deviations about the mean on the second pass. In precise arithmetic, the same result occurs in earlier versions of Excel that use the "calculator formula" (so named because it was in widespread use when statisticians used calculators instead of computers). With the calculator formula, earlier versions sum the squares of the observations, and then subtracts from this total the following quantity:
((sum of observations)^2) / number of observations
This calculation occurs in a single pass through the data.

In finite precision arithmetic, the calculator formula is subject to roundoff errors in extreme cases. Excel 2002 and earlier use the calculator formula for most functions that require a sum of squared deviations about a mean (such as VAR, STDEV, SLOPE, and PEARSON). However, these versions of Excel also use the more numerically robust two-pass procedure for the CORREL, COVAR, and DEVSQ functions.

Experts in statistical computing recommend not using the calculator formula. The calculator formula is presented as "how not to do it" in texts about statistical computing. Unfortunately, all three of the Analysis ToolPak (ATP) ANOVA tools make widespread use of the calculator formula or an equivalent single pass approach in Excel 2002 and earlier.

Excel 2003 uses the two-pass procedure for all three ATP ANOVA models. This article discusses these computational improvements in ATP's three ANOVA models:
  1. Single Factor
  2. Two-Factor with Replication
  3. Two-Factor without Replication
This article discusses these models later.

Because Excel has always used the two-pass procedure with DEVSQ, this article makes frequent use of it to describe the improved procedures. These revised procedures either effectively call DEVSQ or use code whose functionality is exactly the same as DEVSQ's functionality.

For each ANOVA tool, ATP output contains a Summary table with values of Count, Sum, Average, and Variance, and an ANOVA table that has various sums of squares and values of SS, df, MS, F and P-value. Results in the summary table are calculated by calling Excel functions COUNT, SUM, AVERAGE, and VAR. Of these four functions, only VAR is subject to roundoff errors. Excel 2002 and earlier versions implement VAR with the calculator formula. The article about VAR describes the improvements for Excel 2003 and permits the reader to experiment with numeric data to see when roundoff errors are likely to occur in earlier versions.

For additional information about VAR, click the following article number to view the article in the Microsoft Knowledge Base:

826112 Excel Statistical Functions: VAR



As it discusses the three ANOVA models, this article focuses on the ANOVA output tables. In each case, the Summary tables are well-behaved in Excel 2003. In Excel 2002 and earlier, problems occur in the Variance column when data have extreme values. However, this article includes the Summary tables in the model sections because they are useful for comparison purposes when you review the modified examples in the Appendix.

Model 1: Single Factor

This is a simple example with data.
ANOVA 1 BASIC MODEL:
123
244
365
486
57
68
Anova: Single Factor
SUMMARY
GroupsCountSumAverageVariance
Column 16213.53.5
Column 242056.666667
Column 36335.53.5
ANOVA
Source of VariationSSdfMSFP-valueF crit
Between Groups12.7526.3751.5068180.2578973.805567
Within Groups55134.230769
Total67.7515
Excel 2002 and earlier use the following pseudocode to calculate the sums of squares:
GrandSum = 0;
GrandSumOfSqs = 0; 
GrandSampleMeanSqrd = 0; 
GrandMeanSqrd = 0; 
GrandSampleSize = 0;

For s = 1 to Number_of_Samples do
   GrandSum = GrandSum + sum of observations in s-th sample;
   GrandSumOfSqs = GrandSumOfSqs + sum of squared observations in s-th sample;
   GrandSampleMeanSqrd = GrandSampleMeanSqrd  +
      (sum of observations in s-th sample^2)/size of s-th sample;
   GrandSampleSize = GrandSampleSize + size of s-th sample
Endfor;

GrandMeanSqrd = (GrandSum^2) / GrandSampleSize;

TotalSS = GrandSumOfSqs - GrandMeanSqrd;
BetweenGroupsSS = GrandSampleMeanSqrd - GrandMeanSqrd;
WithinGroupsSS = GrandSumOfSqs - GrandSampleMeanSqrd;
This approach is essentially the calculator formula: compute the sums of squares of observations, and then subtract a quantity from them, just as VAR computes the sum of squares of the observations, and then subtracts sum of observations^2/sample size. Similar pseudocode for the model 2 and model 3 has been omitted. Again, for model 2 and model 3, sums of squares are calculated and a quantity is subtracted from the sum of squares as in the calculator formula. Unfortunately, basic statistics texts frequently suggest approaches for ANOVA such as the one shown earlier in this article.

Excel 2003 uses a different approach to calculate the various entries in the SS column of the ANOVA table. For illustration, this article assumes that the numeric data in the earlier example appear in cells A2:C7 with missing data in cells B6 and B7:
  • Total SS is just DEVSQ applied to all the data, such as DEVSQ(A2:C7). DEVSQ works correctly even though data is missing.
  • Between Groups SS is Total SS minus the sum of DEVSQ applied to each column, such as DEVSQ(A2:A7) + DEVSQ(B2:B7) + DEVSQ(C2:C7).
  • Within Groups SS is Total SS minus Between Groups SS.
If entries in the SS column of the ANOVA table are calculated correctly, the accuracy of the other entries in the table follow.

Model 2: Two-Factor with Replication

Here is a simple example with data.
ANOVA 2 BASIC MODELgroup 1group 2group 3
trial 1123
244
365
trial 2486
5107
6128
Anova: Two-Factor With Replication
SUMMARYgroup 1group 2group 3Total
trial 1
Count3339
Sum6121230
Average2443.333333
Variance1412.5
trial 2
Count3339
Sum15302166
Average51077.333333
Variance1416.25
Total
Count666
Sum214233
Average3.575.5
Variance3.5143.5
ANOVA
Source of VariationSSdfMSFP-valueF crit
Sample72172366.22E-054.747221
Columns37218.59.250.0037093.88529
Interaction924.52.250.1479733.88529
Within24122
Total14217
Again, if entries in the SS column are calculated correctly, the accuracy of all the other entries in the ANOVA part of the output follow.

Here is the Excel 2003 computational procedure. It uses DEVSQ to calculate the various entries in the SS column of the ANOVA table. For illustration, this example assumes that the numeric data appear in cells B2:D7.
  • Total SS is just DEVSQ applied to all the data, such as DEVSQ(B2:D7).
  • Sample SS is Total SS minus the sum of DEVSQ applied to each sample, such as DEVSQ(B2:D4) + DEVSQ(B5:D7).
  • Columns SS is Total SS minus the sum of DEVSQ applied to each column, such as DEVSQ(B2:B7) + DEVSQ(C2:C7) + DEVSQ(D2:D7).
  • Within SS is the sum of DEVSQ applied to each trial or group pair, such as DEVSQ(B2:B4) + DEVSQ(C2:C4) + DEVSQ(D2:D4) + DEVSQ(B5:B7) + DEVSQ(C5:C7) + DEVSQ(D5:D7).
  • Interaction SS equals Total SS minus Sample SS minus Columns SS minus Within SS.

Model 3: Two-Factor without Replication

Here is a simple example with data.
ANOVA 3 BASIC MODEL:LOW MED HI
POOR123
244
365
MID CLASS486
5107
6128
RICH71410
8126
9102
Anova: Two-Factor Without Replication
SUMMARYCountSumAverageVariance
POOR3621
3103.3333331.333333
3144.6666672.333333
MID CLASS31864
3227.3333336.333333
3268.6666679.333333
RICH33110.3333312.33333
3268.6666679.333333
321719
LOW 94557.5
MED 9788.66666716
HI9515.6666676.25
ANOVA
Source of VariationSSdfMSFP-valueF crit
Rows176.6667822.083335.760870.0014762.591094
Columns68.66667234.333338.9565220.0024553.633716
Error61.33333163.833333
Total306.666726
If the values in the SS column are calculated correctly, the accuracy of all the other values in the ANOVA table follows.

Excel 2003 uses the following computational procedure. The procedure uses DEVSQ to calculate the values in the SS column of the ANOVA table. For illustration, this example assumes that the range of cells shown in the earlier example is cells A1:D10. Therefore, the numeric data appear in cells B2:D10.
  • Total SS is just DEVSQ applied to all the data, such as DEVSQ(B2:D10).
  • Rows SS is Total SS minus the sum of DEVSQ applied to each row, such as DEVSQ(B2:D2) + DEVSQ(B3:D3) + DEVSQ(B4:D4) + DEVSQ(B5:D5) + DEVSQ(B6:D6) + DEVSQ(B7:D7) + DEVSQ(B8:D8) + DEVSQ(B9:D9) + DEVSQ(B10:D10).
  • Columns SS is Total SS minus the sum of DEVSQ applied to each column, such as DEVSQ(B2:B10) + DEVSQ(C2:C10) + DEVSQ(D2:D10).
  • Error SS is Total SS minus Rows SS minus Columns SS.

Results in Earlier Versions of Excel

In extreme cases where there are many significant digits in the data but also a small variance, the calculator formula leads to inaccurate results. The Appendix later in this article gives examples of roundoff problems in such extreme situations.

Results in Excel 2003

Excel 2003 uses a procedure that makes two passes through the data. On the first pass, Excel 2003 calculates the sum and count of the data values. From these Excel can calculate the sample mean (average). On the second pass, Excel calculates the squared difference between each data point and the sample mean, and then sums these squared differences. As a result, the results in Excel 2003 are more stable numerically.

Conclusions

A two-pass approach improves the numeric performance in all three ATP ANOVA tools in Office Excel 2003 as compared to earlier versions. Office Excel 2003 results are never less accurate than results in earlier versions.

In most practical cases, however, there is no difference between Office Excel 2003 results and results in earlier versions of Excel because typically, data do not exhibit the kind of unusual behavior that the following Appendix illustrates. numeric instability is most likely to occur in earlier versions of Excel when data contains a high number of significant digits with relatively little variation between data values.

If you use an earlier version of Excel and want to see whether Excel 2003 gives you different ANOVA results, compare the results of using the ANOVA tools in your earlier version with the results of the procedures that use DEVSQ that are described earlier in this article for the ANOVA table associated with each of the tools. To verify that Variances are correct in the Summary table for each range, use DEVSQ(range)/(COUNT(range) - 1).

Appendix: numeric Examples of Excel 2002 and Earlier Performance

For each basic example from models 1, 2, and 3, this article presents the ATP tool's output, including both the Summary and ANOVA tables, earlier. Data was modified in each example to create a "stressed" example. This is done by adding 10^8 to each data value. Adding a constant such as 10^8 to each data value does not affect Variance in the Summary table (but will affect Average and Sum in obvious ways). It should also not affect any entry in the ANOVA table.

If you compare Variances in the Summary tables and SS in the ANOVA tables, you will notice that all of these are incorrectly calculated in all three of the following stressed models except for one entry in model 3 that is pointed to with "<---". In all the stressed cases, Excel 2003 ANOVA results agree with the earlier results in the basic cases (as they should).

ANOVA 1 STRESSED MODEL WITH LARGE DATA VALUES

100000001100000002100000003
100000002100000004100000004
100000003100000006100000005
100000004100000008100000006
100000005100000007
100000006100000008
Anova: Single Factor
SUMMARY
GroupsCountSumAverageVariance
Column 166000000211E+084.8
Column 244000000201E+088
Column 366000000331E+081.6
ANOVA
Source of VariationSSdfMSFP-valueF crit
Between Groups020013.805567
Within Groups64134.923077
Total6415

ANOVA 2 STRESSED MODEL WITH LARGE DATA VALUES

group 1group 2group 3
trial 1100000001100000002100000003
100000002100000004100000004
100000003100000006100000005
trial 2100000004100000008100000006
100000005100000010100000007
100000006100000012100000008
Anova: Two-Factor With Replication
SUMMARYgroup 1group 2group 3Total
trial 1
Count3339
Sum3000000063000000123000000129E+08
Average1000000021000000041000000041E+08
Variance0404
trial 2
Count3339
Sum3000000153000000303000000219E+08
Average1000000051000000101000000071E+08
Variance0406
Total
Count666
Sum600000021600000042600000033
Average100000004100000007100000005.5
Variance4.814.41.6
ANOVA
Source of VariationSSdfMSFP-valueF crit
Sample64164240.0003674.747221
Columns3221660.0156253.88529
Interaction3221660.0156253.88529
Within32122.666666667
Total12817

ANOVA 3 STRESSED MODEL WITH LARGE DATA VALUES

LOW MED HI
POOR100000001100000002100000003
100000002100000004100000004
100000003100000006100000005
MID CLASS100000004100000008100000006
100000005100000010100000007
100000006100000012100000008
RICH100000007100000014100000010
100000008100000012100000006
100000009100000010100000002
Anova: Two-Factor Without Replication
SUMMARYCountSumAverageVariance
Row 133000000061000000020
Row 233000000101000000032
Row 333000000141000000052
Row 433000000181000000064<---
Row 533000000221000000076
Row 6330000002610000000910
Row 7330000003110000001012
Row 8330000002610000000910
Row 9330000002110000000718
Column 199000000451000000058
Column 2990000007810000000914
Column 399000000511000000064
ANOVA
Source of VariationSSdfMSFP-valueF crit
Rows12881620.1132812.591094
Columns3221620.1677723.633716
Error128168
Total28826

Modification Type:MinorLast Reviewed:1/11/2006
Keywords:kbfunctions kbProgramming kbfuncstat kbinfo KB829215 kbAudEndUser