In our previous post, we have discussed what is normal distribution and how to visually identify the normal distribution. However, deeper analysis is require to validate the normality of the data since it is affecting our analysis method. Most us are relying to our advance statistical software such as Minitab, SigmaXL, JMP and many more to validate the data normality. In this post, we will share on normality test using Microsoft Excel.
For the example of the normality test, we’ll use set of data below.
- Creating a histogram using the Analysis ToolPak generates a chart and a data table, as seen below to get the ‘Frequency’ of the ‘Bin’ (Bin size is determined by the analyst)
- Use the Descriptive Statistics option in the Analysis ToolPak to quickly generate descriptive statistics for your data set in Sheet 1.
- Select Data > Data Analysis > Descriptive Statistics
- Click OK
- Click in the Input Range box and select your input range using the mouse.
- In this case, the data is grouped by columns. In most statistical analysis, that will be the case, but if you have data grouped by rows, you should change the Grouped By selection.
- Select to output information in a new worksheet.
- Ensure at least the Summary statistics box is checked. You can also check the Confidence level for mean and the Kth largest and smallest boxes, though that information isn’t required in the Chi-Squared Goodness-of-Fit test, which is the test we are running to test for normality of the data. If you check these extra boxes, Excel will simply provide you with additional information that we won’t be using at this time.
- Click OK.
- Excel returns descriptive summary statistics for your data set in Sheet 3. The information provided are slightly similar to information in Minitab Graphical Summary
For the Chi-Squared Goodness-of-Fit test, you will need to note the sample size (or count), the same standard deviation, and the sample mean. The sample size is the number of items in the data set, which was 50 for this example. If you don’t remember what the sample size was, you can refer to the count listed in the descriptive statistics.
Setup the Hypothesis
The Chi-Squared Goodness-of-Fit test is actually a hypothesis test. That means you are testing the data with regard to a null hypothesis and an alternative hypothesis. The two hypotheses for the Chi-Squared Goodness-of-Fit test are:
- Null hypothesis (Ho): The data is normal.
- Alternative hypothesis (Ha): The data is not normal.
If one is not true, then the other is. In statistical terms, we talk in terms of accepting or rejecting the null hypothesis. If we reject the null, we accept the alternative.
Understand the Chi-Squared Goodness-of-Fit test premise.
Basically, the Chi-Squared Goodness-of-Fit test takes the number of samples in each bin on the histogram and compares that to the number of samples you might expect to find in each bin given a normal curve. Using the actual number of samples in each bin and the expected number of samples, we can calculate what is called the Chi-Square Statistic in Excel. That number then lets us calculate a p-Value. In this case, it is the size of the p-Value that lets us decide whether to accept or reject the hypothesis that the data is normal.
For the purpose of the Chi-Squared Goodness-of-Fit test in this situation, if the p-Value is greater than 0.05, we will accept the null hypothesis that the data is normally distributed.
The Observed Bins
Having created a histogram via the Analysis ToolPak, you already have access to the observed bin distribution. That information is housed in the data table Excel (Sheet 2) creates to make the histogram (refer blue histogram image above)
The Expected Bins
We can use statistics related to the normal curve to calculate how we might expect bins to behave given the median and standard deviation of our sample.
To give you an idea of what is going on with the statistical calculations involved in determining expected size of bins, consider the graphic below.
This graphic roughly depicts the bins from our histogram drawn on the normal curve. Because mathematical formulations exist for determining the area under a curve, it’s possible to determine the area under the curve within a specific bin. Ultimately, that is done by calculating the total area and subtracting portions.
Creating Chi Squared Goodness Fit to Test Data Normality
- We begin with a calculation known as the Cumulative Distribution Function, or CDF. The CDF measures the total area under a curve to the left of the point we are measuring from. For example, the total area under the curve above that is to the left of 45 is 50 percent. Once we know the CDF at each border of our bins, it’s a matter of subtraction to calculate the CDF for each individual bin. For example, the CDF for the bin located between 40 and 45 would equal the CDF of 45 minus the CDF of 40.
One problem with this rough depiction is that the curve drawn above centers on 45, and we know from Excel that our mean is 48.778. We’ll use that number in our calculations to account for the slight shift.
Excel can calculate CDF with the formula:
=NORDIST(x value, Sample Mean, Sample Standard Deviation, TRUE)
- Set up the tables for calculating the CDF of each bin by copying the bin designations onto the descriptive statistics worksheet that Excel previously created for you and creating two columns, one for total CDF and one for bin CDF. Use the image below as an example.
- Enter the formula for calculating CDF into column E, referencing the same mean and standard deviation for each row and using the numbers in D as X.
- For the first row – in our case, the bin marked 10 — the bin-only area is equal to the CDF because there is nothing left of the bin’s upper limit. For all other rows, the bin-only area is the CDF minus the CDF for the bin designation above.
- So, you would enter =E2 in the first data row for column F.
- The second data row would be calculated as E3-E2; the next would be E4-E3, and so forth. The result is the percentage of the curve in each bin.
- Calculating the expected number of samples in each bin is as easy as multiplying the percentages of each bin by the sample size. Again, you can see from the descriptive statistics that the count for this set of data was 50.
- To calculate the Chi-Squared statistic, you’ll use both the expected number of items in each bin and the actual or observed number. Copy the observed numbers over from your histogram worksheet.
- Apply the following formula to each row and calculate the final numbers for each row as desired in Excel.
(expected – observed)2 / expected
- Add up the final numbers to get the Chi-Squared statistic, denoted by X. For our example, X is 18.9168
- To use the Chi-Squared statistic to find the p-Value, we also need one more item for the Excel formula to work: we need what is called the degrees of freedom.
Degrees of freedom = #bins – 1 – #calculated parameters
We have 14 bins. The parameters we used to arrive at the Chi-Squared statistic that we calculated from our sample were the mean and standard deviation: two parameters. For our example:
Degrees of freedom = 14 – 1 – 2 = 11
- Now that we have both the degrees of freedom (df), and the Chi-Squared value, we can use Excel to calculate the p-Value. Simply enter the formula below, inputting the correct values.
In the case of our example, the resulting p-Value is 0.062. Because the p-Value is greater than 0.05, we accept the null hypothesis (Ho). Our data is normal.