Friday, January 19, 2007

Standard Deviation When Interpreting Web Analytics


One of my favorite blogs, Good Math, Bad Math, has an excellent article describing standard deviation as it relates to the mean of the data.

The mean is more commonly called the average. It's calculated by adding up all the data points in the population, then dividing by the number of data points. A simple example would be the data set (1, 2, 3, 4, 5). The sum of this population equals 15. There are 5 data points in this population, so the mean is 15/5 = 3.
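As a quick check of the arithmetic above, here is a minimal Python sketch. The variable names are just for illustration, and the standard library's `statistics.mean` is only used to confirm the hand calculation.

```python
from statistics import mean

# The example population from the paragraph above.
data = [1, 2, 3, 4, 5]

total = sum(data)        # sum of all data points
count = len(data)        # number of data points
average = total / count  # the mean

# The standard library agrees with the hand calculation.
assert average == mean(data)

print(total, count, average)  # 15 5 3.0
```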

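A fancier way to put it would be the following formula:

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

where N is the number of data points and x_i is the i-th data point.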
Through web analytics we can see the average visitors per period of time: daily, weekly, monthly. However, measuring by the mean alone can be deceptive. The mean doesn't describe some of the features of the data that matter most in determining what the analytics mean.

The mean won't give you information on the low points or the high points, nor will it tell you how the mean relates to the rest of the data. In the earlier data set, the relationship between the data points was very easy to determine. In web analytics, those relationships can be a little trickier.
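To see how much the mean can hide, here is a small Python sketch. Both lists are hypothetical daily page-view counts invented for this illustration (they are not my real numbers), and `summarize` is just a helper name I made up. The two weeks share the same mean but have very different low and high points.

```python
# Hypothetical daily page-view counts, invented for illustration only.
steady_week = [40, 41, 42, 42, 42, 43, 44]
spiky_week = [4, 10, 20, 30, 40, 66, 124]


def summarize(values):
    """Return the low point, high point, and mean of a list of numbers."""
    return {
        "low": min(values),
        "high": max(values),
        "mean": sum(values) / len(values),
    }


steady = summarize(steady_week)
spiky = summarize(spiky_week)

print(steady)  # {'low': 40, 'high': 44, 'mean': 42.0}
print(spiky)   # {'low': 4, 'high': 124, 'mean': 42.0}
```

Looking at the mean alone, the two weeks are indistinguishable. The low and high points are what show that the second week contains a spike.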

I'll take an example that's near to me: my own web analytics.

Currently, I average 42 page views per day. This means that, on average, 42 of my pages are viewed each day; a page view is not the same as a site visit. My low point is 4 page views in a single day and my high point is 124.

From this, we can tell that there was most likely a spike in my page views at some point. Because the largest data point is nearly three times the mean, we can start with that presumption. However, to get more information, we must take the standard deviation into account. It is defined as a measure of the spread of a data set's values; it is also the square root of the variance. (From Wikipedia)
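The "square root of the variance" definition can be sketched in a few lines of Python. This is a minimal illustration using the simple (1, 2, 3, 4, 5) data set from earlier and treating it as a whole population (dividing by N rather than N - 1); the function name `population_std_dev` is my own.

```python
import math
from statistics import pstdev


def population_std_dev(values):
    """Standard deviation of a whole population: sqrt of the variance."""
    mu = sum(values) / len(values)
    # Variance: the average squared distance of each value from the mean.
    variance = sum((x - mu) ** 2 for x in values) / len(values)
    return math.sqrt(variance)


data = [1, 2, 3, 4, 5]
sigma = population_std_dev(data)

# The standard library's population standard deviation agrees.
assert math.isclose(sigma, pstdev(data))

print(round(sigma, 4))  # 1.4142
```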

Each differently colored area marks one standard deviation. Each section is the same width, but not the same area under the curve. For a bell-shaped (normal) distribution, this means that most of the data, about 68%, falls within one standard deviation of the mean.

In my case, my standard deviation is calculated as 9.4.

What this means is that, by Chebyshev's Inequality:

At least 50% of the values are within 1.4 standard deviations from the mean.
At least 75% of the values are within 2 standard deviations from the mean.
At least 89% of the values are within 3 standard deviations from the mean.
At least 94% of the values are within 4 standard deviations from the mean.
At least 96% of the values are within 5 standard deviations from the mean.
At least 97% of the values are within 6 standard deviations from the mean.
At least 98% of the values are within 7 standard deviations from the mean.
At least 1 - 1/k^2 of the values are within k standard deviations from the mean.

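The percentages in the list above all come from the 1 - 1/k^2 rule. Here is a minimal Python sketch that applies the rule to my own numbers (a mean of 42 page views and a standard deviation of 9.4). The function names are hypothetical helpers, and the bound is only informative for k greater than 1.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k**2


def chebyshev_interval(mean, std_dev, k):
    """Range of values lying within k standard deviations of the mean."""
    return (mean - k * std_dev, mean + k * std_dev)


mean_views = 42      # average daily page views from this post
std_dev_views = 9.4  # standard deviation from this post

for k in (2, 3):
    low, high = chebyshev_interval(mean_views, std_dev_views, k)
    share = chebyshev_lower_bound(k)
    print(f"k={k}: at least {share:.0%} of days fall between "
          f"{low:.1f} and {high:.1f} page views")
```

For k = 2 this gives a range of 23.2 to 60.8 page views covering at least 75% of days, and for k = 3 a range of 13.8 to 70.2 covering at least 89%.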
When I apply this information to web analytics, one of the things I do is look at the geographic distribution of the users. When I find hubs of higher consumer activity, I start getting a clearer idea of who my users are. This could help me target my paid search campaign more accurately. It could also tell me that, if I provide content, analysis or a blog, a nice mention of something applicable and interesting in their area might be appropriate.

The standard deviation is a powerful method to segment your analytics into greater specificity. When Chebyshev's Inequality shows you that at least 75% of the data is within two standard deviations of the mean, you have some focused and applicable data to improve your messaging and targeting.

2 comments:

Author said...

Standard Deviation

Standard Deviation Calculator

Naomi said...

How do you get the web analytics program to give you the standard deviation for the metrics they report? For instance, Google Analytics reports average time on site for any given time period, but doesn't report the standard deviation, as far as I can tell. Do you know of any package that does?