Hello,
I was using Splunk to crawl my Apache logs and I found something rather odd analyzing the mean page size served. To better qualify this, I limited the search results to only lines that had a value for bytes served.
Splunk told me that I was serving a mean page size of 33.3 KB which is very reasonable and a range of 1-~5 MB. The standard deviation, however, was 217.9 KB.
This strikes me as odd because it means there's roughly a 44% probability [z(0) = -0.1526] that I'm serving a negative page size, which is not possible, especially given that I limited the data set to non-zero values. I understand it could be statistically possible, but I'd expect a page size of 0 to sit at least 2 standard deviations from the mean.
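For reference, the 44% figure can be reproduced from the rounded numbers quoted above. This is only a minimal Python sketch: it assumes the sizes are normally distributed (the very assumption in question), and the variable names are illustrative. With the rounded inputs the z-score comes out within rounding distance of the value quoted above.

```python
import math

mean_kb = 33.3    # mean page size reported by Splunk
stdev_kb = 217.9  # standard deviation reported by Splunk

# z-score of a 0 KB page relative to the reported mean
z = (0 - mean_kb) / stdev_kb

# Standard normal CDF at z, written with the error function
p_below_zero = 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(f"z = {z:.4f}, P(size < 0) = {p_below_zero:.1%}")
```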
Any insights?
Making strong inferences like that about standard deviation only works when the data set follows the normal distribution. In cases like this, where the data really doesn't follow a normal distribution at all, standard deviation becomes little more than a heuristic.
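To make that concrete, here is a small Python sketch using made-up numbers (not your logs): a heavily skewed set of strictly positive sizes whose standard deviation is far larger than its mean. The 990/10 split and the sizes are assumptions chosen purely for illustration.

```python
from statistics import NormalDist, mean, pstdev

# Hypothetical sizes in KB: 990 small pages and 10 very large responses
sizes_kb = [10] * 990 + [2500] * 10

mu = mean(sizes_kb)
sigma = pstdev(sizes_kb)  # population standard deviation

# What a normal model with this mean and deviation would claim
predicted_below_zero = NormalDist(mu, sigma).cdf(0)

# What the data actually contains
actual_below_zero = sum(1 for s in sizes_kb if s < 0) / len(sizes_kb)

print(f"mean={mu:.1f} KB, stdev={sigma:.1f} KB, min={min(sizes_kb)} KB")
print(f"normal model: {predicted_below_zero:.1%} below zero; actual: {actual_below_zero:.1%}")
```

Every value is positive, yet a normal model fitted to the same mean and standard deviation would put a large share of the data below zero. The model is what's wrong, not the data.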
UPDATE:
If you want to look at the distribution yourself, you can go to the 'Advanced Charting' view and run this search:
<your search> bytes=* | chart count over bytes bins=300
There are probably a significant number of outliers at a very high number of bytes, and that's what's skewing your distribution. On my system I have to throw in a term that says bytes<200000000 because I have enough outliers at the crazy-high end to completely throw off the chart.
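The effect of capping the outliers before binning can be sketched outside Splunk as well. This Python snippet is only an illustration, not how Splunk implements its bins: `histogram`, the sample values, and the cap are all hypothetical.

```python
from collections import Counter

def histogram(values, bins, cap):
    """Count values below `cap` in `bins` equal-width buckets."""
    kept = [v for v in values if v < cap]
    if not kept:
        return []
    lo, hi = min(kept), max(kept)
    width = (hi - lo) / bins or 1  # avoid zero width when all values are equal
    # The maximum value would land one past the end, so clamp it into the last bucket
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in kept)
    return [(lo + i * width, counts.get(i, 0)) for i in range(bins)]

# Hypothetical byte counts: four ordinary pages and one huge outlier
sizes = [100, 200, 300, 400, 5_000_000]

uncapped = histogram(sizes, bins=3, cap=float("inf"))
capped = histogram(sizes, bins=3, cap=1000)
print(uncapped)
print(capped)
```

Without the cap, the single outlier stretches the buckets so far that every ordinary page lands in the first one. With the cap, the ordinary pages spread across the buckets and the shape becomes visible.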
At any rate, unless the chart literally looks like the curve of a normal distribution ( http://www.google.com/search?q=normal+distribution&hl=en&rlz=1C1CHFX_enUS396US396&prmd=ivns&tbm=isch... ), those probabilistic statements about standard deviation will not be 'true'. When applied outside of normally distributed data, quite a lot of common statistics lose their meaning.
Hi Nick,
I would expect this to be somewhat of a normal distribution (I failed to mention the range struck me as odd as well). The logs on which Splunk chewed are for the webserver only -- meaning only HTML (and headers for redirects, 404s, etc) are being served. We use Amazon CloudFront to distribute static assets including images, video, scripts, and stylesheets. The CDN origin logs to a different file and I did not import the origin logfiles to avoid skewing the data.
The one-week logfile contained more than 1.5m lines and is the very reason I've never tried to analyze this by hand 🙂