I was using Splunk to crawl my Apache logs and found something rather odd when analyzing the mean page size served. To keep the numbers meaningful, I limited the search results to only those lines that had a value for bytes served.
Splunk told me I was serving a mean page size of 33.3 KB, which is very reasonable, with a range of 1 to ~5 MB. The standard deviation, however, was 217.9 KB.
This strikes me as odd because it implies roughly a 44% probability [z(0) = -0.1528] that I'm serving a negative page size, which is not possible, especially given that I limited the data set to non-zero values. I understand it could be statistically possible, but I'd expect a page size of 0 to sit at least 2 standard deviations from the mean.
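For reference, that probability falls straight out of the normal-CDF arithmetic. This is just a sketch using the two numbers Splunk reported, assuming (as I did) that page sizes are normally distributed:

```python
from math import erf, sqrt

mean_kb = 33.3   # mean page size reported by Splunk
std_kb = 217.9   # standard deviation reported by Splunk

# z-score of a 0-byte response under a normal assumption
z = (0 - mean_kb) / std_kb

# standard normal CDF via the error function
p_negative = 0.5 * (1 + erf(z / sqrt(2)))

print(f"z(0) = {z:.4f}")                  # roughly -0.15
print(f"P(size < 0) = {p_negative:.1%}")  # roughly 44%
```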
Making strong inferences like that from the standard deviation only works when the data set follows the normal distribution. In cases like this, where the data really doesn't follow a normal distribution at all, standard deviation becomes little more than a heuristic.
If you want to look at the distribution yourself, you can go to the 'Advanced Charting' view and run this search:
<your search> bytes=* | chart count over bytes bins=300
There are probably a significant number of outliers at very high byte counts, and that's what's skewing your distribution. On my system I have to throw in a term like bytes<200000000 because I have enough outliers at the crazy-high end to completely throw off the chart.
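For anyone who wants to see what that search is doing under the hood, the binning and outlier filtering can be sketched in plain Python. The byte_counts list here is made-up illustrative data, not real log values:

```python
from collections import Counter

# hypothetical byte counts parsed from access-log lines,
# including one crazy-high outlier
byte_counts = [512, 34_000, 2_100, 250_000_000, 41_000, 900]

# drop the extreme outliers, like the bytes<200000000 term above
filtered = [b for b in byte_counts if b < 200_000_000]

# bucket into 300 bins, mirroring "chart count over bytes bins=300"
bin_width = max(filtered) // 300 + 1
histogram = Counter(b // bin_width for b in filtered)

# each key is a bin index; each value is how many responses fell in it
print(sorted(histogram.items()))
```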
At any rate, unless the chart literally looks like the bell curve of the normal distribution ( http://www.google.com/search?q=normal+distribution&hl=en&rlz=1C1CHFX_enUS396US396&prmd=ivns&tbm=isch... ), those probabilistic statements about standard deviation will not hold. Applied outside of normally distributed data, quite a lot of common statistics lose their meaning.
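You can see the effect with a quick simulation. This sketch uses synthetic data (mostly small "pages" plus a rare cluster of huge ones, loosely mimicking a heavy-tailed size distribution) to show that the "values below mean minus 2 standard deviations" statement breaks down entirely:

```python
import random

random.seed(42)

# mostly small pages (~30 KB typical) ...
sizes = [random.expovariate(1 / 30) for _ in range(100_000)]
# ... plus rare MB-scale outliers
sizes += [random.uniform(1_000, 5_000) for _ in range(1_000)]

n = len(sizes)
mean = sum(sizes) / n
std = (sum((x - mean) ** 2 for x in sizes) / n) ** 0.5

# under a normal assumption, ~2.3% of values would lie below mean - 2*std;
# here that bound is negative, so NO value can possibly fall below it
print(f"mean = {mean:.1f}, std = {std:.1f}")
print(f"mean - 2*std = {mean - 2 * std:.1f}")  # negative
assert min(sizes) > mean - 2 * std
```

The outliers inflate the standard deviation so much that the two-sigma lower bound drops below zero, exactly the situation described in the question.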
I would have expected this to be somewhat of a normal distribution (I failed to mention that the range struck me as odd as well). The logs Splunk chewed on are for the webserver only -- meaning only HTML (plus headers for redirects, 404s, etc.) is being served. We use Amazon CloudFront to distribute static assets, including images, video, scripts, and stylesheets. The CDN origin logs to a different file, and I did not import the origin logfiles, to avoid skewing the data.
The one-week logfile contained more than 1.5M lines, which is the very reason I've never tried to analyze this by hand 🙂