## Descriptive Statistics from Splunk

Splunk Employee

Hello,

I was using Splunk to crawl my Apache logs and I found something rather odd analyzing the mean page size served. To better qualify this, I limited the search results to only lines that had a value for bytes served.

Splunk told me that I was serving a mean page size of 33.3 KB, which is very reasonable, with a range of 1 to ~5 MB. The standard deviation, however, was 217.9 KB.

This strikes me as odd because, if the data were normally distributed, it would imply roughly a 44% probability [z(0) = -0.1526] that I'm serving a negative page size, which is not possible, especially given that I limited the data set to non-zero values. I understand the numbers could be statistically valid, but I'd expect a page size of 0 to sit at least 2 standard deviations from the mean.
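For reference, the arithmetic behind that figure can be reproduced with a short Python sketch. It assumes a normal model (which is exactly the assumption in question) and uses only the rounded mean and standard deviation quoted above, so the z-score comes out close to, but not exactly, the value in brackets. The variable names are illustrative.

```python
from statistics import NormalDist

mean_kb = 33.3    # mean page size reported above, in KB
stdev_kb = 217.9  # standard deviation reported above, in KB

# z-score of a 0 KB response under an assumed normal model
z_zero = (0 - mean_kb) / stdev_kb

# Probability the assumed normal model places below zero
p_negative = NormalDist(mu=mean_kb, sigma=stdev_kb).cdf(0)

print(f"z(0) = {z_zero:.4f}")
print(f"P(size < 0) = {p_negative:.1%}")
```

The point of the calculation is only that a standard deviation several times larger than the mean forces a normal model to put a large share of its mass below zero.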

Any insights?

SplunkTrust

Making strong inferences like that from the standard deviation only works when the data set follows a normal distribution. In cases like this, where the data really doesn't follow a normal distribution at all, the standard deviation becomes little more than a heuristic.

UPDATE:
If you want to look at the distribution yourself, you can go to the 'Advanced Charting' view and run this search:

```
<your search> bytes=* | chart count over bytes bins=300
```

There are probably a significant number of outliers at a very high number of bytes, and that's what's skewing your distribution. On my system I have to throw in a term that says `bytes<200000000` because I have enough outliers at the crazy-high end to completely throw off the chart.

At any rate, unless the chart literally looks like the curve of a normal distribution ( http://www.google.com/search?q=normal+distribution&hl=en&rlz=1C1CHFX_enUS396US396&prmd=ivns&tbm=isch... ), those probabilistic statements about standard deviation will not be 'true'. When applied outside of normally distributed data, quite a lot of common statistics lose their usual meaning.
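To illustrate the effect, here is a minimal Python sketch using synthetic data, not the poster's logs. It assumes, purely for illustration, that sizes follow a log-normal distribution with arbitrary parameters: strictly positive, with a long tail of large outliers. It then compares what a normal model predicts for negative sizes against what is actually in the sample.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic, strictly positive, right-skewed "page sizes".
# The log-normal shape and its parameters are assumptions for illustration.
sizes = [random.lognormvariate(3.0, 1.5) for _ in range(50_000)]

sample_mean = mean(sizes)
sample_stdev = stdev(sizes)

# Share of values a normal model with the same mean/stdev puts below zero
predicted_negative = NormalDist(sample_mean, sample_stdev).cdf(0)

# Share of values that are actually below zero in the sample
observed_negative = sum(x < 0 for x in sizes) / len(sizes)

print(f"mean = {sample_mean:.1f}, stdev = {sample_stdev:.1f}")
print(f"normal model predicts {predicted_negative:.1%} below zero")
print(f"observed below zero: {observed_negative:.1%}")
```

On data shaped like this, the standard deviation exceeds the mean and the normal model predicts a sizeable fraction of negative values, even though every value in the sample is positive.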

Splunk Employee

Hi Nick,

I would expect this to be somewhat of a normal distribution (I failed to mention the range struck me as odd as well). The logs on which Splunk chewed are for the webserver only -- meaning only HTML (and headers for redirects, 404s, etc) are being served. We use Amazon CloudFront to distribute static assets including images, video, scripts, and stylesheets. The CDN origin logs to a different file and I did not import the origin logfiles to avoid skewing the data.

The one-week logfile contained more than 1.5 million lines, which is the very reason I've never tried to analyze this by hand 🙂
