
Log Index Report vs Me

Explorer

I'm after some direction on the fluctuation of data input to Splunk, any help is greatly appreciated. I know this might seem a bit long, but I know some of you guys like a challenge 😉

In short, we have a licence for 500MB a day. Some quick maths on various samples of data shows that our average input is approx. 320-380MB. My problem is that we are getting spikes of up to 1.3GB, which regularly trips the licensing. I appreciate that some days some servers are more active, and that some logs for one of the inputs might not be copied across on time (i.e. counted in the next day's logs, effectively doubling the input), so not every day will index the same amount. But I'm struggling to find out why it is so extreme.

This brief example is taken from the daily email we receive and gives a good example.

  • 22 1/31/11 12:00:00.000 AM 196.035698
  • 23 1/30/11 12:00:00.000 AM 386.101424
  • 24 1/29/11 12:00:00.000 AM 514.113445
  • 25 1/28/11 12:00:00.000 AM 486.124525
  • 26 1/27/11 12:00:00.000 AM 366.082153
  • 27 1/26/11 12:00:00.000 AM 493.126798
  • 28 1/25/11 12:00:00.000 AM 452.111828
  • 29 1/24/11 12:00:00.000 AM 195.035355
  • 30 1/23/11 12:00:00.000 AM 282.080423
  • 31 1/22/11 12:00:00.000 AM 516.137922
  • 32 1/21/11 12:00:00.000 AM 1355.257701
  • 33 1/20/11 12:00:00.000 AM 1035.178305
  • 34 1/19/11 12:00:00.000 AM 493.123583
  • 35 1/18/11 12:00:00.000 AM 371.061438
  • 36 1/17/11 12:00:00.000 AM 181.032773
  • 37 1/16/11 12:00:00.000 AM 190.040985
  • 38 1/15/11 12:00:00.000 AM 391.090779

I know the average here is 464, but this includes our biggest spike to date.

My first problem is that I cannot verify these results.

My second is that I cannot get a confirmed breakdown of where they come from, i.e. is it one host that is spiking, a particular source, or are all the logs busier on one day?

The spikes sometimes come in pairs, sometimes singly. There does not appear to be any relation to the day of the week (for example, Monday being busier due to the lack of activity over the weekend), nor are extreme highs particularly followed by extreme lows.

Notes:

Splunk 4.1.6

Splunk is currently setup for:

46 files/directories

1 UDP source (though I'm not sure this is in use)

2 scripts

Folders are remotely mounted using CIFS

All file/directory inputs contain log files, one per day. After 14 days a script zips these, keeping the file name part the same. This technically creates a second file, but it contains the same data that was logged/indexed 14 days earlier. Most inputs were originally configured to watch for (log|zip)$; I have since changed this to (log)$ to cut down on duplicate indexing, but there was no noticeable change in the daily logging amount.
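As an aside on the zip rotation: Splunk monitor stanzas support whitelist/blacklist regexes, which is a tidier way to exclude the zipped copies than editing the watch pattern. A sketch in inputs.conf (the path here is hypothetical):

```
[monitor:///opt/logs/myapp]
# Index only the daily .log files; ignore the 14-day-old .zip copies
whitelist = \.log$
blacklist = \.zip$
```

With something like this in place, the zipped duplicates should never be considered for indexing, regardless of the file-watch pattern.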

I admit I'm no Splunk professional 🙂 and as such started off with some bash, but still to no avail. Here are some of the things I have tried so far...

--

Go through the inputs.conf file, parse out the folders, look through those folders for files in order of age, and then print each one in date order showing its size. This will not show the cumulative size, but should show the day-to-day variation for the same log source. However, it shows that, for each folder, files remain a similar size each day.

grep "disabled = false" /opt/splunk/etc/system/local/inputs.conf -B1 | sed 's/^\[monitor:\///g;s/.$//g' | grep "^//opt" | while read BOB ; do echo "------$BOB-----" >> ./sizes ; ls -ltr $BOB | awk '{d=c ; c=b ; b=a ; a=int($5/1024/1024) ; print (a-((b+c+d)/3)) }' >> ./sizes ; done

--

This query shows me the size of each folder for each input. I used it to find the three biggest folders, then changed the logging/indexing to only look at log files; however, as mentioned in the Notes above, this made no difference.

grep "disabled = false" /opt/splunk/etc/system/local/inputs.conf -B1  | egrep -v "disabled|^--$" | sed 's/^\[monitor:\///g;s/..$//g' | grep opt | xargs du -s | sort -n

--

This bash line looks through the input folders, then for each folder totals the size of files modified that day. I then compared this to the daily email from the Splunk app.

for j in " 5" " 6" " 7" " 8" " 9" "10" ; do echo $j ; grep "disabled = false" /opt/splunk/etc/system/local/inputs.conf -B1  | egrep -v "disabled|^--$" | sed 's/^\[monitor:\///g;s/..$//g' | grep opt | while read i ; do ls -lt $i | grep "Feb $j" | awk '{print $5}' ; done | awk '{ for ( i=1 ; i<=NF ; i=i+1 ) sum = sum + $i} ; END { print (sum/1024/1024)"MB" }' ; done

I've listed some output against the value reported by the Splunk daily report

date - my script total - Splunk report

  • 5 - 91.2574MB - 532
  • 6 - 34.9325MB - 218
  • 7 - 143.186MB - 142
  • 8 - 173.935MB - 271
  • 9 - 78.0626MB - 283
  • 10 - 270.455MB - 632

...not even close.

--

I then decided to use Splunk itself:

index="_internal" source="*metrics.log" per_source_thruput | mvexpand kb | eval counter=1 | rex mode=sed field=series "s/[a-zA-Z0-9\.]*$//g" | streamstats sum(counter) as instance by series | rex mode=sed "s/\/[^\/]*$//g" | chart first(kb) as kb by date_mday series

If you run 'show report' on this and make it stacked, it seems to show something that LOOKS right: a stacked chart with each day's summary and a breakdown of it per log. However, it's not accurate at all, in neither the actual values nor the shape of the graph.

--

I've installed the 'Splunk Licence Usage' application... which looks great; however, I can't alter any of the charts to break down per source per day to show which is causing the huge spikes (and for the moment, though this is the first time I've seen this, some days are missing from the stacked chart, as though there was nothing for that day).

--

I've grepped through the metrics logs (/opt/splunk/var/log/splunk/) for breakdowns; however, I'm not entirely sure what the values mean, as some entries read 'kb=4.415039'... 4415 bytes and 0.039 of a byte? I've ignored this problem and done the maths... it's still not adding up.
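For what it's worth, the kb field in metrics.log appears to be a decimal count of kilobytes, so kb=4.415039 would be roughly 4521 bytes rather than 4415. A rough sketch of summing those kb values per source from the per_source_thruput lines (the sample log lines, paths, and numbers below are all invented; the real format may differ slightly):

```shell
# Hypothetical sample of per_source_thruput lines in the metrics.log style.
cat > /tmp/metrics_sample.log <<'EOF'
02-10-2011 00:01:12.345 INFO  Metrics - group=per_source_thruput, series="/opt/logs/app/a.log", kb=4.415039, ev=12
02-10-2011 00:01:12.345 INFO  Metrics - group=per_source_thruput, series="/opt/logs/app/b.log", kb=10.5, ev=30
02-10-2011 00:02:12.345 INFO  Metrics - group=per_source_thruput, series="/opt/logs/app/a.log", kb=2.0, ev=5
EOF

# Sum the kb field per source and print the totals, sorted by path.
awk -F'kb=' '/group=per_source_thruput/ {
  split($1, s, "series=");      # left half still contains series="..."
  split(s[2], name, "\"");      # the source path sits inside the quotes
  split($2, v, ",");            # the kb value ends at the next comma
  total[name[2]] += v[1]
}
END { for (src in total) printf "%s %.2f KB\n", src, total[src] }' \
  /tmp/metrics_sample.log | sort | tee /tmp/kb_by_source.txt
```

Totalling per source like this, rather than per file, should make a mismatch against the daily report easier to localise.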

--

Maybe my biggest query at this point is this: the saved search 'Log Index Report (30 Days)', which emails us every day, appears to use the value Splunk itself uses to verify whether we have exceeded our licence:

index=_internal todaysBytesIndexed LicenseManager-Audit NOT source=*web_service.log | eval Daily_Indexing_Volume_in_MBs = todaysBytesIndexed/1024/1024 | timechart avg(Daily_Indexing_Volume_in_MBs) by host|sort -_time

How does it arrive at the value 'todaysBytesIndexed' for each of these results? If this is the official value, how do I tell (a) how this figure is calculated, and (b) what the breakdown of this figure is per source (not sourcetype; I need to know which server(s) are wildly fluctuating)? I have found nothing that shows an accurate breakdown which, when tallied, comes even close to the amount we are told.
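On (a): todaysBytesIndexed appears to be a plain byte counter, and the saved search's eval simply converts it to MB. A quick sanity check of that arithmetic (the byte count below is invented):

```shell
# A hypothetical day's todaysBytesIndexed value; a 500MB licence is
# 500 * 1024 * 1024 = 524288000 bytes.
bytes=557841203
mb=$(awk -v b="$bytes" 'BEGIN { printf "%.2f", b / 1024 / 1024 }')
echo "$bytes bytes indexed = $mb MB"
```

So a day the report shows as ~532 would correspond to roughly 558 million bytes indexed, comfortably over a 500MB licence.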

I have tried other queries, but unfortunately I've not recorded all of them, so I can't repeat them here.

Any ideas or suggestions are very welcome, as I've spent over 15 hours trying to track this down and have got nowhere. 😞

Many Thanks

Andy

2 Solutions

SplunkTrust

Have you looked at the 'indexing volume' view that Splunk ships right in the search app? I wrote this view while I was at Splunk and I can tell you a little about it.

In the search app go to 'Status > Indexing Activity > Indexing Volume'.

In that view you can see a breakdown of indexing volume by source, host, sourcetype or index, and you can then drill into particular sources, indexes, sourcetypes etc. to see a timechart that helps you understand why. You can then drill into points on that timechart to see sample events that were indexed during that time.

This can be a very powerful tool for understanding spikes in your volume.


Splunk Employee

You should be able to run:

index=_internal source=*metrics.log group=per_source_thruput | timechart span=1d limit=0 sum(kb) by series

and see a table and chart it. Similarly, you can replace per_source_thruput with per_host_thruput.

You should also be able to go to "Status", "Index activity", "Indexing volume" page ( http://mysplunkserver:8000/app/search/indexing_volume ) and see this data too.




Explorer

Replacing it with per_host_thruput worked a treat! Exactly what I was after! I'm still interested in the 'Indexing Volume' app, but while I wait for that this query is spot on... for the first time in weeks I can see what is going on, lol (and I'm surprised I've managed to miss this one, it looks so obvious in hindsight 😉). Many Thanks



Explorer

Sorry, Nick, I'm not allowed to accept two answers, nor am I allowed to rate your query 😞. Trust me, it's very much appreciated. Andy


Explorer

Thanks for the input, Nick.
If I write down the values from the 'Index Volume' app, I get answers that match the email report very closely, on average about 1MB out per day (I've not counted partial MB, so I expect a little variance). There is one day that is 198MB out (532 in the email, 334 in this app).

I've tried to drill down both to check the breakdown and to investigate the 10th Feb; unfortunately, however, the licence has once again tripped over. I've requested a reset licence, and when this comes I will try to get more info. So far the app looks very useful; I'm looking forward to getting a key 🙂


Splunk Employee
Splunk Employee

Hi Olas, Thanks!
