I'm after some direction on the fluctuation of data input to Splunk; any help is greatly appreciated. I know this might seem a bit long, but I know some of you guys like a challenge 😉
In short, we have a licence for 500MB a day. Some quick maths on various samples of data shows that our average input is approx 320-380MB. My problem is that we are getting spikes of up to 1.3GB, which regularly upsets the licensing. I appreciate that some days some servers are more active, and that logs for one of the inputs might not be copied across on time (i.e. they get counted in the next day's logs, effectively doubling that day's input), so not every day will index the same amount. But I'm struggling to find out why the variation is so extreme.
This brief example, taken from the daily email we receive, gives a good illustration (the columns are row number, date, time and MB indexed):
22 1/31/11 12:00:00.000 AM 196.035698
23 1/30/11 12:00:00.000 AM 386.101424
24 1/29/11 12:00:00.000 AM 514.113445
25 1/28/11 12:00:00.000 AM 486.124525
26 1/27/11 12:00:00.000 AM 366.082153
27 1/26/11 12:00:00.000 AM 493.126798
28 1/25/11 12:00:00.000 AM 452.111828
29 1/24/11 12:00:00.000 AM 195.035355
30 1/23/11 12:00:00.000 AM 282.080423
31 1/22/11 12:00:00.000 AM 516.137922
32 1/21/11 12:00:00.000 AM 1355.257701
33 1/20/11 12:00:00.000 AM 1035.178305
34 1/19/11 12:00:00.000 AM 493.123583
35 1/18/11 12:00:00.000 AM 371.061438
36 1/17/11 12:00:00.000 AM 181.032773
37 1/16/11 12:00:00.000 AM 190.040985
38 1/15/11 12:00:00.000 AM 391.090779
I know the avg here is 464MB; however, this includes our biggest spike to date.
My first problem is that I cannot verify these results.
My second is that I cannot get a confirmed breakdown of where they come from, i.e. is it one host that is spiking, or a particular source, or are all the logs busier on one day?
The spikes sometimes come in pairs, sometimes singly. There does not appear to be any relation to the day of the week, for example Monday being busier due to the lack of activity over the weekend. Nor are extreme highs consistently followed by extreme lows.
Notes:
Splunk 4.1.6
Splunk is currently set up with:
46 files/directories
1 UDP source (though I'm not sure this is in use)
2 scripts
Folders are remotely mounted using CIFS
All file/directory inputs contain log files, one per day. After 14 days a script zips these, keeping the file name part the same. This technically creates a second file, but it contains the same data that was logged/indexed 14 days ago. Most inputs were originally configured to watch for (log|zip)$; I have changed this to (log)$ to cut down on duplicate indexing, but there was no noticeable change in the daily logging amount.
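As a sanity check I plan to confirm from the search bar whether any of the zipped copies have actually been indexed alongside the .log originals. Something like the following should list them (I'm assuming the data goes into the default main index, so adjust index= to suit):
index=main source="*.zip" | stats count by source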
I admit I'm no Splunk professional 🙂, so I started off with some bash, but to no avail. Here are some of the things I have tried so far...
--
Go through the inputs.conf file, parse out the monitored folders, list the files in each folder by age, and then print, for each file in date order, how its size (in MB) compares with the average of the previous three files. This will not show the cumulative size, but should show the day-to-day variation for the same log source. It shows that, for each folder, the files remain a similar size each day.
grep "disabled = false" /opt/splunk/etc/system/local/inputs.conf -B1 | sed 's/^[monitor:///g;s/.$//g' | grep "^//opt" | while read BOB ; do echo "------$BOB-----" >> ./sizes ; ls -ltr $BOB | awk '{d=c ; c=b ; b=a ; a=int($5/1024/1024) ; print (a-((b+c+d)/3)) }' >> ./sizes ; done
--
This query shows me the total size of each folder for each input. I used it to find the three biggest folders, then changed the inputs to index only the log files; however, as mentioned in the Notes above, this made no difference.
grep "disabled = false" /opt/splunk/etc/system/local/inputs.conf -B1 | egrep -v "disabled|^--$" | sed 's/^\[monitor:\///g;s/..$//g' | grep opt | xargs du -s | sort -n
--
This bash loop goes through the input folders and, for each of several days, totals the size of the files modified on that day. I then compared the results to the daily email from the Splunk app.
for j in " 5" " 6" " 7" " 8" " 9" "10" ; do echo $j ; grep "disabled = false" /opt/splunk/etc/system/local/inputs.conf -B1 | egrep -v "disabled|^--$" | sed 's/^\[monitor:\///g;s/..$//g' | grep opt | while read i ; do ls -lt $i | grep "Feb $j" | awk '{print $5}' ; done | awk '{ for ( i=1 ; i<=NF ; i=i+1 ) sum = sum + $i} ; END { print (sum/1024/1024)"MB" }' ; done
I've listed some output against the value reported by the Splunk daily report
date - my script total - Splunk report (MB)
5 - 91.2574MB - 532
6 - 34.9325MB - 218
7 - 143.186MB - 142
8 - 173.935MB - 271
9 - 78.0626MB - 283
10 - 270.455MB - 632
...not even close.
--
I then decided to use Splunk
index="_internal" source="*metrics.log" per_source_thruput | mvexpand kb | eval counter=1 | rex mode=sed field=series "s/[a-zA-Z0-9\.]*$//g" | streamstats sum(counter) as instance by series | rex mode=sed "s/\/[^\/]*$//g" | chart first(kb) as kb by date_mday series
If you run 'show report' on this and make it stacked, it seems to show something that LOOKS right: a stacked chart with each day's total and a breakdown of that per log. However, it's not accurate at all, neither in the actual values nor in the shape of the graph.
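My next thought is to sum the kb values per day rather than take first(kb), with something along these lines (this assumes the kb field in each per_source_thruput event is the volume indexed for that source during the sampling interval; I also gather metrics.log only records the busiest series in each interval, so the totals may come up short):
index=_internal source="*metrics.log" group=per_source_thruput | eval MB=kb/1024 | timechart span=1d sum(MB) by series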
--
I've installed the 'Splunk Licence Usage' application... which looks great, however I can't alter any of the charts to break down per source per day to show what is causing the huge spikes (and for the moment, though this is the first time I've seen it, some days are missing from the stacked chart, as though there was nothing for that day).
--
I've grepped through the metrics logs (/opt/splunk/var/log/splunk/) for a breakdown, however I'm not entirely sure how to read the entries; some show 'kb=4.415039', which I take to be fractions of a kilobyte. I've set that aside and done the maths anyway... it still doesn't add up.
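As a cross-check against the daily email figure, I'm also trying to total the per-index throughput from metrics.log for each day, on the assumption that kb really is kilobytes and that the internal indexes (which I believe don't count towards the licence) should be excluded:
index=_internal source="*metrics.log" group=per_index_thruput series!=_* | eval MB=kb/1024 | timechart span=1d sum(MB)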
--
Maybe my biggest question at this point is about the saved search 'Log Index Report (30 Days)' which emails us every day, and whether this is the value Splunk uses to verify if we have exceeded our licence:
index=_internal todaysBytesIndexed LicenseManager-Audit NOT source=*web_service.log | eval Daily_Indexing_Volume_in_MBs = todaysBytesIndexed/1024/1024 | timechart avg(Daily_Indexing_Volume_in_MBs) by host|sort -_time
How does it arrive at the value of 'todaysBytesIndexed' for each of these results? If this is the official figure, how do I tell a) how it is calculated, and b) what the breakdown of it is per source (not source type; I need to know which server(s) are fluctuating so wildly)? I have found nothing that shows a breakdown of the data which, when tallied, comes even close to the amount we are told.
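Since it's the hosts I most need to separate out, the closest thing I've come up with so far is the per-host equivalent of the metrics.log search above (with the same sampling caveat), though I have no idea how closely it corresponds to whatever todaysBytesIndexed measures:
index=_internal source="*metrics.log" group=per_host_thruput | eval MB=kb/1024 | timechart span=1d sum(MB) by series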
I have tried other queries too; unfortunately I've not recorded them all, so I can't repeat them here.
Any ideas or suggestions are very, very welcome, as I've spent over 15 hours trying to track this down so far and have got nowhere. 😞
Many Thanks
Andy