Hi, we are trying to limit the maxKBps of a couple of forwarders to 30 KBps. We are doing this because the app on those servers keeps misbehaving and logging gigabytes upon gigabytes per hour, which puts us in violation of our license.
In limits.conf in the etc/system/local directory I have specified the following stanza:
[thruput]
maxKBps = 30
(30 KBps is more than enough to let this server's logs get indexed at a good pace while not breaking the bank when the app screws up.)
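As a sanity check that the stanza is actually being read, btool can show the effective [thruput] settings and which file each one comes from (the path below assumes a default install location):
{code}
# Show the effective [thruput] stanza and the file each setting is read from
$SPLUNK_HOME/bin/splunk btool limits list thruput --debug
{/code}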
Now, when I parse through our log files, I keep seeing KBps thruput higher than this:
{code}
grep tcpout_connections /lcl/logs/splunk/metrics.log | awk '{print $12}' | tail -17
_tcp_KBps=0.01,
_tcp_KBps=0.01,
_tcp_KBps=0.01,
_tcp_KBps=0.01,
_tcp_KBps=0.07,
_tcp_KBps=0.01,
_tcp_KBps=0.01,
_tcp_KBps=0.01,
_tcp_KBps=0.71,
_tcp_KBps=162.86,
_tcp_KBps=289.25,
_tcp_KBps=284.02,
_tcp_KBps=303.00,
_tcp_KBps=307.52,
_tcp_KBps=307.61,
_tcp_KBps=303.70,
_tcp_KBps=303.26,
{/code}
This is not an isolated incident; I have seen it shoot up to 912, 168, etc. It shouldn't go higher than 30 KBps. Any assistance or input on this matter would be appreciated. (And no, we don't want to go to a lightweight forwarder; we just want the thruput limited.)
Did you enable compression? I'm pretty sure the "tcpout_connections" log lines show figures from before compression, and I don't know whether maxKBps limits the forwarder thruput before or after compression. Did you also try to measure the thruput using independent tools like iptables or netstat?
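For example, iptables byte counters give a rough on-the-wire reading, independent of what metrics.log reports (the port below is just a placeholder for whatever port your forwarder sends on; run as root):
{code}
# Count outbound bytes to the indexer's receiving port (9997 here is an assumption)
iptables -I OUTPUT -p tcp --dport 9997 -j ACCEPT

# Read the counters, then zero them; the byte column divided by the
# sampling interval gives an approximate KBps figure on the wire
iptables -L OUTPUT -v -n -x
iptables -Z OUTPUT
{/code}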
We're OK with the data taking longer. The problem we're having is that it can burn through SO much of our license volume in the span of 30 minutes (the interval between runs of our monitoring). We know what it's doing, and we know what the limitations and drawbacks of our solution are. The only problem we have is that the stanza is NOT working the way the documentation says it should. I'm going to take this directly to customer service.
Lowell is right: limiting the thruput does not solve the problem and will just cause your forwarders to fall over when their queues fill up. Filtering out garbage events and setting alerts for when thruput spikes dangerously is how you should be addressing this.
By limiting how much data a forwarder can send, you are just slowing down the flow of data. The events from the spike will still need to be sent (it's just going to take longer for them to get there).
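As a rough sketch of that kind of alert, something like the following could be saved and scheduled on the indexer to fire when any forwarder spikes (the 30 KBps threshold is just an example, and the field names are borrowed from the metrics search further down):
{code}
index=_internal source=*metrics* "group=tcpin_connections"
| rename _tcp_KBps as tcp_KBps
| stats max(tcp_KBps) as peak_KBps by sourceHost
| where peak_KBps > 30
{/code}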
I'm not sure this approach will work, fundamentally. If you are consistently producing more logs than you can index, simply slowing down the transfer rate (or placing a cap on it) will only cause the forwarder in question to fall further and further behind real time, and it will not ultimately change your total indexed volume once the indexing catches up. You'd have to actually push the indexing time across days for this to make a difference to your daily licensing level, which means your events are going to be rather old and out of date by the time they get into Splunk.
There may be other approaches, like filtering out unwanted/unhelpful events. I've seen truncating stack traces to 2k make an enormous difference in volume for some of our Java-based application servers (the stack traces weren't very helpful to begin with). In other cases we've had runaway (endless-loop) processes produce massive logs, in which case I set up a Splunk alert to email our support team and get the client app shut down. (Normally, Splunk indexes so quickly that deleting the offending log file can't be done fast enough; that's one place where limiting the volume could help.) The bottom line is that we actually reduce the amount of data being indexed, rather than simply slowing it down.
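For illustration, a rough sketch of both ideas in props.conf/transforms.conf, applied wherever parsing happens (the indexer or a heavy forwarder); the sourcetype name and regex are placeholders, not anything from a real deployment:
{code}
# props.conf -- cap event length and route noisy events to the null queue
[my_app_logs]
TRUNCATE = 2048
TRANSFORMS-drop_noise = drop_noise

# transforms.conf -- anything matching the regex is discarded before indexing
[drop_noise]
REGEX = TRACE|DEBUG
DEST_KEY = queue
FORMAT = nullQueue
{/code}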
BTW, are you sure your grep is getting you the right information? I don't see you searching for a specific forwarder host. In case you didn't know, you can look at this same info inside Splunk with a search like this:
{code}
index=_internal source=*metrics* "group=tcpin_connections" | rename _tcp_KBps as tcp_KBps | table sourceHost, kb, tcp_KBps
{/code}
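Or, narrowed down to a single forwarder and charted over time (substitute the hostname or IP of the forwarder you care about for the placeholder value):
{code}
index=_internal source=*metrics* "group=tcpin_connections" sourceHost=myforwarder01
| rename _tcp_KBps as tcp_KBps
| timechart span=5m max(tcp_KBps)
{/code}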
I have a workaround in place that parses the logs every 5 minutes, then shuts the forwarder down and emails us if it goes above 30 KBps. But Splunk shouldn't be letting itself go above 30 KBps in the first place. And we don't care if the end indexed data is the same, as long as it doesn't happen fast enough to violate our 20 GB/day license.
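For anyone who wants to do something similar, here is a minimal sketch of that kind of check; the paths, threshold, and email address are placeholders, not the actual script:
{code}
#!/bin/sh
# Cron this every 5 minutes: read the latest tcpout_connections metrics line,
# and if the reported _tcp_KBps exceeds the limit, stop the forwarder and email.
LIMIT=30
METRICS=/lcl/logs/splunk/metrics.log
SPLUNK_HOME=/opt/splunk

# Pull the most recent _tcp_KBps value from metrics.log
latest=$(grep tcpout_connections "$METRICS" | tail -1 | sed 's/.*_tcp_KBps=\([0-9.]*\).*/\1/')

# Floating-point comparison via awk; exit status 0 means "over the limit"
if awk -v v="$latest" -v l="$LIMIT" 'BEGIN { exit !(v > l) }'; then
    "$SPLUNK_HOME/bin/splunk" stop
    echo "Forwarder stopped: _tcp_KBps=$latest exceeded $LIMIT KBps" | mail -s "Splunk forwarder throttled" ops@example.com
fi
{/code}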
We don't care about getting old data - if the team watching those logs finds they're getting old data, they'll know it's because their app has screwed up, and they will fix it accordingly. We just don't want our license to be violated; that's our bottom line, and 30 KBps will accomplish it. To put it in perspective, we have had this server suddenly forward 30 GB in an hour. We just want to tell Splunk it can't forward more than 30 KBps; we know the implications of this. The only problem is, Splunk isn't doing what the documentation says this stanza will do.
{code}
08-19-2011 11:39:48.800 INFO Metrics - group=tcpout_connections, splunksit:10.226.18.60:7759:0, sourcePort=8089, destIp=10.226.18.60, destPort=7759, _tcp_Bps=115254.52, _tcp_KBps=112.55, _tcp_avg_thruput=112.55, _tcp_Kprocessed=3264, _tcp_eps=113.52
{/code}
This is an example of what the grep command returns - it's much easier than running the search on the indexer every 5 minutes, and it's quite accurate, since it comes from metrics.log itself.