Our organization is evaluating Splunk. When getting to the root cause, we'd like to understand examples of where your organization is forced to 1) do manual collection of diagnostic data (logs, traces) outside of Splunk and/or 2) export the data collected by Splunk and analyze it in another tool. Please provide 1) specifics on why you had to do this and 2) how often. Thanks. Steve.
It depends on how good your Splunk guy is. I started out with about a 5x improvement in TTR right off the bat, with the savings just in gathering, combining, and time-sequencing logs, and it moved to about 10x shortly thereafter once I learned regex (for field extractions) and the stats command. It is much faster now that I am better at Splunk and use it for more than the Data Butler use case. I would guess that 10x is about average, but it can be much, MUCH higher.

For example, we had a cluster of authentication servers which were not stateful (any portion of the authentication could occur on any server), and we had dozens of these servers. Pre-Splunk, we had to call in the vendor to gather and aggregate the logs for us (they did not allow us onto their servers), and this took a whole day if it was not an emergency and at least an hour even if it was. Now we do it ourselves in near real time, and even the vendor uses our Splunk cluster for their own testing/debugging rather than doing it the old/normal way. What you get for "free" right out of the box is 90% of what you need in most firedrill situations, so you don't really have to be that good with Splunk; it is enough to be able to quickly answer these questions (which, as I said, you pretty much get with just inputs.conf).
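To make that concrete: once the auth server logs are coming in (a plain monitor stanza in inputs.conf was enough for us), the manual gather-and-aggregate step collapses into a search along these lines. The index, sourcetype, and the user/result extractions below are made-up placeholders, not our real ones:

  index=auth sourcetype=vendor_auth
  | rex field=_raw "user=(?<user>\S+)\s+result=(?<result>\w+)"
  | stats count by host, user, result
  | sort - count

The rex command pulls the fields out of the raw events from every server at once, and stats rolls them up by host and user, which is exactly the gathering/combining work that used to take the vendor a day.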
I realize Splunk can really save on TTR, but I'm still wondering what % of the time you still have to go outside Splunk to solve the problem in your environment, and why. Trying to understand the cases where it doesn't solve your problem.
Thanks a bunch!
I would say about 1/3 of the time and it falls into the following categories:
o Rarely, we need to prove that Splunk itself is wrong, so we go into another system that has a different view of the same data; if the two disagree (one system is wrong), we have to dig deep into both systems to figure out why (there is an example of the kind of sanity-check search we start with after this list).
o We have not (yet) put the other data that we need to see inside Splunk.
o We need to enable debug-level logging on some system because we need more detail than what it already sends to Splunk.
o We have pointed the finger at a system and we need to log in to that system to check the configurations that control it (confirmation/resolution stage).
o We think we understand the problem and the resolution, but we need to reproduce it in a test system and then test the fix (policy requirement in the confirmation/resolution stage).
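For the first category, the cross-check usually starts with a quick event-count summary on the Splunk side that we can line up against what the other system reports over the same time window. The index name here is just a placeholder:

  | tstats count where index=auth by host, sourcetype

If the per-host counts don't match what the source system says it sent, that's when we start digging into both sides.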