Splunk "Gotchas": What are little mistakes that are easy to make and difficult to notice that cause significant problems? Share your war stories!

Esteemed Legend

How often do you see a question in Answers that is actually asking you for problems, not solutions? I have been asked to create a presentation on "Gotchas" in Splunk: little mistakes that are easy to make, difficult to notice/identify and that cause significant problems. I will give you a few examples based on my own experience. The best gotchas are problems that are so devious that they would be reported to Splunk by opening a case as a bug but which turned out not to be a bug at all. If you are with Splunk support and you see the same problem reported every week or month, that is exactly the kind of thing I am trying to unearth, understand and present.

Gotcha #1: "key=value" search fails to match events even though they exist - This blog post describes exactly what happened to me:
http://blogs.splunk.com/2011/10/07/cannot-search-based-on-an-extracted-field/

Gotcha #2: File precedence problems (please share yours!) - I used a configuration to override host by setting it from a string inside each event, then tried to set TZ based on host. I was so sure the TZ setting would work (I had done it dozens of times) that I didn't even check. It turns out that, depending on how you do the host override, the override happens AFTER the TZ setting is applied! The host-based TZ setting was therefore skipped, and I had to move the TZ setting in props.conf from a host-based stanza header to a source-based stanza header.
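
For illustration, here is a minimal sketch of the kind of arrangement that works, assuming the host override is done with an index-time transform. The source path, transform name, regex, host pattern, and time zone are all hypothetical placeholders, not my original configuration.

```ini
# --- props.conf ---
# Source is already known when the timestamp is parsed, so TZ set here is honored.
[source::/var/log/myapp/*.log]
TZ = America/Chicago
TRANSFORMS-sethost = set_host_from_event

# A host-based stanza like this one is matched against the ORIGINAL host,
# not the value written by the transform, so its TZ is skipped for these events:
# [host::web-*]
# TZ = America/Chicago

# --- transforms.conf ---
# Hypothetical override: take the host from a "hostname=..." string in each event.
[set_host_from_event]
REGEX = hostname=(\S+)
FORMAT = host::$1
DEST_KEY = MetaData:Host
```

The point of the sketch is only the stanza header: key the TZ on something that is known before the override runs (here, source) rather than on the overridden host.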

Gotcha #3: Deployment Server swapout shenanigans - If you are a shell guy like I am, you like to see exactly when stuff happens by watching the files and waiting for them to update. If you take this approach and go into an app directory on a DS-controlled Splunk node, you will be waiting forever because of how Splunk swaps out the app. Here is what happens. You are in the shell on a Splunk Forwarder, in the $SPLUNK_HOME/etc/apps//local/ directory, and you are watching props.conf. You make a change to this file on the DS and deploy the change. The Splunk Forwarder checks in with the DS, sees that the app is out of date, and tries to remove the directory, but it cannot because you are parked in it. So instead, the forwarder renames the directory you are in (and makes a note to delete it as soon as it can), then creates a new app directory and deploys the change there. You think that you are in the directory that will change, but you have been moved to the trash and you don't even know it!

Gotcha #4: Deployment Server over-aggressiveness - In a troubleshooting/outage situation, it is often necessary to modify one forwarder but leave all the others alone. If you are not aware that the forwarder is being controlled by a DS, or if you do not understand how DS works, you may not know that your changes will be immediately overwritten. This is particularly confusing/disturbing/irritating/surprising if you combine this gotcha with #3! There are two ways to make a single-node local change and prevent it from being undone by the DS. The first is to cut the entire node off from the DS, either by blocking port 8089 (there are many methods to do this) or by deleting/modifying deploymentclient.conf. If you take this entire-node approach, you run one of two very dangerous risks: if you do it wrong, you could easily cause all apps to be deleted or, just as serious, you could miss a critical update from the DS for some other app that you are not debugging. Worst of all, you might forget to re-enable the DS connectivity when you are done debugging. Because of these dangers, I prefer the second way: disconnecting just the single app from DS control by adding this to "app.conf":


[install]
allows_disable = false

This makes the app unmodifiable by the DS, because the first thing a DS update does (must do) is disable the app so that nobody can use it; if it cannot disable the app, it cannot update the app!

OK, I gave you 4 of mine, now share your war stories and hard lessons learned: what Splunk Gotchas have gotten you?

Esteemed Legend

Just so y'all know, I have managed to put together a day-long training session on this topic that includes a whopping 36 different gotchas. I submitted this as a presentation for .conf 2015 (it will have to be brutally cut back), but I am willing to present this intermediate-to-advanced training as a webinar if anybody would like. I have been telling Splunk to create this training for almost a decade and I just got sick of waiting. The interesting thing about these 36 things is that none of them will be interpreted by Splunk as an error, so the problems are stealthy and usually overlooked (unless you know how to dig for them or happen to notice that your results are "off").

Communicator

One that I have run into quite a few times is the "Indexes searched by default" problem.
You know the data should be there but you don't see it. You check the config files, run btool, and everything is OK. Then, all of a sudden, it hits you: I put the data in a new index, I did not specify that index in my search, and the "Indexes searched by default" option is not set to "All non-internal indexes".
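
In configuration terms, the setting behind that option is, as far as I know, srchIndexesDefault in authorize.conf. A rough sketch, assuming a role named user and a hypothetical index called my_new_index (both are placeholders, so check the values against your own roles):

```ini
# --- authorize.conf ---
# Hypothetical role and index names, for illustration only.
[role_user]
# Indexes this role is allowed to search at all.
srchIndexesAllowed = *
# Indexes searched when the search string names no index.
# If my_new_index is missing here, a search without "index=my_new_index"
# silently returns nothing from it.
srchIndexesDefault = main;my_new_index
```

The other way out is simply to name the index explicitly in every search (index=my_new_index ...), which avoids depending on the role default at all.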

Influencer

Not quite what you're asking for, but you may also be interested in some of the "Things I wish I knew then": http://wiki.splunk.com/Things_I_wish_I_knew_then