Every so often, our search head cluster (6.5.2) switches captain. Whenever this happens, or possibly for other reasons, our realtime scheduled searches get replicated. We have three search heads, so at the moment we have three copies of each of our realtime searches. Because of this, we get flooded with extra alerts in the form of emails, or other actions that these realtime searches were configured to initiate.
Is there a way to ensure that only one copy of each search gets scheduled at a time (even for realtime searches)? Ideally, we also want the searches to fail over to an alternate search head if that one goes down. Maybe this is just a bug.
Also, what is the best way (balancing speed and safety) to clean up the dozens of duplicate realtime jobs when this does happen? If the only way is to restart one or more search heads, then so be it.
For what it's worth, you could use the curl command in TA-webtools to "kill" one of the two realtime searches when this happens.
It would look something like this:
... search that identifies the SID of the realtime search you want to kill and the host it is running on (maybe search to identify all searches with same name that are currently running + a dedup command, etc) ... | map [ |curl method=post splunkauth=t uri=https://$host$:8089/services/search/jobs/$SID$/control?action=cancel ]
maybe like this:
| rest /services/search/jobs | table sid label splunk_server | search sid=rt* | dedup label splunk_server | map [|curl method=post splunkauth=t uri=https://$splunk_server$:8089/services/search/jobs/$sid$/control?action=cancel ]
Only you probably want to add some logic that targets the older SIDs instead of relying on dedup.
Something like this:
| rest /services/search/jobs | fields sid splunk_server label published | eval published=strptime(published,"%FT%T") | eventstats max(published) as latest by label | where published!=latest AND label="SAVEDSEARCH NAME TO PRUNE" | map [|curl method=post splunkauth=t uri=https://$splunk_server$:8089/services/search/jobs/$sid$/control?action=cancel ]
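To make the selection step easier to reason about outside of SPL, here is a rough Python sketch of the same idea: keep the most recently published job per label and treat the rest as candidates to cancel. The field names mirror the ones in the search above; the sample jobs, SIDs, and host names are made up for illustration, and the timestamp parsing assumes all search heads report the same UTC offset.

```python
from datetime import datetime

def jobs_to_cancel(jobs):
    """Return every job that is not the newest one for its label."""
    newest = {}  # label -> (published timestamp, sid) of the newest job seen
    for job in jobs:
        # Parse only the "YYYY-MM-DDTHH:MM:SS" prefix; the UTC offset is
        # ignored (assumption: all search heads use the same offset).
        ts = datetime.strptime(job["published"][:19], "%Y-%m-%dT%H:%M:%S")
        if job["label"] not in newest or ts > newest[job["label"]][0]:
            newest[job["label"]] = (ts, job["sid"])
    keep = {sid for _, sid in newest.values()}
    return [job for job in jobs if job["sid"] not in keep]

# Hypothetical sample data, shaped like rows from | rest /services/search/jobs
jobs = [
    {"sid": "rt_a_1", "label": "Alert A", "splunk_server": "sh1",
     "published": "2017-03-01T10:00:00.000-05:00"},
    {"sid": "rt_a_2", "label": "Alert A", "splunk_server": "sh2",
     "published": "2017-03-01T11:30:00.000-05:00"},
    {"sid": "rt_b_1", "label": "Alert B", "splunk_server": "sh3",
     "published": "2017-03-01T09:00:00.000-05:00"},
]

for job in jobs_to_cancel(jobs):
    print(job["splunk_server"], job["sid"])
```

Each job it returns would then be cancelled on its own search head, as the map/curl part of the search does.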
Do you truly need them to be realtime? ;-)
Hahaha! I guess it's unanimous. Don't use realtime. Maybe "rt" should be a deprecated feature?
To be honest, I almost never give out schedule_rtsearch. When people ask for rt jobs, I generally tell them to write a 2 page business justification for why rt is needed instead of running every few minutes. Then I tell them that their justification will be used as part of a proposal to purchase more hardware.
Splunk will never be a sub-M trading system. I get it. But we do have cases of automation where one minute is just too long. As I said, I would be happy with 10 seconds (think automated response systems). In some cases one minute feels like batch on a mainframe.
Even with the limitations of realtime searches, they fit a niche. We have enough hardware for our current load. In fact, we were running fine even with all our realtime jobs running in triplicate.
I tried quickly disabling and re-enabling one of the rt jobs running in triplicate and it became one job running on the Captain. It seemed like this cleared up the other jobs as well. Now all the rt jobs are running on the Captain. I suppose I should just open a case on that. Maybe if I force a Captain switch, I can duplicate the issue.
In the meantime I'll create a job that runs every 5 minutes, looks for multiple copies of realtime jobs, and then runs a script to quickly disable and re-enable an rt search. Maybe I just have to modify an rt search for the Captain to do its job.
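The detection half of that job could look roughly like the Python sketch below. It assumes the job list has already been fetched (for example from /services/search/jobs) and that realtime SIDs start with "rt", as in the sid=rt* filter used earlier in this thread. The function name and the sample data are hypothetical.

```python
from collections import Counter

def duplicated_realtime_labels(jobs):
    """Return labels that have more than one realtime job running."""
    counts = Counter(
        job["label"] for job in jobs if job["sid"].startswith("rt")
    )
    return sorted(label for label, n in counts.items() if n > 1)

# Hypothetical sample: "Alert A" runs on all three search heads.
jobs = [
    {"sid": "rt_a_1", "label": "Alert A"},
    {"sid": "rt_a_2", "label": "Alert A"},
    {"sid": "rt_a_3", "label": "Alert A"},
    {"sid": "rt_b_1", "label": "Alert B"},
    {"sid": "scheduler_c_1", "label": "Report C"},
    {"sid": "scheduler_c_2", "label": "Report C"},
]

print(duplicated_realtime_labels(jobs))
```

Only the labels it returns would need the disable/re-enable treatment; non-realtime jobs are ignored even if they share a label.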
Read what I said again. You are believing a lie. The ONLY time that standard realtime really does what people think that it can/does is if you are using DATETIME_CONFIG=CURRENT. Are you? (I already know: you aren't.) If you are setting _time to when the event really happened and you are doing a realtime search, you almost certainly are missing events that you assume you are seeing. The shorter the window, the more events that you are missing. You are using a window << 1 minute, so your search is surely intensely blind, but you don't know it. It is BAD MOJO all the way around.
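A toy model of that effect, in Python: treat each event as having a lag (arrival time minus the _time taken from the event itself) and assume a windowed realtime search only sees events whose lag still fits inside the window. This is a deliberate simplification for illustration, not how Splunk implements realtime internally, and the lag values are invented.

```python
def visible_events(lags_seconds, window_seconds):
    """Count events whose lag still fits inside the realtime window."""
    return sum(1 for lag in lags_seconds if lag <= window_seconds)

# Hypothetical lags (seconds) between an event's _time and its arrival.
lags = [2, 8, 15, 40, 70]

for window in (10, 60):
    seen = visible_events(lags, window)
    print(f"window={window}s seen={seen} missed={len(lags) - seen}")
```

With these made-up lags, the 10-second window misses three of the five events while the 60-second window misses one, which is the "shorter window, more missed events" point in miniature.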
I concur; why realtime? It is nothing but trouble and it is a false paradigm, prone to dropouts, misunderstanding and misuses of all kinds.
I wish there was a way to run a scheduled search every 10 seconds, but we have a number of use cases where a minute is too long. Maybe we have to build something or use another product.
Agreed! We allow one or two real-time searches in our environment - total. I even discourage that. I haven't seen a single RT search yet that couldn't actually be handled with a 1 minute scheduled search, except for when you are doing an adhoc, debugging, only for a few minutes, reasonable search. Unless you are monitoring something like Space Shuttle Re-entry data, RT is just hard on your servers.
I agree with dddillic. You almost NEVER need realtime searches. Run them over short time ranges on a very short schedule, like over the last 15 minutes every 5 minutes.