Our enterprise has two data centers, and each data center has a Splunk indexing cluster. Data is replicated between the clusters. There are search heads at each data center, each with search affinity for their local indexers. I am an admin of only two search heads, supporting a particular division within the enterprise. We have had a lot of personnel turnover recently, so I am trying to help support the operations team.
Several months ago, we had only one indexing cluster at one site, and I was managing just a single search head. On this search head, we have a number of saved searches that run every five minutes to look for particular events. These searches were crafted to look at events indexed within the last five minutes (so something like
earliest=-3mon latest=now _index_earliest=-5m _index_latest=now ...). The matching events from these searches are fed into a summary index that is monitored by a bot that pulls new events and feeds them into a ticketing system. I periodically checked in on the searches and double-checked that they were catching all events, so I can say with confidence that things were working as desired.
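For context, a full saved search of that shape looks roughly like this (just a sketch - the index, sourcetype, and field names here are placeholders, not our real ones):
earliest=-3mon latest=now _index_earliest=-5m _index_latest=now index=division_data sourcetype=app_events severity=critical | fields _time host source message | collect index=my_summary_index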
Now that we have the two indexing clusters, I have all of the saved searches running on a search head at datacenter2, and I discovered that the searches were missing a very high percentage of events. I am trying to figure out why this is happening and how best to reduce the time between Primary Event -> Splunk Indexing -> Splunk Scheduled Search -> Summary Indexing -> Bot Search.
Here's the kicker on all of this. If I generate a test event and
collect it directly into the summary index with
| makeresults | eval test_field="findme" | collect index=my_summary_index, I see a very reasonable indexing latency of ~2 sec by comparing the _time and _indextime fields in the event. But when I generate the test event and immediately begin searching for it, the event will not show up in searches for approximately 5 minutes. So, for example, if I generate the test event at 13:05:04, the
_time field will be 13:05:04, and the
_indextime field will be 13:05:06, but if I am (over and over) running a search known to match the event, the search will show no results until I run it at approximately 13:10.
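For anyone wanting to reproduce either measurement, the search I keep re-running is along these lines (my_summary_index and test_field as in the collect example above; _time and _indextime are the standard built-in fields):
index=my_summary_index test_field=findme | eval index_lag=_indextime-_time, searchable_lag=now()-_indextime | table _time _indextime index_lag searchable_lag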
The event is being generated on a search head at datacenter2, sent to the indexing cluster at datacenter2, and searched for on the same search head at datacenter2 - though if it matters, the results are the same if I generate the event on the search head at datacenter1 and/or search for the event on the search head at datacenter1.
Thus far, we've mitigated this impact by altering our saved searches to run like this:
earliest=-3mon latest=now _index_earliest=-10m _index_latest=-5m ..., but then we also have to force our bot to run with a similar delay, which means events that used to reach the analyst ticketing system within 10 minutes of the Primary Event now arrive more like 20+ minutes later. It's a lot harder for the customer to stomach this delay, given that they used to get their alerts so much more quickly. I welcome any advice, but I apologize in advance that I don't have access to any of the indexers, so I'll be slow to provide conf files and such if requested. I can ask the ops team to get them for me, but their turnaround time is not very quick. Thanks in advance!
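For completeness, the bot's pull from the summary index had to take the matching shift, so it is now shaped something like this (a sketch with placeholder fields, not the bot's actual query):
index=my_summary_index _index_earliest=-10m _index_latest=-5m | table _time _indextime source message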
We went through a similar realization in How can we avoid data loss in the summary indexes when there is an indexing latency in the cluster?
Customers are really upset to hear that their summary indexes are almost two hours old ;-) Real-time searches, maybe?
We don't have enough resources to run real-time searches for these events, unfortunately.
I read through your post before I created this one, but I wondered if the results/ideas might differ based on the following:
It would make more sense to me if I was creating the event on one indexing cluster and then having to wait for it to arrive on the other cluster - data replication takes time; I get that. But I'm directing it to the same cluster as it always used, and I'm searching with a search head that has a search affinity for that same cluster.
You are experiencing latency due to search affinity. One of your search heads will be able to see the data immediately, and the other will see the data after a delay.
Note: Hot bucket data is replicated in blocks, as described in "How clustered indexing works". If a local search involves a replicated hot bucket copy, where the origin copy is on a different site, there might be a time lag while the local peer waits to get the latest block of hot data from the originating peer. During this time, the search does not return the latest data.
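One way to confirm this from the search head side (a suggestion on my part, not from the docs above): every returned event carries the splunk_server field naming the peer that served it, so you can see which site's peers are actually answering with something like
index=my_summary_index test_field=findme | stats count by splunk_server
If the test event only becomes searchable once the local replicated copy catches up with the origin peer's latest hot block, that points at search affinity plus block replication lag rather than indexing latency.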