Will splitting our data into separate indexes provide better performance of real-time searches?

campbellj1977 · ‎05-20-2015

We are currently running into issues where our indexers become overloaded and cannot keep up with all of the search and indexing work when real-time searches are abundant. We have traced this partly to real-time searches scanning raw data before it is indexed, alongside all of the other data combined. Our single index currently houses about 70% of all our incoming data. Logic tells me that smaller indexes should result in quicker searches and less incoming raw data to be searched. I tested this and it seems to be true, although I don't really trust the results from my lab, as it shares storage and CPU with other servers running in VMs.

I was hoping that someone from Splunk or the community could confirm my findings. My test is listed below.
Eventgen to create sample logs; 10 MB in a 5-minute span
postfix tcpdump to file, then to Splunk; 5 MB in about 5 minutes

With all the data going into the same index, searching for key values from the postfix tcpdump and/or eventgen data took about 30% less time to complete than when splitting the data into 2 separate indexes.
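Because the lab shares storage and CPU with other VMs, a single timed run can be misleading. A minimal sketch of how such a comparison could be made less noisy, assuming each search is repeated several times and the medians are compared. The function name and all timings below are hypothetical and are not measurements from this test.

```python
from statistics import median


def percent_less_time(baseline_runs, candidate_runs):
    """Return how much less time, in percent, the candidate layout took.

    Medians are compared so that one slow run caused by a noisy
    neighbour on shared hardware has less influence on the result.
    """
    baseline = median(baseline_runs)
    candidate = median(candidate_runs)
    if baseline <= 0:
        raise ValueError("baseline timings must be positive")
    return (baseline - candidate) / baseline * 100.0


# Hypothetical search durations in seconds (illustrative only).
two_indexes = [10.2, 9.8, 10.0, 14.1, 10.1]   # includes one noisy outlier
single_index = [7.1, 6.9, 7.0, 7.2, 6.8]

print(round(percent_less_time(two_indexes, single_index), 1))
```

A negative result would mean the candidate layout was slower than the baseline.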

So to sum up, will splitting our indexes provide better performance of real-time searches and less processing time?

MuS · ‎05-20-2015

Hi campbellj1977,

Well, this is tricky to answer for sure, because your findings are most likely true for ordinary searches, reports, alerts and dashboards, but not for real-time searches. This is because real-time searches search through events as they stream into Splunk Enterprise for indexing. When you kick off a real-time search, Splunk Enterprise scans incoming events that contain index-time fields indicating they could be a match for your search.

So it is likely that you will not get a performance benefit out of splitting. But how about adding a second indexer to improve overall performance, or troubleshooting what exactly gets your indexer overloaded or blocked? If you're on Splunk 6.2 you can use the internal Distributed Management Console for this: http://docs.splunk.com/Documentation/Splunk/6.2.3/Admin/ConfiguretheMonitoringConsole#What_is_the_di...

Take a look at the pipeline image below to get an overview of the Splunk input pipeline:

Regarding your comment about the forwarder: a universal forwarder will only skip part of the parsing queue and still uses all the other queues. If you're using a heavy forwarder in front, the data will skip parsing, merging and typing on the indexer, because the heavy forwarder runs those stages itself, up to tcpoutput in its indexerPipe. This would only help if your indexer has blocked parsing queues, because the heavy forwarder would take the parsing, merging and typing load.

Your real-time search will still pick up the events right at the stage where they stream into the indexer. But it may benefit from the fact that the indexer has less load when you're using the heavy forwarder in front.
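To make the division of work concrete, here is a toy model of which stages are left for the indexer depending on the forwarder in front of it. It is only a sketch of the description above, not a Splunk API: the stage names and the `indexer_stages` helper are assumptions made for illustration, and the universal forwarder is simplified to "does none of the stages" even though it skips part of parsing.

```python
# Simplified stage list, following the description in this answer.
STAGES = ["parsing", "merging", "typing", "indexing"]

# Stages handled by the forwarder before the data reaches the indexer.
# Assumption: the universal forwarder is modelled as handling none of them,
# although in practice it skips part of the parsing queue.
FORWARDER_STAGES = {
    "universal": [],
    "heavy": ["parsing", "merging", "typing"],
}


def indexer_stages(forwarder_type):
    """Return the stages the indexer still has to run itself."""
    already_done = FORWARDER_STAGES[forwarder_type]
    return [stage for stage in STAGES if stage not in already_done]


print(indexer_stages("universal"))
print(indexer_stages("heavy"))
```

In this model a heavy forwarder leaves only the indexing stage to the indexer, which is why it helps mainly when the parsing queues are the ones that are blocked.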

Hope this helps ...

cheers, MuS

campbellj1977 · ‎05-21-2015

What if data is already partially cooked from an intermediate forwarder?

MuS · ‎05-21-2015

See the updated answer.
