Multiple Large Data Sets

fk319
Builder

I have several sources of data flowing into my Splunk server, and some of the data sets exceed 1 GB per day.

What is the best way to keep the data separated so that searches are quicker?

I have defined apps and sourcetypes, but not knowing the internals, I'm not sure whether this is the right direction to go in.

1 Solution

Stephen_Sorkin
Splunk Employee

In general, for data volumes up to tens of GB per day there's no real advantage in separating the data to make search faster. There are some cases, however, where it makes sense to separate data into multiple indexes to gain "coherency" in the layout of data on disk to speed up raw data retrieval.

Specifically, if you have a low volume data set that's intermingled with a high volume data set and you commonly report on the entirety of the low volume data set, Splunk will have to decompress (and throw away) much of the high volume data set to get at the low volume one. In this case, segregating the lower volume data set into its own index can increase reporting performance from on the order of thousands of events per second to many tens of thousands of events per second.
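
As a concrete illustration, the low volume data set can be routed to its own index at input time. Here is a minimal sketch, assuming a hypothetical index named "lowvol" and a hypothetical sourcetype "firewall_audit" (neither name comes from this thread); adjust the paths and names for your own deployment:

    # indexes.conf (on the indexer) -- defines the dedicated index
    [lowvol]
    homePath   = $SPLUNK_DB/lowvol/db
    coldPath   = $SPLUNK_DB/lowvol/colddb
    thawedPath = $SPLUNK_DB/lowvol/thaweddb

    # inputs.conf (on the forwarder) -- routes the low volume source there
    [monitor:///var/log/firewall_audit.log]
    sourcetype = firewall_audit
    index = lowvol

The indexer needs a restart to pick up the new index definition.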

If your searches are always over a small, scattered fraction of the data, and you can isolate that set, putting it in a separate index will help. If your reports span many different small, scattered data sets without overlap, it's simplest and best to just keep the data in a single index.
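
Once the low volume data lives in its own index, reports can target it directly, so Splunk never has to touch the high volume buckets at all. For example, continuing with the hypothetical "lowvol" index and "firewall_audit" sourcetype above (the "action" field is likewise assumed for illustration):

    index=lowvol sourcetype=firewall_audit | stats count by action

Note that a newly created index is not part of your role's default search indexes, so the explicit index=lowvol is needed for the search to reach the segregated data.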
