Deployment Architecture

Multiple Large Data Sets

fk319
Builder

I have several sources of data that run into my Splunk server, and some of the data sets exceed 1 GB per day.

What is the best way to keep the data separated so that searches are quicker?

I have defined apps and sourcetypes, but not knowing the internals, I'm not sure if this is the right direction to go in.

1 Solution

Stephen_Sorkin
Splunk Employee

In general, for data volumes up to tens of GB per day there's no real advantage in separating the data to make search faster. There are some cases, however, where it makes sense to separate data into multiple indexes to gain "coherency" in the layout of data on disk to speed up raw data retrieval.

Specifically, if you have a low-volume data set that's intermingled with a high-volume data set and you commonly report on the entirety of the low-volume data set, Splunk will have to decompress (and throw away) much of the high-volume data set to get at the low-volume one. In this case, segregating the lower-volume data set into its own index can increase reporting performance from on the order of thousands of events per second to many tens of thousands of events per second.
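
For illustration, here is a minimal sketch of how a lower-volume data set could be routed to its own index; the index name "low_volume", the monitor path, and the sourcetype are hypothetical:

# indexes.conf (on the indexer) -- define a dedicated index
[low_volume]
homePath   = $SPLUNK_DB/low_volume/db
coldPath   = $SPLUNK_DB/low_volume/colddb
thawedPath = $SPLUNK_DB/low_volume/thaweddb

# inputs.conf -- route the low-volume source to that index at ingest time
[monitor:///var/log/app/audit.log]
index = low_volume
sourcetype = app_audit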

If your searches are always over a small, scattered fraction of the data, and you can isolate that set, putting it in a separate index will help. If your reports are over many different small, scattered data sets, without overlap, it's simplest and best to just keep the data in a single index.
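
Once the low-volume data lives in its own index, scoping a search to it means Splunk reads only that index's buckets rather than decompressing intermingled high-volume events; a sketch, using the hypothetical index and sourcetype from above:

index=low_volume sourcetype=app_audit earliest=-24h
| stats count by host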

