Solved: Multiple Large Data Sets

fk319 · ‎01-12-2011

I have several sources of data that run into my Splunk server, and some of the data sets exceed 1 GB per day.

What is the best way to keep the data separated so that searches are quicker?

I have defined apps and sourcetypes, but not knowing the internals, I'm not sure if this is the right direction to go in.

Stephen_Sorkin · ‎01-12-2011

In general, for data volumes up to tens of GB per day there's no real advantage in separating the data to make search faster. There are some cases, however, where it makes sense to separate data into multiple indexes to gain "coherency" in the layout of data on disk to speed up raw data retrieval.

Specifically, if you have a low volume data set that's intermingled with a high volume data set and you commonly report on the entirety of the low volume data set, Splunk will have to decompress (and throw away) much of the high volume data set to get at the low volume one. In this case, segregating the lower volume data set into its own index can increase reporting performance from on the order of thousands of events per second to many tens of thousands of events per second.
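To make the intermingling effect concrete, here is a toy model in Python. It is only a sketch under stated assumptions: fixed-size "blocks" stand in for compressed chunks of raw data, and `BLOCK_SIZE`, the event counts, and the `blocks_touched` helper are all made up for illustration, not Splunk internals or measured figures.

```python
# Toy model only: fixed-size "blocks" stand in for compressed chunks of raw data.
# BLOCK_SIZE and the event counts are made-up numbers, not Splunk internals.

BLOCK_SIZE = 1000  # events per compressed block (assumption)


def blocks_touched(events, wanted):
    """Count blocks holding at least one wanted event, and the events inside them."""
    touched = 0
    decompressed = 0
    for start in range(0, len(events), BLOCK_SIZE):
        block = events[start:start + BLOCK_SIZE]
        if wanted in block:
            # The whole block must be decompressed to reach the wanted events.
            touched += 1
            decompressed += len(block)
    return touched, decompressed


# One shared index: every 100th event belongs to the low-volume source.
mixed = ["low" if i % 100 == 0 else "high" for i in range(1_000_000)]

# Separate index: the same low-volume events stored on their own.
separate = [e for e in mixed if e == "low"]

mixed_blocks, mixed_events = blocks_touched(mixed, "low")
sep_blocks, sep_events = blocks_touched(separate, "low")

print(f"shared index:   {mixed_blocks} blocks, {mixed_events} events decompressed")
print(f"separate index: {sep_blocks} blocks, {sep_events} events decompressed")
```

In this toy setup, reporting on all 10,000 low-volume events means decompressing every block of the shared layout, but only the 10 blocks that hold them in the separate layout. The real gain depends on the actual data mix and on how the raw data is laid out on disk.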

If your searches are always over a small, scattered fraction of the data, and you can isolate that set, putting it in a separate index will help. If your reports are over many different small, scattered data sets, without overlap, it's simplest and best to just keep the data in a single index.
