Knowledge Management

Should summary indexing happen on the distributed searcher or on the indexer(s)?

hulahoop
Splunk Employee

Technically, summary indexing can be configured on either the search head or the indexing server(s). Are there advantages or disadvantages to having it on one versus the other?

1 Solution

gkanapathy
Splunk Employee

If you have only one indexer and one search head, there might be some advantage to storing the summaries on the search head, because when they are subsequently retrieved, you remove some of the search load from the indexer. But if the indexer is substantially faster (or has much faster disk) and/or is very lightly loaded, then this won't help. In general, this is simply a choice of whichever machine has more resources to spare at search time. (At summarization time, about the same work will be done on the indexer whichever way you do it.)

More interesting is the case where you have multiple indexers and a single search head (or many fewer search heads than indexers). It's really a judgement call, and it is much like the question of whether to run separate summarizations for daily data and for hourly data, or to report daily data by aggregating it up from 24 hourly summaries at search time.
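
To make the hourly-versus-daily comparison concrete, here is a minimal savedsearches.conf sketch, not a recommended configuration. The stanza names, the `web` index, the `access_combined` sourcetype, and the `status` field are all assumptions for illustration; `summary` is the default summary index. The first stanza writes hourly partial results with `sistats`; the second is an unscheduled report that rolls the previous day's 24 hourly summaries up at search time.

```ini
# savedsearches.conf on whichever host runs the summarization
# (search head, indexer, or dedicated summarizer).

# Hourly summarization job: runs at 5 past each hour over the previous hour.
[Summary - hourly web status counts]
search = index=web sourcetype=access_combined | sistats count by status
enableSched = 1
cron_schedule = 5 * * * *
dispatch.earliest_time = -1h@h
dispatch.latest_time = @h
# Write the results into the summary index instead of just returning them.
action.summary_index = 1
action.summary_index._name = summary

# Daily report: aggregates yesterday's hourly summaries at search time.
# Summary events carry the generating saved search's name as their source.
[Report - daily web status counts]
search = index=summary source="Summary - hourly web status counts" | stats count by status
dispatch.earliest_time = -1d@d
dispatch.latest_time = @d
```

The alternative described above would be a second scheduled summarization job that runs once a day, so the daily report reads one pre-aggregated summary rather than 24 hourly ones.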

Essentially, you're making a tradeoff between doing the additional aggregation at summarization time and doing it at search time. The significance of this tradeoff depends on the data and on the type of operations and stats you're summarizing. But usually you want to do the work at summarization time (after all, that's kind of the point of summary indexing), so in theory you should summarize on the search head. For light summarizations, this is probably the answer.

HOWEVER, if you are doing a lot of summarization and reporting a lot from those summaries, you could easily overrun the search head with summarization jobs and queries on the local indexes, and leave it with little capacity to serve the routine interactive user searches. (Search heads are not often specified for high indexing and disk loads.) It's also typically the case that the indexers will have more capacity for this kind of workload. In many cases, then, the actual distribution of hardware resources and workload means it makes sense not to involve the search head in summarization, and instead to run the summaries on the indexers and aggregate the results at search time.

My favorite (and most hardware-intensive) solution, if you are indeed doing a lot of summarization, is to use a dedicated summarizer. This is a machine that has reasonably fast disk to write and retrieve summarized data, and plenty of CPU to execute search jobs (sort of a hybrid search head and indexer). You schedule and run summarization jobs here, where they won't interfere with the interactive user searches, and you also aggregate and store the results here so that you don't have to bother the indexers or perform a final aggregation at search time. (It would be possible to go further and use multiple dedicated summarizers, but you should have called in Splunk PS long before you get to that point.)

With respect to management, it will probably be easier to schedule, run, check, backfill, and otherwise work with summarization on a single search head or dedicated summarizer (though tools and infrastructure like Deployment Server for managing an indexing server cluster will diminish this advantage).

jrodman
Splunk Employee

When summarizing on the search head or on a dedicated summarization host, I often prefer to have the summary-generating nodes forward their data to the indexers to store and search.
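
For anyone wanting to try this, a rough outputs.conf sketch for the summary-generating node might look like the following. The group name `my_indexers`, the host names, and port 9997 are placeholders, and the receiving port has to be enabled on the indexers; treat it as a starting point under those assumptions rather than a tested configuration.

```ini
# outputs.conf on the search head or dedicated summarizer.

# Do not keep a local copy of indexed data on this node.
[indexAndForward]
index = false

[tcpout]
defaultGroup = my_indexers
# Disable the default forwarded-index filters so that all local indexes,
# including the summary index, are forwarded.
forwardedindex.filter.disable = true
indexAndForward = false

# Hypothetical indexers; data is load-balanced across the listed servers.
[tcpout:my_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
```

With this in place the summaries are written by the summarizing node but stored on the indexers, so reports against the summary index are served by the indexers like any other distributed search.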
