Deployment Architecture

Distributed Splunk workflow understanding

guahos
Explorer

Hello Splunkers!
I am currently setting up a distributed Splunk system in our company.
It consists of: 2 Indexers and a Cluster Master Node, a standalone Search Head and a standalone Deployer/License Master.
Please help me to clarify the logic behind such system.
As far as I understand it currently, the complete workflow looks like the following:
1. Forwarders send data directly to the Indexers (to both of them in turn, as configured with inputs.conf), using TCP:9997 for that type of communication.
2. After the data reaches one of the Indexers, it gets indexed first, then the Indexers replicate the received data to each other, using TCP:8080 for that.
That's it with data getting into Splunk.
After it's indexed, we can start searching, and here is how I think that works:
3. We get into the Search Head via the web UI, using TCP:8000, then we type a query and the search itself begins.
4. The Search Head tells the Master Node exactly what kind of data needs to be found, using TCP:8089.
5. Then the Master Node tells the Indexers what data they need to give back, again using TCP:8089.
6. Then both Indexers (Search Factor = Replication Factor = 2) send the search results to the Master Node simultaneously (improving the search speed) via TCP:8089.
7. And after that, the Master Node finally sends the search results to the Search Head, again via TCP:8089, where they are available to the user.

If that is all described correctly, then I have one more question, on license counting: does each Indexer separately tell the License Master how much data it has indexed, or does the Cluster Master Node tell the License Master the amount of data indexed by the Indexers?

1 Solution

martin_mueller
SplunkTrust

Steps 1 through 3 are pretty much correct, except that forwarders of course configure their output in outputs.conf.
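
To make that concrete, here is a minimal sketch of the forwarder/indexer pairing (the hostnames and the output group name are placeholders, not anything from your environment):

# outputs.conf on each forwarder - load-balances across both indexers over TCP:9997
[tcpout]
defaultGroup = my_indexers

[tcpout:my_indexers]
server = indexer1.example.com:9997, indexer2.example.com:9997

# inputs.conf on each indexer - listens for forwarder traffic on TCP:9997
[splunktcp://9997]
disabled = 0

With both indexers listed, the forwarder switches between them automatically (auto load balancing), which is the "by turn" behaviour you described.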
The rest is a bit of a mess; your understanding of the master node is off. The master node does not participate in the actual searching of data. All it does is tell the search head(s) about the indexers, make sure the replication and search factors are met between the indexers, and ensure the indexers have correct and identical configurations.
For searches, the search head tells the indexers what data it needs. The indexers then fetch the data, perform preprocessing and aggregation where possible (map step), and return the results to the search head for final processing and aggregation (reduce step).
Having two indexers speeds up the map step through horizontal scaling: each indexer receives and searches only half the data.
Having a search factor of 2 does not speed things up, both indexers are busy serving "their own" data (in Splunk terms: primary copies of buckets) while replicated data is not searched.
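
To make the master node's role concrete as well, the clustering side is just configuration; here is a rough sketch using the 6.x-era setting names (the hostnames and the shared key are placeholders):

# server.conf on the cluster master - holds the factors, does not index or search data itself
[clustering]
mode = master
replication_factor = 2
search_factor = 2
pass4SymmKey = <your shared secret>

# server.conf on each indexer (cluster peer) - points at the master and opens the replication port
[clustering]
mode = slave
master_uri = https://master.example.com:8089
pass4SymmKey = <your shared secret>

[replication_port://8080]

The peers talk to the master over its management port (TCP:8089) and replicate bucket copies to each other over the replication port (TCP:8080 in your numbering). Searches themselves never flow through the master.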


guahos
Explorer

Thank you, Martin and dwaddle, you have really opened my eyes to how map-reduce works! 🙂
However, I am confused about why a search factor of 2 does not speed things up.
As far as I see it, if we have 4 data blocks indexed (A, B, C and D) on 2 indexers with replication factor = search factor = 2, both indexers will contain the same 4 data blocks because of the replication, but the first indexer will have data blocks A and B as primary and C and D as non-primary, while the second indexer will have C and D as primary and A and B as non-primary. So, during the map phase, the Search Head will send a request to both indexers to look for some data across all 4 data blocks, and the first indexer will only have to search data blocks A and B while the second indexer simultaneously only has to search data blocks C and D.
So overall, running the same search on a single indexer would take twice as long as the map phase takes in our case, with 2 indexers clustered and replication factor = search factor = 2. Where am I wrong?


martin_mueller
SplunkTrust

To be as concise as possible, the speed gain in your example stems from distributed search, not from clustering / replication. If your RF/SF both were 1, the speed would be the same.

dwaddle
SplunkTrust

Let's speak a bit more precisely to help explain. Say we have buckets A, B, C, and D and an SF=2 - so we could call the 4 buckets and their redundant copies A1, A2, B1, B2, C1, C2, D1, D2. Within the clustering system, one copy (and only one copy) of each bucket will be flagged as primary. We'll mark that as (P). So one valid configuration is A1, A2(P), B1, B2(P), C1(P), C2, D1, D2(P). Splunk only ever searches primary buckets. Non-primary copies of buckets are ignored from a search point of view.

The reason clustering has the concept of "primary buckets" is to avoid duplicate events during a search. If we use our "average session count" example above - by the time data gets back to the search head, the 'identity' of which bucket(s) on the indexer it came from is lost. And indexers do not have a way to cross-communicate to make sure they don't return duplicate events from different copies of the same bucket. So Splunk avoids the issue by only having one primary copy of each bucket, and only the primary is searched.
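
If you ever want to look at this bucket layout on your own cluster, the dbinspect command is a rough window into it; the exact fields it returns vary by Splunk version, so take this as a pointer rather than a recipe (substitute one of your own indexes for main):

| dbinspect index=main

Run from the search head, it returns bucket-level information for the index from each indexer, which makes the idea of multiple copies of the same bucket living on different peers a lot less abstract.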

Historically, Splunk had distributed search before it had clustering. Clustering provides redundancy, distributed search provides map-reduce. There is definitely an effect of "more indexers leads to increased search performance" but it has nothing at all to do with clustering - it is part of the nature of distributed search.

The "performance effect" of having more indexers is not a function of the search factor at all - it is a function of the number of indexers and the number of buckets overall. By adding indexers, we cause forwarders to "spray wider" at data ingest time - this causes more buckets overall in the system. More buckets overall in the system still means that each bucket only has a single primary, searchable copy. But, with more buckets in the system overall, we have a situation where the primary copies of the larger number of buckets are spread out over more nodes. This gives each node a smaller amount of data to have to work through from a search perspective and therefore better performance.

(The first time I wrote this was better, sorry)


dwaddle
SplunkTrust

Test, I wonder if my other comment got lost. Clearly it did; I'll rewrite it.


dwaddle
SplunkTrust

Martin is (as always) entirely correct. 🙂

As far as the map phase and reduce phase go, sometimes it's useful to elaborate with an example. Suppose you have data in Splunk where each event stores a simple integer value like session_count. What you need to compute is the average (arithmetic mean) of session_count. We know we can algebraically define avg(session_count) as sum(session_count) / count(session_count).

In a "traditional" (non map-reduce) world you might compute this by directly computing the sum(session_count) and count(session_count) in a straightforward sequential manner.

But, given your two indexers, you can have each of them compute the correct subtotals in a parallel, distributed manner. You request that each indexer calculate sum(session_count) as sub1 and count(session_count) as sub2 independently (map phase), and then the search head computes sum(sub1) / sum(sub2) (reduce phase). This is algebraically the same, but from a computation point of view it is faster because of the parallelism of multiple indexers working on the map phase at the same time.
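
To put that into search language (the index, sourcetype and field names below are placeholders for whatever your data actually looks like), these two searches are algebraically equivalent; the first is what you would normally write, and the second spells out the sum/count split that the map and reduce phases effectively perform between the indexers and the search head:

index=web sourcetype=app_sessions | stats avg(session_count)

index=web sourcetype=app_sessions
| stats sum(session_count) AS total count(session_count) AS n
| eval avg_session_count = total / n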

Masa
Splunk Employee

Not much to say here;

Just adding a doc link related to how search works with a search head in indexer clustering:
http://docs.splunk.com/Documentation/Splunk/6.4.2/Indexer/Howclusteredsearchworks
