Deployment Architecture

In a multisite cluster, does Splunk replicate just the data to the remote site, and does the remote site indexer then index it?

dokaas_2
Communicator

In a multisite cluster Splunk replicates the data to the remote site, but does Splunk also replicate the index information, or is indexing left up to the remote site indexer? Also, does Splunk replicate raw data or compressed data?

0 Karma

hrawat
Splunk Employee

Excellent question.
> or is indexing left up to the remote site indexer?
Left up to the remote site indexer. With RF 3, the remote is indexing twice compared to the source. Overall, each node indexes an additional 2 replicated slices.

>Also, does Splunk replicate raw data or compressed data?
If SSL is enabled on the indexers, replicated data is compressed when the following settings in server.conf are set to true.

Under stanza

[replication_port-ssl://<port>]
useSSLCompression = <boolean>
* If true, enables SSL compression.
* Default: false

compressed = <boolean>
* DEPRECATED; use 'useSSLCompression' instead.
* Used only if 'useSSLCompression' is not set.

Under stanza

[sslConfig]
allowSslCompression = <boolean>
* If set to "true", the server allows clients to negotiate
  SSL-layer data compression.
* KV Store also observes this setting.
* If set to "false", KV Store disables TLS compression.
* Default: true

 

0 Karma

dokaas_2
Communicator

Thanks, but after a little further research, that seems inconsistent with https://docs.splunk.com/Documentation/Splunk/latest/Indexer/Howclusteredindexingworks. Of course, this does not specifically address multisite clusters, but I'd assume it would apply. What do you think?

++++++++++++++++++++++++++++++++++++++
These events occur when a peer node starts up:
1. The peer node registers with the master and receives the latest configuration bundle from the master.
2. The master rebalances the primary bucket copies across the cluster and starts a new generation.
3. The peer starts ingesting external data, in the same way as any indexer. It processes the data into events and then appends the data to a rawdata file. It also creates associated index files. It stores these files (both the rawdata and the index files) locally in a hot bucket. This is the primary copy of the bucket.
4. The master gives the peer a list of target peers for its replicated data. For example, if the replication factor is 3, the master gives the peer a list of two target peers.
5. If the search factor is greater than 1, the master also tells the peer which of its target peers should make its copy of the data searchable. For example, if the search factor is 2, the master picks one specific target peer that should make its copy searchable and communicates that information to the source peer.
6. The peer begins streaming the processed rawdata to the target peers specified by the master. It does not wait until its rawdata file is complete to start streaming its contents; rather, it streams the rawdata in blocks, as it processes the incoming data. It also tells any target peer(s) if they need to make their copies searchable, as communicated to it by the master in step 5.
7. The target peers receive the rawdata from the source peer and store it in local copies of the bucket.
8. Any targets with designated searchable copies start creating the necessary index files.
9. The peer continues to stream data to the targets until it rolls its hot bucket.
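
The streaming steps quoted above can be sketched as a toy model. This is only an illustration of the described flow, not Splunk code: the class `BucketCopy`, the function `stream_hot_bucket`, and the peer names are hypothetical, and "index entries" is just a stand-in counter for the index files.

```python
from dataclasses import dataclass, field


@dataclass
class BucketCopy:
    """Hypothetical model of one copy of a hot bucket on one peer."""
    peer: str
    searchable: bool
    rawdata: list = field(default_factory=list)  # rawdata blocks received so far
    index_entries: int = 0                       # stand-in for index files built locally

    def receive(self, block: str) -> None:
        # Every copy stores the rawdata block.
        self.rawdata.append(block)
        # Only searchable copies also build index files from it.
        if self.searchable:
            self.index_entries += 1


def stream_hot_bucket(blocks, source, targets, searchable_targets):
    """Stream rawdata block by block from the source peer to its target peers."""
    primary = BucketCopy(source, searchable=True)  # the primary copy is always searchable
    copies = [BucketCopy(t, searchable=(t in searchable_targets)) for t in targets]
    for block in blocks:
        primary.receive(block)       # source writes rawdata and index files locally
        for copy in copies:
            copy.receive(block)      # targets get rawdata only; indexing is done locally
    return primary, copies


# RF=3, SF=2: two targets, one of which must make its copy searchable.
primary, copies = stream_hot_bucket(
    ["block1", "block2", "block3"], "peerA", ["peerB", "peerC"], {"peerB"}
)
```

In this model only rawdata crosses the wire; `peerB` builds its own index entries and `peerC` keeps a rawdata-only copy.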
0 Karma

nickhills
Ultra Champion

Starting with Splunk 6, there were replication optimisations added which would copy tsidx files (if available) to save the effort of regenerating them. (Can't find link, will update).

The above is totally correct for new data. Let me see if I can find some detail on the tsidx copy process. (AFK)

If my comment helps, please give it a thumbs up!
0 Karma

nickhills
Ultra Champion

Found it,
https://www.splunk.com/en_us/blog/tips-and-tricks/clustering-optimizations-in-splunk-6.html

I'll update my answer to clarify. The behaviour differs depending on whether it's normal replication or replication following a recovery; my answer focused on the latter. Hopefully my edit will make sense.

0 Karma

dokaas_2
Communicator

Thanks. I can see the benefits and pitfalls of both ways.

So if the data is replicated to the remote site (target) to be indexed, assuming two sites, does one need to size the number of indexers based on the combined ingestion rate at both sites?

0 Karma

nickhills
Ultra Champion

Technically, the answer is "yes". It's a good idea to make sure that all your indexers have comparable specs across sites; however, the number of indexers at each site can vary depending on that site's ingest rate.

Depending on how many search users site 2 has and/or whether you are using search affinity, you can set your SF for remote sites accordingly. There is no point keeping high SF values at remote sites if you never search site 1 data from site 2, and lowering them helps reduce the burden.

0 Karma

dokaas_2
Communicator

Two sites, RF=2 SF=2 + Splunk SmartStore. Design for 3TB/day at one site and 2TB/day at the second site (5TB/day total). Both sites will search across all data.
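
As a rough back-of-the-envelope sketch of what those numbers imply, the snippet below assumes that RF=2/SF=2 across two sites results in one searchable copy of every bucket at each site, and it ignores any SmartStore effects. The function name `per_site_daily_load` and the site labels are hypothetical, and this is not an official sizing method.

```python
def per_site_daily_load(ingest_tb):
    """Estimate daily data handled per site.

    ingest_tb maps site name -> TB/day ingested locally at that site.
    Assumption: every site ends up holding one copy of every bucket,
    so each site also receives all data ingested at the other sites.
    """
    total = sum(ingest_tb.values())
    load = {}
    for site, local in ingest_tb.items():
        load[site] = {
            "local_ingest_tb": local,             # parsed and indexed as the origin
            "replicated_in_tb": total - local,    # streamed in from the other site(s)
            "total_written_tb": total,            # local plus replicated
        }
    return load


load = per_site_daily_load({"site1": 3.0, "site2": 2.0})
for site, figures in load.items():
    print(site, figures)
```

Under these assumptions each site writes the combined 5 TB/day, which is why the combined rate matters when sizing, even though the replicated portion skips the parsing work done at the origin.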

0 Karma

nickhills
Ultra Champion

There are two scenarios in which replication occurs.

Normal Replication
In normal circumstances Splunk will replicate data to cluster peers as it is indexed. It does this by streaming the rawdata to the peers according to the replication factor.
If a given peer is tasked with making a bucket searchable, it will begin the process of generating the tsidx file for that bucket.

Failure/Recovery Replication

In the event that a peer is replicating data to peers in order to meet SF/RF following a member failure, Splunk will replicate the rawdata for replicated buckets, and both the rawdata & tsidx file for searchable buckets.

This means that if your Site 2 cluster of 3 peers has rf=2 sf=1, then one indexer will receive both the rawdata and the index (per bucket) and one peer will just receive the rawdata.

In all cases, the raw data is replicated from the journal file, which is compressed.
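
To make the difference between the two scenarios concrete, here is a minimal sketch of which files a target peer receives in each case. The function `files_sent` and the scenario labels are hypothetical names used for illustration; they are not part of any Splunk API.

```python
def files_sent(scenario: str, target_searchable: bool) -> set:
    """Return the bucket files replicated to a target peer.

    scenario: "normal" (streaming as data is indexed) or
              "recovery" (fix-up after a peer failure).
    """
    if scenario == "normal":
        # Only rawdata is streamed; a searchable target builds its own tsidx.
        return {"rawdata"}
    if scenario == "recovery":
        # Searchable copies get the existing tsidx as well as the rawdata.
        return {"rawdata", "tsidx"} if target_searchable else {"rawdata"}
    raise ValueError(f"unknown scenario: {scenario!r}")


# Site 2 with rf=2 sf=1 during recovery: one searchable copy, one rawdata-only copy.
for searchable in (True, False):
    print(searchable, sorted(files_sent("recovery", searchable)))
```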
