In a multi-site cluster Splunk replicates the data to the remote site, but does Splunk also replicate the index information, or is indexing left up to the remote-site indexer? Also, does Splunk replicate raw data or compressed data?
Excellent question.
> or is indexing left up to the remote site indexer?
It's left up to the remote-site indexer. With an RF of 3, the remote site does twice as much indexing as the source; overall, each node indexes an additional two replicated slices.
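For context, how much replication and indexing each peer does is driven by the replication and search factors set on the cluster master in server.conf. A minimal sketch (the values here are illustrative only):

    [clustering]
    mode = master
    # number of rawdata copies kept across the cluster
    replication_factor = 3
    # number of those copies that are made searchable,
    # i.e. how many peers build tsidx files for each bucket
    search_factor = 2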
>Also, does Splunk replicate raw data or compressed data?
The replication traffic is compressed if SSL is enabled and, on the indexers, the following settings in server.conf are set to true.
Under stanza
[replication_port-ssl://<port>]
useSSLCompression = <boolean>
* If true, enables SSL compression.
* Default: false
compressed = <boolean>
* DEPRECATED; use 'useSSLCompression' instead.
* Used only if 'useSSLCompression' is not set.
Under stanza
[sslConfig]
allowSslCompression = <boolean>
* If set to "true", the server allows clients to negotiate SSL-layer data compression.
* KV Store also observes this setting.
* If set to "false", KV Store disables TLS compression.
* Default: true
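Putting that together, a minimal sketch of what you might set in server.conf on each indexer to compress the replication stream over SSL (the port number is just an example, and this assumes SSL is already configured on the replication port):

    [replication_port-ssl://9887]
    # compress the SSL replication stream between peers
    useSSLCompression = true

    [sslConfig]
    # allow clients to negotiate SSL-layer compression
    allowSslCompression = true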
Thanks, but after a little further research, that seems inconsistent with https://docs.splunk.com/Documentation/Splunk/latest/Indexer/Howclusteredindexingworks. Of course that page does not specifically address multisite clusters, but I'd assume the same applies. What do you think?
++++++++++++++++++++++++++++++++++++++
These events occur when a peer node starts up:
1. The peer node registers with the master and receives the latest configuration bundle from the master.
Starting with Splunk 6, replication optimisations were added that copy tsidx files (if available) to save the effort of regenerating them. (Can't find the link, will update.)
The above is totally correct for new data; let me see if I can find some detail on the tsidx copy process. (AFK)
Found it,
https://www.splunk.com/en_us/blog/tips-and-tricks/clustering-optimizations-in-splunk-6.html
I'll update my answer to clarify. The behaviour differs depending on whether it's normal replication or replication following a recovery; my answer focused on the latter. Hopefully my edit will make sense.
Thanks. I can see the benefits and pitfalls of both ways.
So if the data is replicated to the remote site (target) to be indexed, assuming two sites, does one need to size the number of indexers based on the combined ingestion rate at both sites?
Technically, the answer is "yes". It's a good idea to make sure all your indexers have comparable specs across sites; however, the number of indexers at each site can vary depending on that site's ingest rate.
Depending on how many search users Site 2 has and/or whether you are using search affinity, you can set your SF for remote sites accordingly. There is no point keeping high SF values at a remote site if you never search Site 1 data from Site 2, and lowering them helps reduce the burden.
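As an illustration, per-site search factors can be set explicitly on the cluster master so a remote site keeps fewer searchable copies. The sketch below uses made-up site names and counts; the exact numbers depend on your sizing:

    [clustering]
    mode = master
    multisite = true
    available_sites = site1,site2
    site_replication_factor = origin:1, site1:2, site2:2, total:4
    # fewer searchable copies at site2, since its users rarely search Site 1 data
    site_search_factor = origin:1, site1:2, site2:1, total:3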
Two sites, RF=2 SF=2 + Splunk SmartStore. Design for 3TB/day at one site and 2TB/day at the second site (5TB/day total). Both sites will search across all data.
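For a design like that, a rough sketch of the cluster master's server.conf plus the SmartStore remote volume in indexes.conf might look like the following (the volume name and S3 path are placeholders, and the remote.s3.* credential settings are omitted):

    # server.conf on the cluster master
    [clustering]
    mode = master
    multisite = true
    available_sites = site1,site2
    site_replication_factor = origin:1, total:2
    site_search_factor = origin:1, total:2

    # indexes.conf pushed to the peers
    [volume:remote_store]
    storageType = remote
    path = s3://example-smartstore-bucket

    [default]
    remotePath = volume:remote_store/$_index_name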
There are two scenarios in which replication occurs.
Normal Replication
In normal circumstances Splunk will replicate data to cluster peers as it is indexed. It does this by streaming the rawdata to the peers according to the replication factor.
If a given peer is tasked with making a bucket searchable, it will begin the process of generating the tsidx file for that bucket.
Failure/Recovery Replication
In the event that a peer is replicating data to peers in order to meet SF/RF following a member failure, Splunk will replicate the rawdata for replicated buckets, and both the rawdata & tsidx file for searchable buckets.
This means that if your Site 2 cluster of 3 peers has rf=2 sf=1, then, per bucket, one peer will receive both the rawdata and the index (tsidx) and one peer will receive just the rawdata.
In all cases, the raw data is replicated from the journal file, which is compressed.
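To make the rawdata-vs-index distinction concrete, here is roughly what the two kinds of bucket copies look like on disk (the bucket directory names and tsidx filenames below are made up for illustration):

    # origin (searchable) copy: compressed journal plus index files
    db_1520000000_1519990000_12_3F2504E0-4F89-11D3-9A0C-0305E82C3301/
        rawdata/journal.gz
        1520000000-1519990000-678912345.tsidx
        Hosts.data  Sources.data  SourceTypes.data  Strings.data

    # replicated copy that has not been made searchable: rawdata only
    rb_1520000000_1519990000_12_3F2504E0-4F89-11D3-9A0C-0305E82C3301/
        rawdata/journal.gz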