Deployment Architecture

How does a Splunk cluster replicate data across data centers?

xchang1226
Path Finder

Does anyone know how splunk cluster does replication? Specifically, synchronously or asynchronously?

The reason I am asking is that we are trying to do multi-site clustering across 2 data centers, so that it provides not only HA but also a DR solution for us. The network connection between our 2 data centers is pretty fast, but obviously there is still some latency. So if replication is synchronous, writes will not commit until both sides have the same data, and that will affect performance on the indexers.

Has anyone implemented clustering across data centers who can share their experience? Thanks.

1 Solution

jrodman
Splunk Employee

Splunk does not neatly fit into these two words.

When data arrives at an indexer in a streaming fashion, the indexer processes it and writes the processed data to disk, also in a streaming fashion, while simultaneously streaming the processed form of that data to N other indexers. All of these indexers commit the data to disk continuously. The streaming target indexers respond with information about which of the items they've been sent have been fully committed, and once the initial indexer has fully committed the data and received commit notifications from its peers, it can tell the data provider that the data has been committed.
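The flow above can be sketched roughly as follows. This is a toy model, not Splunk's implementation: the class and method names are hypothetical, and threads stand in for the continuous parallel streaming that real peers do.

```python
# Toy sketch of the ack flow described above: the origin indexer commits
# locally, streams copies to its peers, and acknowledges the data provider
# only after its own commit AND every peer's commit are confirmed.
import threading

class Peer:
    """A replication target that commits streamed data to its own store."""
    def __init__(self, name):
        self.name = name
        self.committed = []

    def receive_and_commit(self, chunk):
        # In the real system this is a continuous streaming write, not a
        # blocking transaction; the peer reports which items are on disk.
        self.committed.append(chunk)
        return True  # commit confirmation sent back to the origin indexer

class OriginIndexer:
    def __init__(self, peers):
        self.peers = peers
        self.committed = []

    def index(self, chunk):
        """Commit locally, stream to peers in parallel, and ack the data
        provider only once every copy is confirmed committed."""
        self.committed.append(chunk)  # local streaming commit
        acks = []
        threads = [
            threading.Thread(
                target=lambda p=p: acks.append(p.receive_and_commit(chunk)))
            for p in self.peers
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return all(acks)  # the ack sent back to the forwarder

peers = [Peer("peer1"), Peer("peer2")]
origin = OriginIndexer(peers)
assert origin.index("event batch 1") is True
```

The key point the sketch illustrates is that there is no global transaction lock: each indexer commits continuously on its own, and the acknowledgment is just a report that all copies happen to be durable.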

This works because we don't have a single integrated structure (conceptually) like an RDBMS. There are no updates to our database. We just build small (10GB) quanta of duplicated containers in parallel. If there is error recovery, the total I/Os may be higher, but indexing can still proceed continuously without any part of it slowing past I/O limitations.
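The "small quanta of duplicated containers" idea can be sketched like this, under assumed names (the real containers are Splunk buckets; the size limit and rolling logic here are simplified for illustration):

```python
# Hypothetical sketch: instead of one integrated database, events append
# to a current container, which is sealed ("rolled") at a size limit.
# Sealed containers are never updated, so error recovery can simply
# re-replicate a whole container without coordinating in-place writes.
BUCKET_LIMIT = 10 * 1024**3  # ~10GB per container, per the post

class BucketStore:
    def __init__(self, limit=BUCKET_LIMIT):
        self.limit = limit
        self.sealed = []        # immutable, fully committed containers
        self.current = []       # the container still being written
        self.current_size = 0

    def append(self, event: bytes):
        self.current.append(event)
        self.current_size += len(event)
        if self.current_size >= self.limit:
            self.roll()

    def roll(self):
        # Seal the current container and start a fresh one.
        self.sealed.append(self.current)
        self.current, self.current_size = [], 0

store = BucketStore(limit=10)   # tiny limit just for demonstration
for e in [b"abcde", b"fghij", b"klmno"]:
    store.append(e)
assert len(store.sealed) == 1   # first two events filled one container
```

Because sealed containers are append-once and read-only, replicating one is a bulk copy rather than a transactional merge, which is why recovery costs extra I/O but never stalls ongoing indexing.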

This design, not being based around a transaction->commit model, means that you can end up with data duplication in some error conditions (e.g., the data provider is talking to an indexer that crashes, so it re-sends the data elsewhere). However, that is a fundamental problem, because the data provider is not a database application using a transaction API, but rather an application that wrote to some log files it will rotate out of existence without asking us, or an application that sent some bytes over a socket and is done with them. There are partial solutions to that, but they are performance-prohibitive at the data volumes involved, and nothing will make the data in a syslog-TCP socket or in the TCP buffers safe across a crash.

So it's a different sort of solution for a different kind of data flow. The short version is that replication does not require any sort of stalling to synchronize, but oversaturated disk or network links are concerns.

View solution in original post


jcspigler2010
Path Finder

Jrodman

Sorry to dig up a 3-year-old question, but I wanted to expand it a little. I understand the streaming of data from forwarder (FWDr) to indexer (IDXr), the streaming of data from the IDXr of origin to N IDXrs, and that the commit is reflected back to the forwarder. I'm assuming this is the scenario if you have useACK = true. But from a searching perspective, if I am running a real-time search on the IDXr of origin, will I only see the data after the commits have happened at all participating N IDXrs, or after the streamed data has made its way through the data pipeline? I would assume it's the latter of the two, but wanted to make sure. This question came up with one of my customers where we're setting up a multi-site cluster.
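(For reference, the useACK setting mentioned above is configured in outputs.conf on the forwarder. A minimal example, with placeholder group name and hostnames:)

```ini
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
# idx1/idx2 hostnames are placeholders for your indexer cluster peers
server = idx1.example.com:9997, idx2.example.com:9997
useACK = true
```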

Thanks!
