DISTRIBUTED SEARCH: performance -vs- highavailability

wmosher · 05-11-2011

We'd like to do a distributed search setup, but it doesn't look like we'll be able to afford a second cluster of search peers for redundancy. If I understand things right (which may very well not be the case), this leaves me with two potential options (for simplicity's sake, we'll assume two search peers):

OPTION 1: Send only half the data (by forwarder config or through load-balancing) to each search peer (indexer).

ASSUMED CON: If a search peer goes down half the data is left out of any search.

OPTION 2: Send all data to both search peers (indexers).

ASSUMED CON: Search performance decreased because each has to index double the data.
ASSUMED CON: Double the disk space needed for desired retention.
ASSUMED CON: Must load-balance searches or add dedup to every search.

Knowing full well that only I can answer this: which option is worse? Hopefully someone can tell me I just don't understand the intentions of distributed search, or that there is some other solution.

Thanks.

gkanapathy · 05-11-2011

Option 2 is not really something I'd recommend with only two servers unless your requirements are very particular. Given what it does, you should not use distributed search with it: both indexers hold the same data, so you can simply load-balance the UI between them. Other than that, you should note that the second option also requires double the license volume, since you're indexing once, forwarding, and indexing again.

Using dedup is not the right answer, since, first, it will massively slow down every search, and second, you won't be able to tell if there legitimately are two identical entries.
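To make the second point concrete, here is a minimal Python sketch, not Splunk itself. It assumes a toy model in which each event is a plain string (think identical timestamp plus raw text), both peers hold every event, and dedup keeps the first occurrence of each distinct string. The event texts and the dedup helper are hypothetical.

```python
# Hypothetical events; the two identical "disk full" lines are legitimate repeats.
events = ["login alice", "disk full", "disk full", "logout alice"]

# Under option 2 both peers index every event, so a search across
# both peers sees each event twice.
combined = events + events


def dedup(results):
    """Keep only the first occurrence of each distinct event (toy dedup)."""
    seen = set()
    kept = []
    for event in results:
        if event not in seen:
            seen.add(event)
            kept.append(event)
    return kept


deduped = dedup(combined)
print(len(events), len(combined), len(deduped))
```

This prints 4 8 3: dedup removes the mirrored copies, but it also collapses the two genuine "disk full" events into one, so the result no longer matches what was actually logged.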

Option two is expensive and also ineffective when you only have two servers anyway. You must choose one of two sub-options in this case:

When the secondary goes down, the primary will stop (after filling its queue) and block until it can continue sending data. The problem here is that you will have worse availability for indexing, though you will always have two copies of any indexed data, and once indexing resumes, your copies will be in sync.
When the secondary goes down, the primary keeps on indexing. The problem here is that the two indexers will now be out of sync, and so search results will differ depending on which machine you go to.
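The two sub-options can be sketched with a toy simulation in Python. This is only an illustration under stated assumptions, not Splunk's actual forwarding logic: the primary receives one event per tick, copies for the secondary wait in a bounded queue while it is down, and the function name, queue size, and outage window are all hypothetical.

```python
def simulate(ticks, down, queue_size, block_when_full):
    """Return (primary_indexed, secondary_indexed, stalled_ticks).

    down: set of ticks during which the secondary is unreachable.
    block_when_full: True for the first sub-option (primary stops),
                     False for the second (primary keeps indexing).
    """
    primary = secondary = queue = stalled = 0
    for t in range(ticks):
        up = t not in down
        if up:
            secondary += queue      # secondary is reachable: drain the backlog
            queue = 0
            primary += 1
            secondary += 1
        elif queue < queue_size:
            primary += 1            # index locally, queue the copy
            queue += 1
        elif block_when_full:
            stalled += 1            # queue full: primary stops indexing
        else:
            primary += 1            # queue full: index locally, copy is dropped
    return primary, secondary, stalled


outage = set(range(5, 15))          # hypothetical 10-tick outage
print(simulate(20, outage, 3, block_when_full=True))
print(simulate(20, outage, 3, block_when_full=False))
```

With these made-up numbers the blocking run returns (13, 13, 7): the copies match, but indexing was unavailable for 7 ticks. The non-blocking run returns (20, 13, 0): indexing never stopped, but the secondary is missing 7 events.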

Not knowing anything at all about what your business requirements are, I would generally first suggest option 1 in combination with RAID disks and regular backups of the data. Indexing onto highly-available networked storage is also an option, and would allow you to remount a volume in case of server failure in lieu of restoring from backup, though this doesn't help you if the controller corrupts the data or someone accidentally "rm"s the index.

wmosher · 05-12-2011

Thanks gkanapathy. Good point about the license. I was just using two as an example, we are pushing for four. Since we value performance over availability (at least until it goes down, LOL) I'm assuming a single cluster of four search peers is better than two mirrored clusters of two search peers.
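As a rough back-of-the-envelope comparison of those two four-server layouts, here is a small Python sketch. The 100 GB/day volume and 90-day retention are made-up figures, data is assumed to be split evenly, disk is counted as raw volume with no compression, and the mirrored layout assumes searches are pointed at whichever copy is intact.

```python
from dataclasses import dataclass


@dataclass
class Layout:
    name: str
    peers: int    # total number of indexers
    copies: int   # how many times each event is indexed


def summarize(layout, daily_gb, retention_days):
    """Estimate per-peer load, license, disk, and exposure to one peer failing."""
    peers_per_copy = layout.peers // layout.copies
    return {
        "gb_indexed_per_peer_per_day": daily_gb / peers_per_copy,
        "license_gb_per_day": daily_gb * layout.copies,
        "disk_gb_for_retention": daily_gb * layout.copies * retention_days,
        # With a single copy, losing one peer hides that peer's share of the data.
        "fraction_unsearchable_if_one_peer_down":
            1 / layout.peers if layout.copies == 1 else 0.0,
    }


single = summarize(Layout("one cluster of four", 4, 1), daily_gb=100, retention_days=90)
mirrored = summarize(Layout("two mirrored clusters of two", 4, 2), daily_gb=100, retention_days=90)
print(single)
print(mirrored)
```

Under these assumptions the single cluster indexes 25 GB per peer per day against a 100 GB license and loses a quarter of the data from searches if one peer fails. The mirrored layout indexes 50 GB per peer per day, needs a 200 GB license and twice the disk, and keeps everything searchable.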

Under option one (splitting the data among the search peers), is this best managed by load-balancing or by deliberately dividing up where forwarders send their data?
