
Do you think Splunk could scale to 1 petabyte a day?

oreoshake
Communicator


What is the amount indexed by the largest installation out there?


gkanapathy
Splunk Employee

A petabyte is more than 1000 terabytes. Indexing several terabytes a day is challenging but doable; a thousand of them is quite different. Even if I'm being generous (and assuming the data input problems are solved), the indexing alone would require more than 2000 indexer nodes. A more realistic number is probably closer to 10,000 nodes.
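To make the node counts above concrete, here is a rough back-of-the-envelope sketch in Python. It only rearranges the figures quoted in this answer: the function names are hypothetical, decimal units are assumed (1 PB = 1,000,000 GB), and the per-indexer rates are implied by the 2000 and 10,000 node estimates rather than being measured Splunk numbers.

```python
import math

# Assumption: decimal units, so 1 PB/day = 1,000,000 GB/day.
DAILY_VOLUME_GB = 1_000_000


def implied_per_indexer_rate(daily_volume_gb, node_count):
    """GB/day each indexer must absorb if the load is spread evenly."""
    return daily_volume_gb / node_count


def indexers_needed(daily_volume_gb, per_indexer_gb_per_day):
    """Smallest whole number of indexers that covers the daily volume."""
    return math.ceil(daily_volume_gb / per_indexer_gb_per_day)


if __name__ == "__main__":
    # The "generous" and "realistic" node counts from the answer above.
    for nodes in (2000, 10_000):
        rate = implied_per_indexer_rate(DAILY_VOLUME_GB, nodes)
        print(f"{nodes} nodes -> {rate:.0f} GB/day per indexer")
```

Read the other way round, `indexers_needed(DAILY_VOLUME_GB, 100)` gives 10,000, so the "realistic" estimate corresponds to each indexer handling roughly 100 GB/day.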

So I would say no, not in practice, not today, not without a few (proposed and probably coming someday) architectural enhancements. Problems that would have to be addressed include:

  • Node reliability. There will always be one or more nodes down at any one time at this scale. If you don't care too much about completeness or absolute correctness of results, this might not be a huge obstacle.
  • Space. I guess not really a Splunk problem, but while 1 petabyte sounds doable, that's just one day of data. Is that all you want online?
  • Network scale-out. I doubt the current (flat) distributed search architecture will be performant or reliable when working over X thousands of nodes. (Especially if you must assume that some number of them will be unreachable for whatever reason.)
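The node reliability point can be illustrated with a small sketch. The 99.9% per-node availability below is purely an assumed figure for illustration (it does not come from this thread), failures are assumed to be independent, and the function names are hypothetical.

```python
def expected_nodes_down(node_count, availability):
    """Average number of nodes unavailable at any given moment."""
    return node_count * (1.0 - availability)


def prob_any_node_down(node_count, availability):
    """Probability that at least one node is down, assuming independent failures."""
    return 1.0 - availability ** node_count


if __name__ == "__main__":
    ASSUMED_AVAILABILITY = 0.999  # illustrative assumption, not a Splunk figure
    for nodes in (2000, 10_000):
        down = expected_nodes_down(nodes, ASSUMED_AVAILABILITY)
        p_any = prob_any_node_down(nodes, ASSUMED_AVAILABILITY)
        print(f"{nodes} nodes: ~{down:.0f} down on average, "
              f"P(at least one down) = {p_any:.4f}")
```

Under these assumptions a search across thousands of indexers should expect some peers to be unreachable almost every time it runs, which is why completeness of results becomes a design question at this scale.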

That's a start.

dskillman
Splunk Employee

I would say maybe, with reservations.

  1. The number of boxes you would need would be somewhere on the order of 5000-10000 dual quad-core machines. Maybe with super high-end machines you could get that number into the low thousands.

  2. Your storage would probably have to come from local disk rather than SAN, seeing as you would need a separate SAN for every 4-5 days of storage. (Compression of ~50% of raw + the index.)

  3. I don't think the current distributed search architecture would like the number of indexers you would have to connect to in order to run a search across all boxes, but we could theoretically have multiple tiers.
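The storage point in item 2 can be sketched as simple arithmetic. This assumes one reading of the compression note, namely that the total on-disk footprint (compressed raw data plus index) is about 50% of the raw volume. The function name and the 90-day retention example are hypothetical.

```python
def stored_pb(raw_pb_per_day, retention_days, on_disk_ratio=0.5):
    """Approximate on-disk footprint in PB for a given retention period.

    on_disk_ratio is an assumption: compressed raw data plus index
    together taking about 50% of the raw volume.
    """
    return raw_pb_per_day * retention_days * on_disk_ratio


if __name__ == "__main__":
    # 4-5 days is the window item 2 associates with a single SAN;
    # 90 days is an arbitrary longer retention for comparison.
    for days in (4, 5, 90):
        print(f"{days} days of retention -> {stored_pb(1, days):.1f} PB on disk")
```

Under that assumption, 4-5 days of data comes to roughly 2-2.5 PB on disk, and the total grows linearly with retention, which is why the retention question below matters so much.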

A couple of other questions would obviously come up and help with even getting an idea of feasibility:

How long would you want to store the data?

How fast would you want results?

Do you have to keep all of the data, or could a big chunk of the data be summarized?

What type of searches do you think you would use? Reports? Needle in a haystack?

This would be a crazy hard problem to solve, but I think Splunk would have a decent chance of making it happen, depending on the use cases and access to a serious amount of hardware. It would be fun to test. Got a lab we can use? 🙂
