
Do you think Splunk could scale to 1 petabyte a day?



What is the amount indexed by the largest installation out there?


Splunk Employee

A petabyte is roughly 1,000 terabytes. Several terabytes a day is challenging but doable; a thousand of them is a very different matter. Even if I'm being generous (and assuming the data input problems are solved), the indexing alone would require more than 2,000 indexer nodes. A more realistic number is probably closer to 10,000 nodes.
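To make those node counts concrete, here is a minimal sketch of the per-indexer daily volume each estimate implies. It assumes 1 PB is taken as 1,000,000 GB (decimal units) and that load is spread evenly across nodes; the function name is illustrative only.

```python
# Back-of-the-envelope: what daily volume per indexer do the node counts imply?
# Assumption: 1 PB = 1,000,000 GB (decimal), load spread evenly across nodes.
PETABYTE_GB = 1_000_000


def implied_gb_per_node_per_day(daily_volume_gb, node_count):
    """Daily indexing volume each node must absorb, in GB."""
    return daily_volume_gb / node_count


for nodes in (2_000, 10_000):
    per_node = implied_gb_per_node_per_day(PETABYTE_GB, nodes)
    print(f"{nodes:>6} nodes -> {per_node:.0f} GB/day per indexer")
```

In other words, the generous estimate corresponds to about 500 GB/day per indexer and the more realistic one to about 100 GB/day per indexer.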

So I would say no, not in practice, not today, not without a few (proposed and probably coming someday) architectural enhancements. Problems that would have to be addressed include:

  • Node reliability. There will always be one or more nodes down at any one time at this scale. If you don't care too much about completeness or absolute correctness of results, this might not be a huge obstacle.
  • Space. I guess not really a Splunk problem, but while 1 petabyte sounds doable, that's just one day of data. Is that all you want online?
  • Network scale-out. I doubt the current (flat) distributed search architecture would be performant or reliable when working over thousands of nodes. (Especially if you must assume that some number of them will be unreachable for whatever reason.)
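The node reliability point can be illustrated with a rough calculation. This is only a sketch: the 99.9% per-node availability is a made-up figure for illustration, and nodes are assumed to fail independently, which real clusters do not guarantee.

```python
# Rough illustration of why "some node is always down" at this scale.
# Assumptions (hypothetical): each node is independently available 99.9% of the time.
ASSUMED_AVAILABILITY = 0.999


def expected_nodes_down(node_count, availability):
    """Average number of nodes unavailable at any given moment."""
    return node_count * (1.0 - availability)


def prob_all_nodes_up(node_count, availability):
    """Probability that every node is reachable at the same moment."""
    return availability ** node_count


for nodes in (2_000, 10_000):
    down = expected_nodes_down(nodes, ASSUMED_AVAILABILITY)
    all_up = prob_all_nodes_up(nodes, ASSUMED_AVAILABILITY)
    print(f"{nodes:>6} nodes: ~{down:.0f} down on average, P(all up) = {all_up:.5f}")
```

Under these assumptions a 10,000-node deployment has about 10 nodes down on average, and the chance of a search reaching every node is well under 1%.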

That's a start.

Splunk Employee

I would say maybe with reservations.

  1. The number of boxes you would need would be somewhere on the order of 5,000 to 10,000 dual quad-core machines. Maybe with super high-end machines you could get that number into the low thousands.

  2. Your storage would probably have to come from local disk rather than SAN, since you would need a separate SAN for every 4-5 days of data. (This assumes an on-disk footprint of ~50% of the raw size, counting compressed raw data plus the index.)

  3. I don't think the current distributed search architecture would cope well with the number of indexers you would have to connect to in order to run a search across all boxes, but we could theoretically have multiple tiers.
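As a sketch of what "multiple tiers" might mean in practice, the snippet below counts how many levels of intermediate search nodes would sit above the indexers if no node were allowed to talk to more than a fixed number of children. Both the tiered design and the fan-out values are hypothetical; they are not documented Splunk limits or features.

```python
# Hypothetical tiered search tree: how many levels are needed above the indexers
# if each search node may connect to at most `fan_out` children?
def tiers_needed(indexer_count, fan_out):
    """Number of search-node levels above the indexers, ending in a single root."""
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2")
    tiers = 0
    nodes = indexer_count
    while nodes > 1:
        nodes = -(-nodes // fan_out)  # ceiling division: parents needed for this level
        tiers += 1
    return tiers


for fan_out in (50, 100):
    print(f"fan-out {fan_out}: {tiers_needed(10_000, fan_out)} tiers for 10,000 indexers")
```

With an assumed fan-out of 100, 10,000 indexers need two levels (100 intermediate nodes plus one top-level search head); at a fan-out of 50 a third level is required.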

A couple of other questions would obviously come up and help with even getting an idea of feasibility:

How long would you want to store the data?

How fast would you want results?

Do you have to keep all of the data, or could a big chunk of the data be summarized?

What type of searches do you think you would use? Reports? Needle in a haystack?
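The retention question drives storage directly. A minimal sketch, assuming 1 PB of raw data per day and the ~50% on-disk footprint mentioned above; the retention periods are examples only.

```python
# On-disk storage needed for a given retention period.
# Assumptions: 1 PB/day raw, on-disk footprint ~50% of raw (compressed raw + index).
def on_disk_pb(raw_pb_per_day, retention_days, disk_ratio=0.5):
    """Total on-disk storage in PB for the retention window."""
    return raw_pb_per_day * retention_days * disk_ratio


for days in (1, 5, 30, 90):
    print(f"{days:>3} days retained -> {on_disk_pb(1, days):.1f} PB on disk")
```

Five days of retention already comes to about 2.5 PB on disk, and 90 days to about 45 PB, before any replication or headroom.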

This would be a crazy hard problem to solve, but I think Splunk would have a decent chance of making it happen, depending on the use cases and access to a serious amount of hardware. It would be fun to test. Got a lab we can use? 🙂
