
Do you think splunk could scale to 1 petabyte a day?


What is the amount indexed by the largest installation out there?

Splunk Employee

A petabyte is roughly 1000 terabytes. Indexing several terabytes a day is challenging but doable; a thousand of them is quite different. Even if I'm being generous (and assuming the data input problems are solved), the indexing alone would require more than 2000 indexer nodes. A more realistic number is probably closer to 10,000 nodes.
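To put those node counts in perspective, here is a rough back-of-the-envelope sketch of the daily load each indexer would carry. It assumes decimal units (1 PB = 1,000,000 GB) and a perfectly even spread of data across nodes; the function name is just for illustration.

```python
# Rough per-indexer load for 1 PB/day, assuming data is spread evenly.
PETABYTE_IN_GB = 1_000_000  # assumption: decimal units, 1 PB = 1,000,000 GB


def per_node_gb_per_day(daily_volume_gb: float, indexer_count: int) -> float:
    """Daily indexing volume each node must absorb."""
    return daily_volume_gb / indexer_count


# The two node counts mentioned above: the generous and the realistic estimate.
for nodes in (2_000, 10_000):
    rate = per_node_gb_per_day(PETABYTE_IN_GB, nodes)
    print(f"{nodes:>6} indexers -> {rate:,.0f} GB/day per indexer")
```

So the generous estimate works out to about 500 GB/day per indexer, and the realistic one to about 100 GB/day per indexer.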

So I would say no, not in practice, not today, not without a few (proposed and probably coming someday) architectural enhancements. Problems that would have to be addressed include:

  • Node reliability. There will always be one or more nodes down at any one time at this scale. If you don't care too much about completeness or absolute correctness of results, this might not be a huge obstacle.
  • Space. I guess not really a Splunk problem, but while 1 petabyte sounds doable, that's just one day of data. Is that all you want online?
  • Network scale-out. I doubt the current (flat) distributed search architecture will be performant or reliable when working over X thousands of nodes. (Especially if you must assume that some number of them will be unreachable for whatever reason.)
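To illustrate the node reliability point, here is a minimal sketch of how often you should expect something to be down. The 99.9% per-node availability is a made-up figure for illustration, and the model assumes nodes fail independently, which real clusters often don't.

```python
# Sketch: how likely is it that some indexer is down at any given moment?
# Assumptions (not from any Splunk documentation): each node is up 99.9%
# of the time, and failures are independent of each other.

def expected_nodes_down(node_count: int, availability: float) -> float:
    """Average number of nodes unavailable at a random instant."""
    return node_count * (1.0 - availability)


def prob_any_node_down(node_count: int, availability: float) -> float:
    """Probability that at least one node is unavailable."""
    return 1.0 - availability ** node_count


HYPOTHETICAL_AVAILABILITY = 0.999

for nodes in (2_000, 10_000):
    down = expected_nodes_down(nodes, HYPOTHETICAL_AVAILABILITY)
    p_any = prob_any_node_down(nodes, HYPOTHETICAL_AVAILABILITY)
    print(f"{nodes:>6} nodes: ~{down:.0f} down on average, "
          f"P(at least one down) = {p_any:.3f}")
```

Under those assumptions, a search that touches every node will almost always hit at least one that is unreachable, which is why completeness of results becomes a design decision at this scale.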

That's a start.

Splunk Employee

I would say maybe, with reservations.

  1. The number of boxes you would need would be somewhere on the order of 5000-10000 dual quad-core machines. Maybe with super high-end machines you could get that number into the low thousands.

  2. Your storage would probably have to come from local disk rather than SAN, seeing as you would need a separate SAN for every 4-5 days of storage (compression of ~50% of raw + the index).

  3. I don't think the current distributed search architecture would like the number of indexers you would have to connect to in order to run a search across all boxes, but we could theoretically have multiple tiers.
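For the storage point, a quick sizing sketch may help. It reads the ~50% figure as "compressed raw data plus the index takes about half the raw volume on disk", which is my interpretation; the 30-day retention and the 5000-indexer count are hypothetical inputs chosen only to make the arithmetic concrete.

```python
# Sketch: total and per-indexer disk needed for a given retention period.
# Assumption: on-disk footprint (compressed raw + index) is ~50% of raw.

def disk_needed_pb(daily_raw_pb: float, retention_days: int,
                   disk_ratio: float = 0.5) -> float:
    """Total disk in PB needed to keep `retention_days` of data online."""
    return daily_raw_pb * disk_ratio * retention_days


RETENTION_DAYS = 30   # hypothetical retention target
INDEXERS = 5_000      # low end of the 5000-10000 range above

total_pb = disk_needed_pb(1.0, RETENTION_DAYS)
per_node_tb = total_pb * 1_000 / INDEXERS  # decimal units: 1 PB = 1000 TB

print(f"{RETENTION_DAYS} days online: {total_pb:.0f} PB total, "
      f"{per_node_tb:.0f} TB of local disk per indexer")
```

With those inputs you land at 15 PB total, or about 3 TB of local disk per indexer, which is far easier to provision per box than as shared SAN capacity.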

A couple of other questions would obviously come up and help with even getting an idea of feasibility:

How long would you want to store the data?

How fast would you want results?

Do you have to keep all of the data, or could a big chunk of the data be summarized?

What type of searches do you think you would use? Reports? Needle in a haystack?

This would be a crazy hard problem to solve, but I think Splunk would have a decent chance of making it happen, depending on the use cases and access to a serious amount of hardware. It would be fun to test. Got a lab we can use? 🙂
