Getting Data In

Do you think splunk could scale to 1 petabyte a day?

oreoshake
Communicator

What is the amount indexed by the largest installation out there?

gkanapathy
Splunk Employee

A petabyte is a thousand terabytes. Several terabytes a day is challenging but doable; a thousand of them is quite different. Even being generous (and assuming the data-input problems are solved), simply doing the indexing would require more than 2,000 indexer nodes. A more realistic number is probably closer to 10,000 nodes.
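The node counts above can be reproduced with simple ceiling division. The per-indexer daily ingest rates used here (500 GB/day as the generous case, 100 GB/day as the conservative one) are assumptions chosen to match the 2,000 and 10,000 figures in the answer, not official sizing numbers:

```python
# Back-of-envelope indexer count for 1 PB/day of ingest.
# Per-node rates are assumptions, not official Splunk sizing guidance.
PB_IN_GB = 1_000_000  # 1 petabyte expressed in gigabytes

def indexers_needed(daily_ingest_gb, per_node_gb_per_day):
    """Ceiling division: a partially loaded node still needs a whole machine."""
    return -(-daily_ingest_gb // per_node_gb_per_day)

generous = indexers_needed(PB_IN_GB, 500)      # optimistic 500 GB/day per node
conservative = indexers_needed(PB_IN_GB, 100)  # conservative 100 GB/day per node

print(generous)      # 2000
print(conservative)  # 10000
```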

So I would say no, not in practice, not today, not without a few (proposed and probably coming someday) architectural enhancements. Problems that would have to be addressed include:

  • Node reliability. There will always be one or more nodes down at any one time at this scale. If you don't care too much about completeness or absolute correctness of results, this might not be a huge obstacle.
  • Space. I guess not really a Splunk problem, but while 1 petabyte sounds doable, that's just one day of data. Is that all you want online?
  • Network scale-out. I doubt the current (flat) distributed search architecture will be performant or reliable when working over X thousands of nodes. (Especially if you must assume that some number of them will be unreachable for whatever reason.)

That's a start.

dskillman
Splunk Employee
Splunk Employee

I would say maybe with reservations.

  1. The number of boxes you would need would be somewhere on the order of 5,000-10,000 dual quad-core machines. Maybe with super-high-end machines you could get that number into the low thousands.

  2. Your storage would probably have to come from local disk rather than a SAN, seeing as you would need a separate SAN for every 4-5 days of storage (compression to ~50% of raw, plus the index).

  3. I don't think the current distributed search architecture would like the number of indexers you would have to connect to in order to run a search across all boxes, but we could theoretically use multiple tiers.
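The "separate SAN every 4-5 days" point in item 2 falls out of the arithmetic. This sketch uses the answer's rough figure that compressed raw plus the index lands around 50% of raw volume; the retention periods and the 2.5 PB SAN ceiling are illustrative assumptions:

```python
# Rough disk footprint for 1 PB/day of raw data, using the thread's
# ~50% on-disk figure (compressed raw + index). All inputs are estimates.
RAW_TB_PER_DAY = 1000   # 1 PB/day in terabytes
ON_DISK_RATIO = 0.5     # compressed raw plus index, per the answer

def storage_needed_tb(retention_days,
                      raw_tb_per_day=RAW_TB_PER_DAY,
                      ratio=ON_DISK_RATIO):
    """On-disk terabytes required to keep `retention_days` of data online."""
    return retention_days * raw_tb_per_day * ratio

print(storage_needed_tb(5))   # 2500.0 TB: a very large SAN filled every ~5 days
print(storage_needed_tb(30))  # 15000.0 TB (15 PB) for 30-day retention
```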

A couple of other questions would obviously come up and help with even getting an idea of feasibility:

How long would you want to store the data?

How fast would you want results?

Do you have to keep all of the data, or could a big chunk of the data be summarized?

What type of searches do you think you would use? Reports? Needle in a haystack?

This would be a crazy hard problem to solve, but I think Splunk would have a decent chance of making it happen, depending on the use cases and access to a serious amount of hardware. It would be fun to test. Got a lab we can use? 🙂
