Solved: How to check data size indexed in indexers per...

pacifikn · ‎04-21-2022

Greetings!!

1.a. I need to check data size indexed in indexers per day, per month and per year in GB?

1.b. what if the data ingested per day is 200GB/day, How do I calculate to know the storage that can store all the indexed data in 5 years? or one year? and month?

2- how to install and configure indexers to be functioning?

3- How to configure syslog in splunk instance to receive logs? i have already configured network devices to send logs into splunk instance? what other steps remaining to do to receive logs in indexer?

Kindly help me, Thank you in advance

pacifikn · ‎04-22-2022

Thank you so much for your kind help @PickleRick .

PickleRick · ‎04-21-2022

1a. There are several approaches to checking how much data you ingested but the easiest is probably the historical license report from the monitoring console.

1b. Typically you assume 15% for compressed raw data and 35% for index files, so 200GB/day of ingested data means about 100GB of storage used per day. Multiply that by the number of days you need and you have an approximate storage requirement. There are two caveats here, though. One is that if you're using accelerated data models, the indexed summaries use additional space. The other is that this calculation isn't adjusted for replication/search factors.

2. It's not something that can be explained in a short paragraph. See https://docs.splunk.com/Documentation/SplunkCloud/latest/SearchTutorial/InstallSplunk

Additionally, if you're planning on clustering your setup it gets even more complicated.

3. You can just listen on a network port (create a udp or tcp input) but that's only good for a small-scale installation (usually a lab one). For bigger setups you'd rather use an external syslog daemon, either writing to files which you'd ingest into splunk with a forwarder or sending to a HEC input.

As a general word of advice - production deployment of a splunk environment is not that easy, especially if you haven't done a single installation before. Try doing an installation in a lab/testing environment first to understand how it works.

pacifikn · ‎04-21-2022

Hello dear @PickleRick , thank you for your feedback,

But i didn't well understood, let me again explain well my question in details,

Actually I have a license of 200GB/per day?

but when I check is license used previous 30 days, the average is between 36GB and 54GB , means per day i cannot pass 60gb,

I also have 5 indexers and each has the space of 2T , BUT I WANT TO KEEP MY DATA FOR LONGTIME FOR 5YEARS, HOW DO I CALCULATE THIS TO GET THE STORAGE FIT THE 5 YEARS?which query or formula can i use?

HOW MANY Terabyte(TB) I WILL NEED TO KEEP MY DATA FOR 5YEARS in the scenario above,

Thank you in advance!

PickleRick · ‎04-21-2022

License usage is calculated from the cumulative size of all raw data being written to your indexes (not including the stash sourcetype and the _internal index, if I remember correctly).

So if your license report shows 36 or 54GB indexed during the day, it means that's how much data you got. And you still have plenty of "space", meaning that you can onboard more sources and you'll still be within your license quota.

As I said - the actual storage requirements are not that straightforward to calculate.

In the simplest case - if you have, let's say, 50GB per day of indexed raw data, it usually takes around 25GB of disk space per day (15% for compressed raw data, 35% for indexes), which means around 750GB per 30 days and so on. So if you want to keep your data for 5 years you'd need 5*365*25GB, which is roughly 45.6TB. Plus some overhead "just in case" and some space for accelerations.

But that's a case when you have a single copy of each event. If you want to have a search factor and replication factor of 2, which means that at every moment you have two searchable copies of each bucket, you'd need twice that amount (ideally with some more spare space for replication in case of an indexer failure).
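To make the arithmetic above concrete, here is a minimal sketch in Python. The function name and parameters are made up for illustration, the 15%/35% ratios are only the usual rules of thumb mentioned in this thread, and `copies` simply multiplies the result the way the "twice that amount" rule does. It ignores accelerations and spare space.

```python
def storage_needed_gb(daily_raw_gb, days, copies=1,
                      rawdata_ratio=0.15, index_ratio=0.35):
    """Rough disk estimate: raw volume x (15% + 35%) x days x copies."""
    # Disk used per day of ingested raw data (rule of thumb, not a guarantee).
    daily_disk_gb = daily_raw_gb * (rawdata_ratio + index_ratio)
    return daily_disk_gb * days * copies


five_years = 5 * 365  # days, ignoring leap years

single_copy = storage_needed_gb(50, five_years)
two_copies = storage_needed_gb(50, five_years, copies=2)

print(f"single copy: {single_copy / 1000:.1f} TB")
print(f"two searchable copies: {two_copies / 1000:.1f} TB")
```

For 50GB/day this gives roughly 45.6TB for a single copy and about twice that for two searchable copies, before any overhead is added.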

If you want other replication/search factors, it's getting more complicated.

And please don't shout 😉

pacifikn · ‎04-22-2022

Thank you so much @PickleRick for your kind support and response,

I got now the point, may you please share/guide me how to use query or other way i can use to check on GUI/CLI the total data indexed per day in GB after this exercise (15% for compressed raw data, 35% for indexes)?

Thank you in advance.

PickleRick · ‎04-22-2022

You can click under the chart on the small looking glass icon - it's the "open in search" action. It will pivot you to the search app where it will run the search responsible for populating the data for the chart.

It expands to

index=_internal 
[ `set_local_host`] source=*license_usage.log* type="RolloverSummary" earliest=-30d@d 
| eval _time=_time - 43200 
| bin _time span=1d 
| stats latest(b) AS b by slave, pool, _time 
| timechart span=1d sum(b) AS "volume" fixedrange=false 
| join type=outer _time 
[ search index=_internal 
[ `set_local_host`] source=*license_usage.log* type="RolloverSummary" earliest=-30d@d 
| eval _time=_time - 43200 
| bin _time span=1d 
| dedup _time stack 
| stats sum(stacksz) AS "stack size" by _time] 
| fields - _timediff 
| foreach * 
[ eval <<FIELD>>=round('<<FIELD>>'/1024/1024/1024, 3)]

But it shows how much data counts against your license quota.

If you want to see how much space on disk the index files take, just check the subdirectories of /opt/splunk/var/lib/splunk (but be careful not to delete anything; that's your splunk data; you've been warned!)

I think you still don't have a clear distinction about different kinds of data and calculations concerning them.

Let's assume you're receiving 100GB of data daily from your sources (via syslog inputs, monitor inputs and so on).

This 100GB of data is getting into your indexers and/or HFs which parse indexed fields from them, sometimes transform them and filter some events out.

For the sake of this example, let's assume in those 100GB you have 20 GB of events which come from some unimportant application which you don't care about and you don't need it indexed. So you have a props/transforms configuration redirecting those events to nullQueue instead of indexing them. Effectively you're left with 80GB of raw data to be written to indexes.

Indexers count this 80GB against your license quota and write it to the appropriate files. They write compressed raw events into the raw event files - for calculations it's usually assumed that this takes about 15% of the original events' size, but that can vary depending on the events' size and content. In our case those files should use about 0.15*80GB = 12GB of disk space. Additionally, indexers write metadata (token indexes, indexed fields and so on) to index files. Again - the size of those files can vary, but for a "typical case" we usually assume that it takes another 35% of the original raw data size. So in this case it's about 28GB.

So from your original 100GB of data, after filtering out 20GB of useless data, you're left with 80GB of events which are indexed and counted against your license. This 80GB of raw events should take about 40GB worth of files on the disk. So extending the "lifetime" of your data does not change your license requirements (but remember that you need an active license to be able to search your data), but your storage needs (for event storage) grow linearly with the retention time. 40GB of consumed storage per day means 280GB per week, about 1.2TB per month and so on.
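The same example can be written out as a small Python sketch. The function and key names are hypothetical, and the 15% and 35% ratios are the typical-case assumptions from this post; real ratios vary with the data.

```python
def disk_breakdown_gb(received_gb, filtered_out_gb,
                      rawdata_ratio=0.15, index_ratio=0.35):
    """Split one day's data into license volume and on-disk pieces."""
    # Events sent to nullQueue are neither indexed nor counted against the license.
    indexed_gb = received_gb - filtered_out_gb
    rawdata_gb = indexed_gb * rawdata_ratio      # compressed raw events
    index_files_gb = indexed_gb * index_ratio    # index/metadata files
    return {
        "license_gb": indexed_gb,
        "rawdata_gb": rawdata_gb,
        "index_files_gb": index_files_gb,
        "disk_gb": rawdata_gb + index_files_gb,
    }


day = disk_breakdown_gb(received_gb=100, filtered_out_gb=20)
for name, value in day.items():
    print(f"{name}: {value:.0f} GB")

# Storage grows linearly with how long the data is kept.
print(f"disk per week: {day['disk_gb'] * 7:.0f} GB")
print(f"disk per 30 days: {day['disk_gb'] * 30 / 1000:.1f} TB")
```

Note that `license_gb` stays the same no matter how long you keep the data; only the disk figures scale with retention.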

Your current license shows indeed that you're consuming about half of your license or so. So you still have "license room" to ingest more data without incurring license violation.

As I said, additional space will be needed for other functionality if you're using it, but that varies greatly depending on how many accelerated summaries you use, what period they cover and so on. So it's not that easy to approximate.

The space calculation presented above covers all data processed by your indexers, so if you have just an all-in-one installation you'd have to have all those terabytes in a single machine. If you have a 5-node cluster you'd need about 1/5 of this space on each node of the cluster, as long as your inputs are set up properly and the data is distributed evenly across the nodes. Otherwise you might end up with some indexers clogged with data while others are left empty.
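As a rough illustration of the per-node split, here is a short Python sketch. The function names are made up, and it assumes perfectly even distribution, that "2T" means 2000GB, and that the whole disk is usable for index data (in practice it won't be, since the same filesystem holds other splunk data and needs free space).

```python
def per_node_gb(total_disk_gb, nodes):
    """Share of the total each indexer holds if data is spread evenly."""
    return total_disk_gb / nodes


def days_that_fit(node_capacity_gb, nodes, daily_disk_gb, copies=1):
    """Days of retention that fit into the combined capacity of all nodes."""
    return (node_capacity_gb * nodes) / (daily_disk_gb * copies)


# 5 years at 25GB of disk per day (5*365*25GB), spread over 5 indexers.
print(f"per node for 5 years: {per_node_gb(5 * 365 * 25, 5):.0f} GB")

# What 5 indexers with 2000GB each could hold at 25GB of disk per day.
print(f"single copy: {days_that_fit(2000, 5, 25):.0f} days")
print(f"two copies:  {days_that_fit(2000, 5, 25, copies=2):.0f} days")
```

Under these assumptions the existing 5 x 2TB would hold a bit over a year of single-copy data, which is far short of 5 years.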

And then there is the replication factor issue. https://docs.splunk.com/Documentation/Splunk/8.2.6/Indexer/Thereplicationfactor Long story short - if you want to have two separate copies of each bucket of events which you can use for searches, and want to be protected against a single indexer going down, you need twice as much storage space (bigger indexers or more of them). You don't, however, need more license. Regardless of your replication factor, the license counts how much data you're getting into the indexes, not how many copies of this data you end up having. There are additional factors for calculating the storage needed for various replication/search factors, so treat this as a simplified approximation.

I hope it's a bit clearer now 🙂

pacifikn · ‎04-22-2022

Thank you again @PickleRick for your quick response,

- Is there any query that can show me the final volume of data indexed in indexers per day in GB? * in CLI / GUI?

-Thank you in advance

PickleRick · ‎04-22-2022

Depending on what you mean by "final volume".

The search I showed you and - in general - the reports in the license report show you how much data was indexed in terms of license usage (that's the 80GB from my example).

But if you want to see how much data is consumed on disk per each day... well, it's not that easy.

Firstly, you'd need to know where your indexes are stored. Typically they are in /opt/splunk/var/lib/splunk (that directory also contains some other splunk internal data) but you can have different paths for various indexes and even have different storages for different types of buckets (hot/warm, cold, optionally frozen).

You can list your paths (you'd need to do it on each indexer separately) using:

splunk btool indexes list | grep 'Path ='

But the output of this command might include not only the $SPLUNK_DB variable (which is relatively easy to resolve) but also volume identifiers, if you use multiple volumes for storage, and $_index_name placeholders.

So it's a bit messy.
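One possible way to tidy that output up is sketched below in Python. It is only an illustration: the function name is made up, it assumes $SPLUNK_DB points at the typical /opt/splunk/var/lib/splunk location, and it simply sets aside volume: references and $_index_name placeholders instead of resolving them.

```python
import re


def resolve_index_paths(btool_output, splunk_db="/opt/splunk/var/lib/splunk"):
    """Return (resolved directories, skipped values) from 'Path =' lines."""
    resolved, skipped = set(), []
    for line in btool_output.splitlines():
        # Matches settings such as homePath, coldPath, thawedPath, tstatsHomePath.
        match = re.match(r"\s*(\w*Path)\s*=\s*(\S+)", line)
        if not match:
            continue
        value = match.group(2)
        if value.startswith("volume:") or "$_index_name" in value:
            skipped.append(value)  # needs the volume definition or an index name
            continue
        resolved.add(value.replace("$SPLUNK_DB", splunk_db))
    return sorted(resolved), skipped


sample = """\
coldPath = $SPLUNK_DB/defaultdb/colddb
homePath = $SPLUNK_DB/defaultdb/db
thawedPath = $SPLUNK_DB/defaultdb/thaweddb
tstatsHomePath = volume:_splunk_summaries/defaultdb/datamodel_summary
"""

paths, unresolved = resolve_index_paths(sample)
for path in paths:
    print(path)
print("not resolved:", unresolved)
```

The resolved directories could then be measured one by one with du, while the skipped entries still need to be looked up by hand.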

In a typical case it's enough to just do a

du -ksx /opt/splunk/var/lib/splunk

to get the number of kilobytes used by splunk data (not just indexes).

Then you'd have to check it every day to see the difference.
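If you record the du result once a day, the differences are easy to compute. A minimal Python sketch, assuming you have collected the readings yourself; the function names are made up and the readings below are invented values, not real measurements.

```python
def parse_du_kb(du_line):
    """Extract the kilobyte count from a 'du -ksx' output line."""
    return int(du_line.split()[0])


def daily_growth_gb(snapshots):
    """snapshots: list of (label, kilobytes) in chronological order."""
    growth = []
    for (_, prev_kb), (label, curr_kb) in zip(snapshots, snapshots[1:]):
        # du -k reports 1024-byte units, so divide twice by 1024 to get GB.
        growth.append((label, (curr_kb - prev_kb) / 1024 / 1024))
    return growth


# Hypothetical readings, one per day, taken at the same time of day.
readings = [
    ("day 1", 40_000_000),
    ("day 2", 44_194_304),
    ("day 3", 48_388_608),
]

for label, gb in daily_growth_gb(readings):
    print(f"{label}: {gb:+.1f} GB since previous reading")
```

A negative or flat value doesn't have to mean that nothing was indexed; it can also mean that old data was removed in the meantime.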

You could try to calculate sizes of single buckets and try to "align" them to days but it's complicated.

pacifikn · ‎04-22-2022

Thank you so much,

Actually i wanted to see how much data is consumed on disk per each day.

splunk btool indexes list | grep 'Path ='

The above command , which Path you mean here?

the second command gives me this in one indexer:

[root@Splunkidx1 splunk]# du -ksx /opt/splunk/var/lib/splunk
47263660 /opt/splunk/var/lib/splunk
[root@Splunkidx1 splunk]#

Is the 47263660 , is the data consumed per day? if not how can i see the amount of data, disk consumed per day?

- I want to check all these and know , how much other storage to buy , so that i can be able to store more data for longtime?

-Thank you in advance.

PickleRick · ‎04-22-2022

I meant grepping for the literal 'Path =' string. This will give you a list of all paths used by all indexes. Like this (an excerpt from one of my test indexers):

[...]
tstatsHomePath = volume:_splunk_summaries/fishbucket/datamodel_summary
tstatsHomePath = volume:_splunk_summaries/$_index_name/datamodel_summary
coldPath = $SPLUNK_DB/historydb/colddb
homePath = $SPLUNK_DB/historydb/db
thawedPath = $SPLUNK_DB/historydb/thaweddb
tstatsHomePath = volume:_splunk_summaries/historydb/datamodel_summary
coldPath = $SPLUNK_DB/defaultdb/colddb
homePath = $SPLUNK_DB/defaultdb/db
thawedPath = $SPLUNK_DB/defaultdb/thaweddb
tstatsHomePath = volume:_splunk_summaries/defaultdb/datamodel_summary
coldPath = $SPLUNK_DB/splunklogger/colddb
homePath = $SPLUNK_DB/splunklogger/db
[...]

But as you can see - it's not just a clean list of paths and it needs some processing.

Anyway, any disk-based calculation would have to involve comparing the values over several days, because a simple du (regardless of whether it's used on the whole /opt/splunk/var/lib/splunk or on single paths) gives you disk usage at a given moment in time. It's a simple file sizing tool; it doesn't know anything about splunk or any other software installed on your system, so it cannot tell you anything about history. You'd have to take those measurements over several days and compare the values to see how much the usage grows.

There is one caveat though (as usual ;-)). The indexes have a limit of bucket lifetime and a limit of the index size. So if the events grow too old, they are being removed from an index. The same applies to the situation when the index grows so much it hits the size limit - the oldest bucket is deleted from the index even if it didn't reach its expiry period.

So if you have a relatively "stable" setup - running for some time already, the disk usage might not be growing anymore so you'd have to do the calculations another way but that's way beyond this topic.

Anyway, in your case the 47263660 is the number of kilobytes consumed by your splunk data at the moment of running the command. It includes indexes as well as some internal splunk data (but mostly indexes if we're talking about this order of magnitude). If I count the digits correctly, it's about 47GB, which - given that you have 5 indexers and assuming that the load is relatively balanced - means you should have about 240GB altogether. That means _probably_ about 480GB worth of raw data that you have indexed here. With daily consumption of 40GB it would be about 12 days of data. All that assumes you have replication factor=search factor=1 (you don't use replication). If you're using replication and have RF=SF=2, it means just 6 days of data.
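The back-of-the-envelope estimate above can be sketched in Python like this. The function name and parameters are made up for illustration; it treats 1,000,000 kilobytes as roughly 1GB (the same "count the digits" rounding), assumes disk usage is about 50% of the raw data, and assumes all indexers hold about the same amount.

```python
def estimate_days_on_disk(du_kb_per_indexer, indexers, daily_raw_gb,
                          disk_ratio=0.5, copies=1):
    """Very rough guess at how many days of data are currently on disk."""
    # Treat 1,000,000 KB as ~1GB; good enough for an order-of-magnitude check.
    total_disk_gb = du_kb_per_indexer / 1_000_000 * indexers
    # Undo the ~50% disk ratio and the number of copies kept in the cluster.
    raw_gb = total_disk_gb / disk_ratio / copies
    return raw_gb / daily_raw_gb


single = estimate_days_on_disk(47263660, indexers=5, daily_raw_gb=40)
replicated = estimate_days_on_disk(47263660, indexers=5, daily_raw_gb=40, copies=2)

print(f"RF=SF=1: about {single:.0f} days of data")
print(f"RF=SF=2: about {replicated:.0f} days of data")
```

Since the du figure also contains non-index data and the ratios are only typical values, the result is an indication at best.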

This topic is getting more and more complicated and we're getting into a lot of detail. Honestly, if you're going to buy several terabytes worth of storage (or deploy more indexers), I'd advise you to spend a relatively small amount of money in comparison to this deployment and get help from Professional Services or a Splunk Partner in your area, who will review your setup and configuration and advise you based on your particular case.
