Splunk Data Retention

Karthikeya
Communicator

We have a requirement to onboard a new platform's logs to Splunk. They expect to ingest 1.8 TB/day. Our current license is 2 TB/day, and we already have other platforms' data onboarded. The new team has agreed to uplift our license by a further 2 TB/day, so our total becomes 4 TB/day.

However, they told us that while their normal ingestion is 1.8 TB/day, during a DDoS attack it can reach double-digit TB/day. This surprised us. Our total license is only 4 TB/day, so how can we handle double-digit TB of data? This project might also impact the onboarding of other projects.

My manager asked me to investigate whether we can accommodate this requirement. If yes, he wants an action plan. If not, he wants a justification to share with them.

I am not very familiar with Splunk licensing, but as far as I know this is very risky, because the difference between 4 TB/day and 10-20 TB/day is huge.

My understanding is that if we breach 4 TB/day (say by 200 GB), new indexing stops but old data can still be searched.

Our infrastructure: a multisite cluster with 3 sites, 2 indexers in each (6 total), 3 search heads (one per site), 1 deployment server, 2 cluster managers (active and standby), and 1 deployer (which is also the license master).

Can anyone please advise on how to proceed?

Karthikeya
Communicator

@gcusello @isoutamo @PickleRick @richgalloway 

Update: here is what I recently found in our configuration:

indexes.conf in Cluster Manager:

[new_index]
homePath = volume:primary/$_index_name/db
coldPath = volume:primary/$_index_name/colddb
thawedPath = $SPLUNK_DB/$_index_name/thaweddb

volumes indexes.conf:

[volume:primary]
path = $SPLUNK_DB
#maxVolumeDataSizeMB = 6000000

There is one more app being pushed to the indexers with its own indexes.conf (I was not aware of this at all):

[default]
remotePath = volume:aws_s3_vol/$_index_name
maxDataSize = 750

[volume:aws_s3_vol]
storageType = remote
path = s3://conn-splunk-prod-smartstore/
remote.s3.auth_region = eu-west-1
remote.s3.bucket_name = conn-splunk-prod-smartstore
remote.s3.encryption = sse-kms
remote.s3.kms.key_id = XXXX
remote.s3.supports_versioning = false

So I believe we are using Splunk SmartStore to store our data. In that case, can we accommodate this project, which occasionally receives 20 TB of data per day? Please guide me.

PickleRick
SplunkTrust

As much as we're trying to be helpful here, this is something you should work on with your local friendly Splunk Partner.

As I said before, your environment already seems undersized in terms of number of indexers but you might have an unusual use case in which it would be enough. It doesn't seem to be enough for the 20TB/day peaks.

You also need to take into account the fact that when you ingest that amount of data you have to upload it to your S3 tenant. Depending on your whole infrastructure that might prove to be difficult when you hit those peaks.

But it's something to discuss in detail with someone at hand, with whom you can share all your requirements and limitations. We might have Splunk knowledge and expertise but from your company's point of view we're just a bunch of random people from the internet. And random people's "advice" is not something I'd base my business decisions on.

Yes, I know that consulting services tend to cost money but then again, failing to properly architect your environment might prove to be even more costly.

Karthikeya
Communicator

@PickleRick "You also need to take into account the fact that when you ingest that amount of data you have to upload it to your S3 tenant. Depending on your whole infrastructure that might prove to be difficult when you hit those peaks." ---> I assumed S3 storage is effectively unlimited, because the Monitoring Console (MC) shows the home path and cold path index storage as unlimited. Is that a wrong assumption?

PickleRick
SplunkTrust

I'm trespassing into territory that's a bit unknown to me (others have more experience with SmartStore, so they might correct me if I go wrong somewhere), but even if from Splunk's point of view the storage is "unlimited":

a) You might have limits on your S3 service

b) You will pay more if you use more data (that might not be a deal breaker for you but it's worth being aware of it)

c) You still need bandwidth to push it out of your local environment. If you don't have enough bandwidth you might clog your indexers because they will not be able to evict buckets from cache.
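
To put rough numbers on point c), here is a back-of-the-envelope sketch in Python. It assumes decimal units (1 TB = 10^12 bytes), uploads spread perfectly evenly over 24 hours, and a hypothetical on-disk ratio of 0.5 (bucket size relative to raw ingest). None of these figures were measured in this environment, so treat every parameter as a placeholder.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_upload_gbps(raw_tb_per_day: float, on_disk_ratio: float = 0.5) -> float:
    """Average uplink (Gbit/s) needed to upload one day's buckets within that day.

    on_disk_ratio is an assumed bucket-size-to-raw-ingest ratio, not a measured value.
    """
    bytes_per_day = raw_tb_per_day * 1e12 * on_disk_ratio
    bits_per_second = bytes_per_day * 8 / SECONDS_PER_DAY
    return bits_per_second / 1e9


if __name__ == "__main__":
    for tb in (4, 10, 20):
        print(f"{tb} TB/day raw -> ~{required_upload_gbps(tb):.2f} Gbit/s sustained")
```

This is only an average. A DDoS burst concentrated into a few hours would need proportionally more, and the same link also carries buckets being downloaded back for searches.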

isoutamo
SplunkTrust
I totally agree with @PickleRick that you need a local Splunk Partner or sales engineer to go through your current setup and how to proceed with it! To help you we would need much more information, and we would also have to see your whole environment and use cases and understand your business before making any real suggestions.
If/when you want to know more about SmartStore, I suggest that you join Splunk’s Slack and read and ask more there. https://splunkcommunity.slack.com/archives/CD6JNQ03F

Karthikeya
Communicator

@PickleRick Yes, I understand your point. I won't make decisions based on this thread, but I want to learn from the experts here because, as I said, I am still learning.

My understanding is that old data is stored in S3 once buckets roll from hot to warm. So I don't understand why 6 indexers are considered undersized, since the indexers don't keep the data anyway; in the end S3 holds 90% of it (even if 20 TB/day arrives occasionally). Are we talking about CPU, that is, whether the indexers can handle an unusual 20 TB in a day? What would the consequences be? I also believe the default index size of 500 GB will never fill up, because maxDataSize is set to 750 MB, which means a hot bucket that reaches 750 MB rolls to warm (and warm buckets live in S3)? Sorry if I have this wrong, but that's my understanding.

PickleRick
SplunkTrust

It's not only about data size (but still, remember that you have to cache the data somewhere; with 20TB/day, searches over two or three days might prove difficult to handle locally).

But it's also about processing power. You have a total of 6 indexers which might or might not have equally distributed ingestion load. Depending on your environment and load characteristics this is - in a normal case - somewhere between "slightly undersized" and "barely breathing". But again - your environment might be unusual, your equipment might be hugely oversized vs. the standardized specs and tuned to utilize the hardware power (although usually with indexers you'd rather go into horizontal scaling instead of pumping up the specs of individual indexers).

So there are many factors at play here. And I suppose you're not the one who'll be making the business decisions. 😉 The more reason to get something to cover your bottom.

Karthikeya
Communicator

"but still, remember that you have to cache the data somewhere." ---> What does "cache the data" mean? Can you please explain briefly?

"With 20TB/day searches over two-three days might prove to be difficult to handle locally." ---> Can't SmartStore help us here? Only the hot buckets are local, right? The remaining days are automatically rolled to S3, and we can search from there, no?

PickleRick
SplunkTrust

As it's already been said - with Smartstore it works like this (maybe oversimplifying a bit):

1) Splunk ingests data into hot bucket (and replicates this hot bucket live to replication peers).

2) The bucket is finalized and rolled to the cache storage.

3) Cache manager marks the bucket as queued for upload.

4) When the bucket gets uploaded to the remote storage it _can_ be evicted (removed from local cache storage).

But

5) Splunk needs local cache storage for buckets downloaded from remote storage from which it searches data.

So you have multiple mechanisms here.

1) Splunk does the indexing locally (the hot buckets are local).

2) Splunk must upload the bucket to the smartstore.

3) Splunk _only_ searches against locally cached data. There is no way to search from a bucket stored remotely. If Splunk needs to search from a bucket the cache manager must first fetch that bucket from remote storage to local cache.
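
The lifecycle above can be sketched as a tiny state machine. This is a toy Python model of the steps as described in this post, not Splunk's real implementation; the class, state, and method names are all made up for illustration.

```python
from enum import Enum


class State(Enum):
    HOT = "hot (local, still being written)"
    PENDING_UPLOAD = "warm (local cache, queued for upload)"
    CACHED = "warm (uploaded, still in local cache)"
    REMOTE_ONLY = "warm (uploaded, evicted from local cache)"


class Bucket:
    """Toy model of one bucket's life on a SmartStore-enabled index."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = State.HOT

    def roll(self) -> None:
        self._expect(State.HOT)
        self.state = State.PENDING_UPLOAD

    def upload(self) -> None:
        self._expect(State.PENDING_UPLOAD)
        self.state = State.CACHED

    def evict(self) -> None:
        # Only a bucket that already exists remotely may leave the local cache.
        self._expect(State.CACHED)
        self.state = State.REMOTE_ONLY

    def search(self) -> bool:
        """Search the bucket; return True if it had to be fetched first."""
        fetched = self.state is State.REMOTE_ONLY
        if fetched:
            self.state = State.CACHED  # downloaded back into the local cache
        return fetched

    def _expect(self, state: State) -> None:
        if self.state is not state:
            raise RuntimeError(f"{self.name}: not allowed while {self.state.value}")
```

The point to notice is that `evict()` refuses to run before `upload()`: if uploads fall behind during a peak, rolled buckets keep occupying local disk.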

 

Karthikeya
Communicator

This is what is present in that app's server.conf:

[cachemanager]
max_concurrent_uploads = 8
#eviction_policy = noevict

PickleRick
SplunkTrust

It's not only about the number of uploads. It's also about the bandwidth and data usage.

You can compare smartstore to swap space on your server (with a small difference that once a bucket is "swapped out" since it doesn't change it doesn't get re-uploaded again).

If your applications request more memory than your server has, some pages are getting swapped out by the kernel to free some physical memory. But CPU cannot interact with pages on disk so if you need to access data from those pages they have to be read back into physical memory. Since disk is usually way slower than RAM (ok, nowadays with NVMe storage those differences aren't as huge as they were even a few years back but still...) the kernel starts getting more and more occupied with juggling pages in and out and your system's load soars sky high and the whole system becomes unresponsive.

Same can happen with smartstore-enabled indexes. If the buckets are not yet uploaded, they occupy your "RAM". When you need to search from a bucket which is not present in the local cache (your "RAM"), cache manager has to fetch that bucket from smartstore which is relatively slow compared to reading it directly from disk. If another search requires another bucket, cache manager queues fetching another bucket. And so on.

If you need to access sufficiently many different buckets from smartstore-enabled indexes you may end up with a situation when a bucket is getting fetched to the local cache only to be read once and then immediately evicted and having to be re-read from remote storage next time it's needed. Cache manager might be using more sophisticated caching policies than simple FIFO (to be honest, I didn't dig that deep into this topic so I'm not sure if it's a simple LRU or something more sophisticated) but you can't beat physics and math. If you have only enough local storage for X buckets you can't use it to store 2X or 3X buckets. They simply won't fit.
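
The "fetched, read once, evicted" effect is easy to reproduce with a small simulation. The sketch below assumes a plain LRU cache and searches that sweep over the same set of equally sized buckets in the same order; the real cache manager's policy and your real search patterns may well differ, so this only illustrates the math.

```python
from collections import OrderedDict


def lru_hit_rate(cache_slots: int, distinct_buckets: int, rounds: int) -> float:
    """Fraction of bucket reads served from local cache under plain LRU."""
    cache = OrderedDict()  # bucket id -> None, oldest entry first
    hits = 0
    total = 0
    for _ in range(rounds):
        for bucket_id in range(distinct_buckets):
            total += 1
            if bucket_id in cache:
                hits += 1
                cache.move_to_end(bucket_id)  # mark as most recently used
            else:
                cache[bucket_id] = None  # "fetch" from remote storage
                if len(cache) > cache_slots:
                    cache.popitem(last=False)  # evict least recently used
    return hits / total


if __name__ == "__main__":
    print(lru_hit_rate(cache_slots=20, distinct_buckets=20, rounds=5))
    print(lru_hit_rate(cache_slots=10, distinct_buckets=20, rounds=5))
```

With room for all 20 buckets, only the first sweep misses. With room for 10, every single read misses, because each bucket is evicted just before it is needed again.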

PickleRick
SplunkTrust

OK. Summing up what's been already said and then adding some.

The amount of data you're receiving affects several things:

1) Licensing. What @gcusello pointed out is indeed true - you can exceed your license for some time - but this is meant for unforeseen, unusual situations. You should not rely on constantly exceeding your license. Even if it does technically work (and judging by your license size your license is most probably a non-enforcing one, which means it will only generate a warning), that's not what you bought. And any contact with Splunk (be it a support case or a request for a license extension) might end up with uncomfortable questions about your license size and real usage. Of course if this is something that happens just once in a while, that's OK. And BTW, if you exceed your ingestion limit it's the searching which gets blocked with an enforcing license, not indexing - you will not (contrary to some competitors' solutions) lose your data.

2) Storage - this is kinda obvious. The more data you're ingesting, the more storage you need to hold it given constant retention period. Since Splunk rolls buckets from cold to frozen (by default that means deleting the data) based on size limit or age limit, whichever is hit first that means that if you don't have enough space allocated and configured for your indexes, even if you are able to ingest that additional amount of data, it will not be held for long enough because it will get deleted due to lack of space. So instead of holding data for - let's say - last two weeks, you'll have only two days of data because the rest will have been pushed out of the index.
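
The "two weeks becomes two days" effect is simple arithmetic. A minimal sketch, using made-up numbers (28 TB of index space, a 14-day age limit) purely for illustration:

```python
def effective_retention_days(capacity_tb: float,
                             daily_on_disk_tb: float,
                             age_limit_days: float) -> float:
    """Days of data actually kept: whichever limit (size or age) is hit first."""
    size_limited_days = capacity_tb / daily_on_disk_tb
    return min(age_limit_days, size_limited_days)


if __name__ == "__main__":
    # Hypothetical index: 28 TB of space, 14-day age limit.
    print(effective_retention_days(28, 2, 14))   # normal days: age limit wins
    print(effective_retention_days(28, 14, 14))  # peak days: size limit wins
```

Here `daily_on_disk_tb` is what lands on disk per day after compression and indexing overhead, not the raw license volume.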

3) Processing power. There are some guidelines for sizing Splunk environments. Of course real-life performance may differ from the rule of thumb for generalized cases, but still your cluster seems relatively small even for the amount of data you're receiving now (depending on how evenly spread the ingestion is across your sites it might already be hugely undersized), not to mention the additional data you'll be receiving normally, and definitely not with the DDoS data added. If you overstress the indexers you will clog your pipelines. That will create pushback because the forwarders won't be able to forward their data to the indexers, so they might stop reading data from their sources. It's only half-bad if the sources can be "paused" and queried later for the missing data, so you'll only cause lag. But if you have "pushing" sources (like syslog), you'll end up losing data.
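
As a rough illustration of the processing-power point, the sketch below divides daily volume by a per-indexer capacity. The 300 GB/day per indexer used here is only a placeholder assumption, not a figure from this thread or a measured value; substitute whatever your own sizing exercise gives you.

```python
import math


def indexers_needed(daily_tb: float, per_indexer_gb_per_day: float) -> int:
    """Indexers required if each one sustains per_indexer_gb_per_day (1 TB = 1000 GB)."""
    return math.ceil(daily_tb * 1000 / per_indexer_gb_per_day)


if __name__ == "__main__":
    ASSUMED_GB_PER_INDEXER = 300  # placeholder, not a measured value
    for daily_tb in (4, 10, 20):
        n = indexers_needed(daily_tb, ASSUMED_GB_PER_INDEXER)
        print(f"{daily_tb} TB/day -> {n} indexers (vs. 6 today)")
```

Even if your hardware handles several times the placeholder figure, the gap between 6 indexers and what 20 TB/day would require stays large.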

So licensing is the least of your problems.

isoutamo
SplunkTrust

One more comment. If you are already indexing that amount of data with so few indexers, I’m really surprised that you have an ingestion-based license! Especially when your normal amount is “small” but from time to time a DDoS can double it, I suggest that you ask about a CPU-based licensing model (SVC in cloud, some other name on-prem).
Anyhow, as others said, you must rearchitect your environment and add nodes and disk space based on your average daily usage, the retention time you need, and the queries you need to run. For that you need someone local with whom to discuss your scenarios and needs.

gcusello
SplunkTrust

Hi @Karthikeya ,

you can exceed the license limit up to 45 times in 60 calendar days without any violation (only a message).

So it shouldn't be a problem in your situation.

For more information, see https://www.splunk.com/en_us/resources/splunk-enterprise-license-enforcement-faq.html?locale=en_us
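
If you want to track how close you are to that limit, a rolling-window count is enough. This sketch uses the 45-in-60 figures quoted above as defaults and assumes the 45th warning inside the window is the one that counts as a violation; check the FAQ for the exact rule that applies to your license type.

```python
from datetime import date, timedelta


def warnings_in_window(warning_days, today, window_days=60):
    """Count distinct warning days in the window that ends on (and includes) today."""
    window_start = today - timedelta(days=window_days - 1)
    return sum(1 for d in set(warning_days) if window_start <= d <= today)


def is_violation(warning_days, today, max_warnings=45, window_days=60):
    # Assumption: reaching max_warnings within the window is a violation.
    return warnings_in_window(warning_days, today, window_days) >= max_warnings


if __name__ == "__main__":
    today = date(2024, 3, 1)  # arbitrary example date
    ten_bad_days = [today - timedelta(days=i) for i in range(10)]
    print(warnings_in_window(ten_bad_days, today), is_violation(ten_bad_days, today))
```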

Ciao.

Giuseppe

Karthikeya
Communicator

But how will our indexers accommodate this? That is my question. We have 6 indexers with 6.9 TB of disk space. What happens if we exceed this space in a single day?
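
A rough worst-case estimate of how fast that disk would fill can be sketched as follows. It reads the 6.9 TB as per indexer and assumes the whole disk is available to Splunk, nothing is evicted or deleted during the peak, a hypothetical on-disk ratio of 0.5, and a `copies` parameter standing in for however many replicated copies your cluster keeps locally. All of these are assumptions to replace with your real values.

```python
def days_until_full(indexers: int,
                    disk_tb_each: float,
                    raw_tb_per_day: float,
                    on_disk_ratio: float = 0.5,
                    copies: int = 1) -> float:
    """Days until local disk is full if nothing is ever evicted or deleted."""
    total_disk_tb = indexers * disk_tb_each
    written_per_day_tb = raw_tb_per_day * on_disk_ratio * copies
    return total_disk_tb / written_per_day_tb


if __name__ == "__main__":
    for copies in (1, 2, 3):
        days = days_until_full(6, 6.9, 20, copies=copies)
        print(f"copies={copies}: ~{days:.1f} days at 20 TB/day")
```

Under these assumptions a single 20 TB day would not fill the disks, but a few such days in a row without successful uploads or eviction would.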

gcusello
SplunkTrust

Hi @Karthikeya ,

if the question is about exceeding the license, you won't have problems as long as you exceed it fewer than 45 times in 60 days.

If the problem is the storage,

you could change the maximum size of the index where these logs are stored, so that older data is deleted sooner and you don't use all the disk space.

You could also change this maximum size only while you have excessive data ingestion and restore the normal value afterwards. In any case, the easiest method is to configure a maximum size for your indexes.

Ciao.

Giuseppe

Karthikeya
Communicator

@gcusello "if the problem is the storage" -- Yes, the problem is the storage: each of our 6 indexers has 6.9 TB.

"you could change the dimension of the index where these logs are stored so they will be deleted more frequently and you will not use all the disk space." - How do we do this? Please explain further; we have volumes configured in our environment.

richgalloway
SplunkTrust

In addition to adding storage, consider increasing the number of indexers.  Unless the indexers are very over-powered, you probably will need more of them to ingest double the amount of data.

---
If this reply helps you, Karma would be appreciated.