
Confusion about setting up data model acceleration

munang
Path Finder

Hello. I'm a Splunk newbie.

I'm confused about how data model acceleration works.

According to the official documentation, when data in your data model falls out of the summary range, Splunk continuously deletes it so that the data model stays up to date.


So, for example, if you accelerate a data model with a one-month summary range on a 0 12 * * * schedule:

1. Data from day -30 to day 0 is summarized.
2. A day passes.
3. Data from day -29 to day +1 is summarized.
4. Data from day -30 is deleted.

Is this process correct?

If this process is correct, why is it being done this way?

Also, is there a way to keep the information summarized through data model acceleration continuously, like a summary index, so that it isn't deleted?


PickleRick
SplunkTrust

The overall idea is more or less correct, but the details are a bit more complicated than that.

1. The summary-building search is spawned according to the schedule and builds the summary data similarly to how indexed fields are written at ingest time (in fact, accelerated summaries are stored in .tsidx files, the same way indexed fields are).

2. The accelerated summaries are stored in buckets corresponding to the buckets of raw data.

3. The old buckets are not removed by the summary-building process but - as far as I remember - by the housekeeper thread (the same one that is responsible for rolling event buckets).

So it's not a straightforward FIFO process. Also, the summary range is not a 100% precise setting. Because data is stored in buckets and managed as whole buckets, you might still have some parts of your summaries exceeding the defined summary range.

Another thing worth noting (because I've seen such questions already): no, you cannot have a longer acceleration range than the event data retention. When an event bucket is rolled to frozen, the corresponding data model summary bucket is deleted as well.
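For illustration only, a minimal sketch of what these acceleration settings might look like in datamodels.conf - the model name My_Model is a placeholder, and the values just mirror the example discussed above:

[My_Model]
acceleration = true
# Summary range: keep roughly one month of summarized data
acceleration.earliest_time = -1mon
# Spawn the summary-building search daily at 12:00
acceleration.cron_schedule = 0 12 * * *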

munang
Path Finder

@PickleRick 

 

Thank you very much for your reply.

So what you are saying is that the data model summary is stored as .tsidx files, and those files are also stored in buckets.

Are you saying that the life cycle in which these .tsidx files are created and deleted follows the same bucket-rolling rules as a regular index (hot/warm -> cold -> frozen)?


However, there is something I am still confused about.

Is the entire summary range re-summarized every 5 minutes, even though the runs overlap over most of the range, as shown in the attached picture?

[image: munang_0-1714985019114.png]

 

And is the entire summary kept for the retention period?

Or is only the most recent data kept for the retention period?

 

[image: munang_1-1714985065648.png]

 


PickleRick
SplunkTrust

The datamodel accelerated summary is indeed stored in a bucket, but it can be (though usually isn't) stored on a different piece of storage (in a different directory or on a different volume). It is still organized in buckets, and each raw data bucket has its corresponding DAS bucket.
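As a sketch of that storage layout (the index, volume, and path names here are made up), the location of the DAS buckets is controlled per index in indexes.conf via tstatsHomePath, which must point into a volume:

# indexes.conf
[volume:das_volume]
path = /mnt/fast_storage

[my_index]
homePath = $SPLUNK_DB/my_index/db
coldPath = $SPLUNK_DB/my_index/colddb
thawedPath = $SPLUNK_DB/my_index/thaweddb
# DAS buckets for this index live here, each one corresponding
# to a raw data bucket of the index
tstatsHomePath = volume:das_volume/my_index/datamodel_summary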

You're still thinking in terms of plain time periods, whereas data is stored and rolled by bucket. Buckets can have overlapping time ranges, or one bucket can even "contain" another's whole time range.

Also, the time range for summaries works a bit differently. The backfill range doesn't produce duplicates; it updates the data (I'm not sure about the gory technical details underneath - maybe it does internally keep some duplicates and just shows the most current generation, but effectively it shows the "current" state). It's meant as a way to keep the DAS current even if lagged events are ingested after the initial summary search run has already completed.

So don't try to overoptimize too early 😉

munang
Path Finder

@PickleRick 

 

Thank you very much for your reply.

And sorry for asking so many questions.

I looked into the backfill range and found that it is a value related to system load.

Is it summarized separately from the previously configured summary range?

1. The summary range is one month; if summarization is interrupted partway due to system load, the backfill range is used instead.
(= the backfill range is a fallback when the summary range does not complete)

2. The summary range is one month; after all of that month's data is summarized, the next 7 days of data are filled in as the backfill range.
(= it operates as summary range + backfill range)

Which of the two concepts is more accurate?


PickleRick
SplunkTrust

I find the description in the docs a bit confusing.

The summary range is the logical equivalent of the retention period for indexes - it tells you for how long (approximately - see the remark on buckets) the DAS will be stored.

The backfill range is how far back the data will be searched - i.e., which range each summarization search run will update. A backfill range of 15 minutes means that each summarization search is launched with earliest=-15m.

Those parameters are not directly related to system load, but they can affect system load, and system load can affect summarization searches.

Since there is a limit on concurrent summarization searches, and summarization searches have the lowest priority of all searches, the summarization parameters can determine whether the searches successfully run at all.

For example, suppose you have a fairly active index storing network flows, you set a backfill range of a year, you tell Splunk to spawn the summarization search every minute, and you additionally limit concurrent summarization searches to 1. There is no way in heaven or hell that this configuration ever runs without skipped searches. And if you set the maximum summarization search run-time to 5 minutes, your acceleration will probably never finish building the summaries, because each run will be spawned and killed at 5-minute intervals (not every single minute, as you'd be hitting the concurrent-search limit).
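To make that failure mode concrete, here is a sketch of such a deliberately bad configuration in datamodels.conf (the model name and values are illustrative; note that acceleration.backfill_time must be a narrower window than acceleration.earliest_time):

[Network_Flows]
acceleration = true
acceleration.earliest_time = -2y
# Backfill a whole year on every run - far too wide for an active index
acceleration.backfill_time = -1y
# Spawn the summarization search every minute
acceleration.cron_schedule = * * * * *
# Kill each summarization run after 5 minutes (value is in seconds)
acceleration.max_time = 300
# Allow only one concurrent summarization search
acceleration.max_concurrent = 1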

 

munang
Path Finder

@PickleRick 

 

Thank you very much for your reply.

If this is what you said, then as shown in the picture below:
The summary range refers to the period for which data is preserved, and the backfill range refers to how far back each summarization run searches.

Also, can I understand that the reason there is data physically outside the configured summary range is that the summary data is stored in bucket units and rolled in bucket units?

[image: munang_0-1715084471670.png]

 


PickleRick
SplunkTrust

Yes. Bucketing works for summary data like it does for normal events within an index. The whole bucket is rolled when the _most recent_ event in the bucket is older than the retention period for the index - that's why you can have data older than the retention period in your index. It works the same way here: if your bucket "overlaps" the boundary of the summary range, the whole bucket will be available.
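If you want to see that bucket overlap for yourself, a quick sketch of a search that lists each bucket's time span (the index name is a placeholder):

| dbinspect index=my_index
| eval startTime=strftime(startEpoch, "%F %T"), endTime=strftime(endEpoch, "%F %T")
| table bucketId, state, startTime, endTime, eventCount

Buckets whose startTime/endTime straddle the retention (or summary range) boundary are the ones that keep data beyond the configured period.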

munang
Path Finder

@PickleRick 

 

Thank you so much for answering my questions over such a long period of time.

Thanks to you, I now understand what was confusing me about data models.

Reading the docs again, I realized I had been thinking about it the wrong way.

Thank you.


gcusello
SplunkTrust

Hi @munang,

first of all, you can configure whatever retention you want for your data model, so if you want a longer retention time you can configure it - you just need more storage: the storage required for one year of data model summaries is around 3.4 times the average daily indexed volume.

Accelerated data models are usually used for the searches that must be very fast; if you need to search older data, you can also use the raw data in your indexes or in summary indexes.
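For example, if you really need summaries that never age out with the acceleration, one common pattern is a scheduled search that copies aggregates from the accelerated model into a summary index, which then follows its own retention settings. A sketch, assuming the CIM Network_Traffic model and a pre-created index named my_summary_index:

| tstats summariesonly=true count from datamodel=Network_Traffic by _time span=1h, All_Traffic.src
| collect index=my_summary_index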

As I said, the last 30 days usually cover the data for more than 85% of searches, and those are the ones you need to be fast.

Ciao.

Giuseppe


munang
Path Finder

Hi, @gcusello 

Thank you very much for your reply.

However, there is something I am still confused about.

 

1. The exact meaning of the data retention period

For example, if you set the data retention period to 1 year, does that mean the initially accelerated summary data will be kept for 1 year?

 

2. The meaning of the summary range

Assuming the summary range is set to one month and the cron expression is */5 * * * *:

If one month's worth of data is summarized every 5 minutes, the latest data keeps getting summarized every 5 minutes. Once data falls outside the one-month range, is it deleted?

I would appreciate your reply.

Thank you


gcusello
SplunkTrust

Hi @munang ,

answering your questions:

1) You'll have one year of data in your DM: if you have one year of data in your indexes, you'll load it all into the DM; if you have less, you'll load all the data you have and keep it for one year.

2) I don't fully understand your question: you load the last 5 minutes of data into the DM every 5 minutes; when data exceeds the retention period, it is deleted.

Ciao.

Giuseppe


munang
Path Finder

@gcusello 

 

Thank you very much for your reply.

Then can I understand it this way: the summary range used when defining data model acceleration is not the period for which data is stored, but the period over which the data is guaranteed to be kept up to date, and the data itself is preserved according to the retention period set on the index?

Looking through old posts, it seems a bit confusing.

https://community.splunk.com/t5/Reporting/How-far-back-in-time-will-datamodel-data-be-kept/m-p/13725...


gcusello
SplunkTrust

Hi @munang ,

retention on indexes is a different (and unrelated) thing from the DM acceleration period: they can be (and usually are) different. Data is stored in indexes for the retention period, while data in the DM is stored for the window that covers most searches.
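For example (the index and model names here are placeholders), the two periods come from two separate settings:

# indexes.conf - raw event retention, e.g. one year
[my_index]
frozenTimePeriodInSecs = 31536000

# datamodels.conf - acceleration summary range, e.g. one month
[My_Model]
acceleration = true
acceleration.earliest_time = -1mon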

Ciao.

Giuseppe
