Knowledge Management

How do data models work in Splunk?

Taruchit
Contributor

Hello All,

I need your assistance with the below details about data models:

1. What is the lifecycle of a Splunk data model?

2. How does Splunk log events in the _internal index when it executes each phase of a data model?

Any information or guidance will be helpful.

Thank you
Taruchit

0 Karma

VatsalJagani
SplunkTrust
SplunkTrust

@Taruchit - Just to summarize what @PickleRick explained:

 

  • Data Model is a structure of data.
    • It just represents what your data should look like.
    • It can generalize many data sources into the same data format (like the CIM data models).
      • Again, a data model does not do or execute anything on its own; it just represents the structure. Add-ons have to provide the configuration to comply with the data model, and for custom data models you have to write those configs yourself.
  • Data Model Summarization / Acceleration
    • When you have the data model ready, you accelerate it.
    • What it does:
      • It runs a summarization search on a schedule (every 5 minutes by default) and stores different values for the fields present in the data model.
      • For example, if your data model has 3 fields: bytes_in, bytes_out, group
        • Then the data model precomputes things like sum(bytes_in), sum(bytes_out), max(bytes_in), max(bytes_out), values(bytes_in), values(bytes_out), values(group), etc.
    • So, when you run a | tstats query to find the results, Splunk doesn't have to search through the _raw data; instead it looks at the summaries created by data model acceleration (see the sketch after this list).
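For illustration, a minimal tstats sketch against an accelerated data model. The data model name Network_Traffic, its root dataset All_Traffic, and the field names are assumptions taken from the CIM; adjust them to your own data model:

| tstats summariesonly=false sum(All_Traffic.bytes_in) AS total_bytes_in sum(All_Traffic.bytes_out) AS total_bytes_out
    FROM datamodel=Network_Traffic
    WHERE All_Traffic.action="allowed"
    BY All_Traffic.src_ip

With acceleration enabled, this is answered from the summaries instead of scanning the raw events.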

 

I hope this helps!!! Kindly upvote if it does!!!

Taruchit
Contributor

Thank you @VatsalJagani for sharing the details.

So, as I understand it, my main question shifts a bit: I would like to know how to fetch the logs of data model acceleration, because data model acceleration runs at regular intervals to fetch and summarize the data for end users.

Thus, I would like to know how to find the set of events that get logged during the fetching and summarization of data, so that if data model acceleration fails to complete, and by implication fails to fetch and summarize the latest events, I can use those logs to trace the problem and work on its resolution.

Through trial and error, I have also found some logs related to data model acceleration with the below SPLs:

index=_internal search_type="datamodel_acceleration"
index=_audit search_type="datamodel_acceleration"

However, if you could suggest a better approach or method to fetch these events, it would be very helpful.

Thank you

0 Karma

VatsalJagani
SplunkTrust
SplunkTrust

@Taruchit - Please have a look at the Monitoring Console.

0 Karma

Taruchit
Contributor

@VatsalJagani: I do not have access to the Monitoring Console. In that case, how should I approach fetching the results with SPL?

Thank you

0 Karma

PickleRick
SplunkTrust
SplunkTrust

What do you mean by "lifecycle" in the datamodel context? The typical mistake is to say "datamodel" but mean "accelerated datamodel summary data".

Again - what do you mean by "execute phases of a datamodel"?

0 Karma

Taruchit
Contributor

Hi @PickleRick,

I want to understand the activities that are carried out when a data model runs. 

By lifecycle I mean: just as we have different stages in the data lifecycle and the search lifecycle in Splunk, what are the broad-level stages that get executed when a data model runs?

After understanding the stages of execution, I would like to learn how to fetch and interpret the corresponding logs that Splunk writes.

The goal of this learning is to get backend visibility into data models: how they operate, what the sequence of processes is, and how to trace the logs recorded on completion of each process. These details will help in the scenario where a data model fails to run or fails to complete; a fundamental understanding of the process, the context, and the relevant logs generated by Splunk will together help me comprehend the issue and work on resolving it.

Please share if you need any more clarification from my end.

Thank you

0 Karma

PickleRick
SplunkTrust
SplunkTrust

Again - there is no such thing as "when datamodel runs".

A datamodel is just a collection of definitions - usually a base search and a list of (possibly calculated) fields.

So, simplifying it a bit: if you search for something "from datamodel", Splunk translates your search into a search against your underlying data and runs that search.

And yes, I'm nitpicking here a bit because I want you to understand the difference between a datamodel and its acceleration.
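For illustration, a minimal sketch of that translation, assuming the CIM Network Traffic data model is installed (the model name, its root dataset All_Traffic, and the base constraint shown are assumptions taken from the CIM; check your own model's definition):

| from datamodel:"Network_Traffic.All_Traffic"

behaves roughly like running the model's base constraint search over the raw events, with the model's field aliases and calculated fields applied at search time - something along the lines of:

tag=network tag=communicate

The | datamodel Network_Traffic All_Traffic search command is another way to express the same thing.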

Taruchit
Contributor

Hi @PickleRick,

Thank you for sharing your time and inputs and posting the details. It is very helpful to understand.

As I understand it, data models help to store similar data and save the time of fetching it with separate SPL searches. However, data models also execute searches at defined regular intervals to get the latest events, which are then summarized, computed, and stored on the indexers, where they can later be used by SPL for faster access.

Thus, I am interested in learning about the backend process and the stages of the regular execution carried out for data models: computation, summarization, and storage of results.

Please share if there is any correction needed in my understanding of the same.

Thank you

0 Karma

PickleRick
SplunkTrust
SplunkTrust

OK. Again - I think you're mixing two different, but connected, things.

One is the datamodel itself. It's the abstraction layer that I described before. It's just a configuration definition. On its own it doesn't produce or store any data (OK, you may have calculated fields in the datamodel but that's not much different from calculated fields for your events).

So that's the datamodel as such - it defines what "layout" of data you should have, and if you want your events to be compatible with your datamodel, you have to define your sourcetypes so that the events yield the relevant fields. That's what most good TAs do - they make the events CIM-compatible by providing proper aliases and calculated fields. For example, if the CIM Network Traffic datamodel defines a "dest_ip" field and your firewall data contains a field named "dstip", a proper TA will contain a field alias which presents dstip also as dest_ip, so that it can be found by that name.
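A minimal sketch of such an alias in props.conf; the sourcetype stanza name and the alias class name are made up for illustration, a real TA would use its own:

# props.conf (search-time configuration shipped by the TA)
[acme:firewall]
# expose the vendor field dstip under the CIM name dest_ip as well
FIELDALIAS-cim_dest_ip = dstip AS dest_ip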

That's pretty much what the datamodel itself is.

But another thing is datamodel acceleration. And that's what I'm talking about from the start. Datamodel and datamodel acceleration are two different things (although people often confuse them).

Datamodel acceleration is a mechanism which runs a scheduled search and builds a summary based on the datamodel definition, so that if you search from the datamodel, Splunk doesn't have to go through the raw data and extract all the fields on the fly but can get them from the indexed summaries. It's very similar to indexed fields (and it even uses the same type of files in the backend).

So if you enable acceleration for a particular datamodel, Splunk will run a datamodel acceleration summary search according to the schedule configured for that acceleration and will update the summaries for the given datamodel.
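A minimal sketch of what that can look like in datamodels.conf when acceleration is enabled (the model name and the specific values are assumptions for illustration; in practice you usually toggle this in Settings > Data models):

# datamodels.conf
[Network_Traffic]
acceleration = true
# how far back the summaries are built and retained
acceleration.earliest_time = -7d
# how often the summary search runs; */5 * * * * (every 5 minutes) is the default
acceleration.cron_schedule = */5 * * * *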

Then, if you search from an accelerated datamodel, results for the time range covered by the already-built summaries will be retrieved from those pre-built summaries, while results for the most recent events (or events older than the acceleration time range) will be calculated directly from the raw data.

If you specify "summariesonly=t" with your search (or tstats), Splunk will use _only_ the accelerated summaries and will not reach for the raw data (which means that your results will not contain the most recent events - those that have not yet been summarized).
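For illustration, the difference in a tstats sketch (again assuming the accelerated CIM Network Traffic datamodel; dataset and field names may differ in your environment):

| tstats summariesonly=t count FROM datamodel=Network_Traffic BY All_Traffic.src_ip

returns counts strictly from the pre-built summaries, while the same search with summariesonly=f (the default) also computes results from the raw events for time ranges that have not been summarized yet.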

One important thing - datamodel acceleration is not available for Splunk Free whereas normal datamodel functionality is.

Taruchit
Contributor

Hi @PickleRick,

Thank you for your response. Your inputs have helped me understand data models and data model acceleration, and have also allowed me to read through the documentation and tutorials again with more clarity.

When I execute the below SPLs, can you please confirm whether I am getting the events of the scheduled search that runs periodically to fetch, summarize, and store records for the data models? If not, what may I be looking at in the events fetched?

index=_internal search_type="datamodel_acceleration"
index=_audit search_type="datamodel_acceleration"

And do we have any other way to fetch the logs of the scheduled searches that run when data model acceleration is enabled?

Please let me know if there is anything that I have misunderstood or wrongly stated in the above message so that I can rectify it.

Thank you

0 Karma

PickleRick
SplunkTrust
SplunkTrust

Yes, you can list instances of acceleration summarizing searches with

index=_internal search_type="datamodel_acceleration"

You can also show the parameters of accelerated datamodels, if you are allowed to use REST calls, with

| rest /services/datamodel/acceleration 
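A hedged variant that looks at the scheduler log directly; the _ACCELERATE_* saved search naming and the status field are assumptions based on how scheduler.log typically records these searches, so verify them in your environment:

index=_internal sourcetype=scheduler savedsearch_name="_ACCELERATE_*"
| stats count BY savedsearch_name, status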

Taruchit
Contributor

Thank you for sharing.

0 Karma

VatsalJagani
SplunkTrust
SplunkTrust

@Taruchit - Did the answer from @PickleRick resolve your query? If so, kindly accept it by clicking the "Accept as Solution" button under the correct answer/comment.

0 Karma