Creating Custom Accelerated Data Models

cyberdiver
Explorer

Quick overview:

I want to control the index-time extraction for events linked to an accelerated data model...

I am relatively new to Splunk, and I've recently jumped into Accelerated Data Models (ADMs). I think I already understand a number of things about them:

  • How they differ from regular data models
  • How they do index-time extractions
  • That they are stored in the HPAS (High Performance Analytics Store)
  • That they are populated by scheduled searches

What I don't understand is how those summaries for the Accelerated Data Models are built.  I understand that ADMs use tsidx files as the summaries of the raw data.

"Each search you run scans tsidx files for the search keywords and uses their location references to retrieve from the rawdata file the events to which those keywords refer.  Splunk Enterprise creates a separate set of tsidx files for data model acceleration. In this case, it uses the tsidx files as summaries of the data returned by the data model."

What I don't understand is how the connection between the raw data and the .tsidx files is made. How are the .tsidx files formed from the event data?

When I look at the data model's object hierarchy in Settings, I see the fields that it encompasses:

[Screenshot: cyberdiver_0-1635023080749.png]

When I run a search like:

| datamodel Intrusion_Detection search

If I'm correct, this gives me the search-time extractions from the indexes related to the accelerated model.

The problem is that I get a lot of fields that are useless for cyber security work. For instance, maybe I want to know the category of the different attacks that are occurring. It is a calculated field in my accelerated data model. The calculation is if(isnull(category) OR category="", "unknown", category), which means it returns the category unless there is none. I also don't understand where it gets this variable "category". How is that being pulled from the raw data?

[Screenshot: cyberdiver_0-1635024781474.png]

The problem is that I get 100% unknowns.

Is this a problem with event tagging for the Common Information Model, or somewhere else in the flow of ingested data? For reference: https://wiki.splunk.com/images/4/45/Splunk_EventProcessing_v19_0_standalone.pdf

In the end here is what I want to know to fix this:

  • How do I control what it pulls out of the raw event data?
    • Where is the regex taking place?
    • Is this something to configure with the .tsidx summaries in the indexers?
  • When I have data like "geo-location" or "web-app" in the raw data used with a data model (data that I think is useful), how do I pull that data out into a field that I can use in my accelerated data model?
  • What does the Common Information Model have to do with Accelerated Data Models?
    • Is that where I configure what it pulls out of raw event data?
  • In general, how do I make more custom accelerated data models that pull out new data from events?

Additionally, I understand that pulling more fields out of the data also means an increase in storage on the indexers. I just want to figure this all out.😁

[EDIT]

Is this where I would use the Splunk Add-on Builder app?

PickleRick
Champion

Data models are accelerated by Splunk building summaries and updating them incrementally. The summaries are built by scheduled searches that the scheduler spawns. Acceleration by itself has nothing to do with index-time extractions.
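
To see that the summaries are a product of scheduled searches rather than index-time processing, it can help to query them directly. A minimal sketch, assuming the stock CIM Intrusion_Detection model and its IDS_Attacks root dataset (adjust both names if your model differs):

```
| tstats summariesonly=true count from datamodel=Intrusion_Detection by IDS_Attacks.category
```

With summariesonly=true, tstats reads only the acceleration summaries, so events the summarization search has not processed yet are not counted. Setting it to false lets the search fall back to the unsummarized data for anything the summaries do not cover.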

As for the various fields: it's up to your admin (or data admin, if you have a separate role for this) to make your data CIM-compliant. The CIM app on its own doesn't "do" anything. It just provides you with a schema to fill in (you can compare it to abstract classes in programming). It's the same schema for everyone using the app, hence the name Common Information Model.

So after installing the CIM app, you must make sure your data actually matches what CIM expects: the appropriate fields must be calculated if they are not in your raw data, and the events must be tagged properly. With many TAs, this is done automatically by the add-on for the sourcetypes it covers.
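
A quick way to check both conditions is to search on the tags the data model selects by. This is only a sketch: it assumes the Intrusion_Detection model picks up events tagged ids and attack, as the stock CIM model does, so adjust the tags to whatever constraint your model actually shows.

```
tag=ids tag=attack earliest=-24h
| eval category_status=if(isnull(category) OR category="", "missing", "present")
| stats count by index, sourcetype, category_status
```

If this returns nothing, the events are not tagged and never reach the model. If everything lands in "missing", the events are tagged but no search-time extraction produces a category field, which is what would make the calculated field return "unknown" every time.

In the second case, an extraction can be prototyped at search time before making it permanent. The index, sourcetype, and the category=... pattern below are hypothetical placeholders, not anything taken from your data:

```
index=your_ids_index sourcetype=your:ids:sourcetype earliest=-1h
| rex field=_raw "category=(?<category>[^,\s]+)"
| stats count by category
```

Once a pattern like this works, it would normally be saved as a search-time field extraction for that sourcetype (through Settings > Fields, or in the add-on's props.conf). Summaries built after that should include the field, while time ranges that were already summarized need a rebuild.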
