Re: Schema Accelerated Event Search performance

awmorris · ‎03-07-2019

I am super stoked about the potential of Schema Accelerated Event Searches. It might be one of the best improvements I've seen, if I could actually get it to work, but it doesn't. 😞

Don't focus on the fact that I'm only returning the count of events. Performance doesn't differ if I return the raw events (which is ultimately what I want to do); I'm just doing the count so I can make an apples-to-apples comparison.

So consider the following two searches over 15 minutes of data:

SEARCH # 1

|tstats summariesonly=true count from datamodel="Web" where Web.user="dmerritt"

The value returned was 25. The search itself took 2.676 seconds

SEARCH # 2

|from datamodel Web|search user=dmerritt|stats count

The value returned was 106. The search itself took 2 minutes, 14 seconds.

QUESTIONS:
1) Why the HUGE difference in performance?
2) Why is the result count different?

NOTE : Am running Splunk 7.1.5

nick_cribl · ‎03-17-2019

You're seeing count and performance differences because | from and | datamodel run in "mixed mode" by default (and that is the only option in 7.1). There were plans to add a summariesonly option to | datamodel; however, it appears that hasn't been added (allow_old_summaries does look like it was added in 7.2). You're likely to see a count difference between tstats summariesonly=t and | (from|datamodel) searches because of this, since the latter will also search the hot buckets for new events that have yet to be summarized.

To get an apples-to-apples comparison on performance, try

|from datamodel Web|search user=dmerritt| noop directive.read_summary=f

against

|from datamodel Web|search user=dmerritt

That noop command should disable Schema Accelerated Event Search.

As for only datamodel-defined fields appearing in these searches: this was the original design of the | datamodel command; however, somewhere along the way it broke and all fields were being returned. To implement Schema Accelerated Event Search we had to fix this bug, since only the fields defined within the data model are stored in the accelerated index, and leaving the bug in place broke the implementation.

woodcock · ‎03-07-2019

First of all, I wouldn't use | from datamodel because it was recently broken and no longer returns all fields (only the ones in the datamodel). Instead, use the macro described here:
https://answers.splunk.com/answers/716936/splunk-server-field-is-not-available-when-we-searc.html#an...
Then do this:

`SIEMMacro_datamodelCIM(Web, Web)` user="dmerritt" | stats count

Or possibly this:

`SIEMMacro_datamodelCIM(Web, Web)` TERM(user=dmerritt) | stats count

Notice that there is no pipe ( | ) between the macro and the filter, only before stats; the filter runs as part of the base search, which is why this macro makes these searches way faster.

Now, the tstats search returns fewer results because the data model acceleration (DMA) always runs behind, usually by less than 5 minutes, and summariesonly=true counts only events that have already been summarized. This is why you often see tstats searches with Time picker values of earliest=-65m latest=-5m. So for a test, run all the searches over a window a full day back by adding earliest=-1d@d latest=-1d@d+1h to each search, and you should get the same result from every search.
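
A minimal sketch of that test for the tstats search, with the window written inline in the where clause; the field and user values are the ones from the question. For the | from search, the assumption here is that the same window is selected in the Time picker rather than typed into the search.

```spl
| tstats summariesonly=true count from datamodel="Web" where Web.user="dmerritt" earliest=-1d@d latest=-1d@d+1h
```

Because the window ends well before now, the summaries should already cover it, so the lag should no longer affect the count.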

The huge difference in performance is because the tstats command gets its results from a metadata index that summarizes the raw data, so it does not have to unzip the raw data (journal.gz) files to get the answers.

To see that I am right, swap the boolean on summariesonly like this:

|tstats summariesonly=false count from datamodel="Web" where Web.user="dmerritt"

You will see that it returns all of the results, but is much slower.
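
To put the two numbers side by side in one result, something like the following could work. It is only a sketch: the output field names (summarized_only, summarized_plus_raw, not_yet_summarized) are made up for illustration, and both halves are assumed to run over the same time range.

```spl
| tstats summariesonly=true count AS summarized_only from datamodel="Web" where Web.user="dmerritt"
| appendcols
    [| tstats summariesonly=false count AS summarized_plus_raw from datamodel="Web" where Web.user="dmerritt"]
| eval not_yet_summarized = summarized_plus_raw - summarized_only
```

A non-zero not_yet_summarized would point to events that exist in the index but have not yet made it into the acceleration summaries.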

P.S. If this is the A.Morris that I think that it is, I emailed Daneil about this macro months ago.

awmorris · ‎03-16-2019

This is something slightly different, although I'll give you a nod that "|from datamodel" appears terribly broken. Here's the background: I was talking with a Splunk employee who was lauding the recent improvements in Splunk. Specifically, he said that the data models now include a "hidden" pointer back to the actual raw event. This means you can search a data model to get the speed benefits of accelerated data models BUT your search can now return the FULL raw event, not just the data contained within the data model. Clearly this is SUPER useful because it opens a world of new possibilities. The obvious limitation is that the initial search constraint must be in the data model itself. It is also worth noting this same feature was mentioned by David Veuve in his Security Ninjitsu preso @ .conf2018.

The problem is that it doesn't work as advertised. 😞

woodcock · ‎03-17-2019

Do tell! How is this pointer accessed?

my2ndhead · ‎03-22-2019

Note that you can add a | extract after | from datamodel: and you will get fields that are not in the datamodel!

awmorris · ‎03-22-2019

Can you provide an example? I tested and my experience differs. I thought extract simply broke apart key/value pairs.

my2ndhead · ‎03-25-2019

Just like this, e.g.:

| from datamodel:Authentication 
| extract

vs.

index=* source="XmlWineventlog:Security" tag=authentication  NOT (user=*$ action=success )

The number of fields will not be the same, as extract does not add field aliases. I compared this with fieldsummary.
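
One way to make that comparison, sketched under the assumption that both searches run over the same time range: append fieldsummary to each search and compare the lists of field names it returns.

```spl
| from datamodel:Authentication
| extract
| fieldsummary
| fields field count distinct_count
```

Adding the same two trailing commands to the index search gives the second list to compare against.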

woodcock · ‎03-22-2019

It depends if they are encoded in _raw. Sometimes they are not.
