
Schema Accelerated Event Search performance

awmorris — Thu, 07 Mar 2019 16:30:00 GMT

I am super stoked about the potential of Schema Accelerated Event Searches. It might be one of the best improvements I've seen, if I could actually get it to work, but it doesn't. 😞

Don't focus on the fact that I'm only returning the count of events. Performance doesn't differ if I return the raw events (which is ultimately what I want to do); I'm just doing the count so I can make an apples-to-apples comparison.

So consider the following two searches over 15 minutes of data:

SEARCH # 1

|tstats summariesonly=true count from datamodel="Web" where Web.user="dmerritt"

The value returned was 25. The search itself took 2.676 seconds

SEARCH # 2

|from datamodel Web|search user=dmerritt|stats count

The value returned was 106. The search itself took 2 minutes, 14 seconds.

QUESTIONS:
1) Why the HUGE difference in performance?
2) Why is the result count different?

NOTE: I am running Splunk 7.1.5

Re: Schema Accelerated Event Search performance

woodcock — Fri, 08 Mar 2019 03:02:09 GMT

First of all, I wouldn't use | from datamodel because it was recently broken and no longer returns all fields (only the ones in the datamodel). Instead use the macro described here:
https://answers.splunk.com/answers/716936/splunk-server-field-is-not-available-when-we-searc.html#answer-717058
Then do this:

`SIEMMacro_datamodelCIM(Web, Web)` user="dmerritt" | stats count

Or possibly this:

`SIEMMacro_datamodelCIM(Web, Web)` TERM(user=dmerritt) | stats count

Notice that there is no pipe ( | ) between the macro and the user filter, so the filter is part of the base search; that is why this macro makes these searches way faster.

Now, the tstats summariesonly=true search returns fewer results because the data model acceleration (DMA) will always run behind, usually by less than 5 minutes. This is why you often see tstats searches with Time picker values of earliest=-65m latest=-5m. So for a test, run all the searches over a window from a full day back by adding this to each search: earliest=-1d@d latest=-1d@d+1h, and you should get the same result from every search.
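As a rough sketch of that test (assuming the Web data model is accelerated and yesterday's summaries are fully built), the time modifiers can go straight into the tstats where clause:

```spl
| tstats summariesonly=true count from datamodel=Web where Web.user="dmerritt" earliest=-1d@d latest=-1d@d+1h
```

For the | from search, selecting the same one-hour window in the Time picker avoids any doubt about where the time modifiers belong:

```spl
| from datamodel Web | search user=dmerritt | stats count
```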

The huge difference in performance is because the tstats command is getting the results from a metadata index that summarizes the raw data and does not have to unzip the raw data ( journal.gz ) files to get the answers.

To see that I am right, swap the boolean on summariesonly like this:

|tstats summariesonly=false count from datamodel="Web" where Web.user="dmerritt"

You will see that it returns all of the results, but is much slower.
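One way to see where the summaries stop, sketched here on the assumption that the Web data model is accelerated, is to split the same count by minute over the last 15 minutes and run it once with each setting:

```spl
| tstats summariesonly=true count from datamodel=Web where Web.user="dmerritt" by _time span=1m
```

```spl
| tstats summariesonly=false count from datamodel=Web where Web.user="dmerritt" by _time span=1m
```

Minutes that are missing or lower in the first result than in the second have not been fully summarized yet.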

P.S. If this is the A.Morris that I think that it is, I emailed Daneil about this macro months ago.

Re: Schema Accelerated Event Search performance

awmorris — Sun, 17 Mar 2019 03:59:20 GMT

This is something slightly different, although I'll give you a nod that the "|from datamodel" appears terribly broken. Here's the background: I was talking with a Splunk employee who was lauding the recent benefits in Splunk. Specifically, he said that the data models now include a "hidden" pointer back to the actual raw event. This means you can search a data model to get the speed benefits of accelerated data models BUT your search can now return the FULL raw event, not just the data contained within the data model. Clearly this is SUPER useful because this opens a world of new possibilities. The obvious limitation is that the initial search constraint must be in the data model itself. It is also worth noting this same feature was mentioned by David Veuve in his Security Ninjitsu preso @ .conf2018.

The problem is that it doesn't work as advertised. 😞

Re: Schema Accelerated Event Search performance

woodcock — Sun, 17 Mar 2019 17:24:29 GMT

Do tell! How is this pointer accessed?

Re: Schema Accelerated Event Search performance

nick_cribl — Sun, 17 Mar 2019 21:47:45 GMT

The reason you're seeing count and perf differences is that | from and | datamodel run in "mixed mode" by default (it is the only option in 7.1). There were plans to add a summariesonly option to | datamodel; however, it appears that hasn't been added ( allow_old_summaries does look like it was added in 7.2). You're likely to see a count difference between tstats summariesonly=t and | (from|datamodel) searches because of this, since the latter also search the hot buckets for new events that have yet to be summarized. To get an apples-to-apples comparison on performance, try |from datamodel Web|search user=dmerritt| noop directive.read_summary=f against |from datamodel Web|search user=dmerritt. That noop command should disable Schema Accelerated Event Search.
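Spelled out as two separate searches to run over the same time range, that comparison would look like this. It is only a sketch: the noop directive is the one named above, and its behavior on versions other than 7.1 is not confirmed here.

With Schema Accelerated Event Search disabled:

```spl
| from datamodel Web | search user=dmerritt | noop directive.read_summary=f
```

With the default behavior:

```spl
| from datamodel Web | search user=dmerritt
```

Comparing the run times of the two jobs in the Job Inspector should show how much the summaries are actually helping.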

As for only datamodel-defined fields appearing in these searches: this was the original design of the | datamodel command; however, somewhere along the way, this broke and all fields were being returned. To implement Schema Accelerated Event Search, we had to fix this bug, since only the fields defined within the data model are stored within the accelerated index and leaving the bug in place broke the implementation.

Re: Schema Accelerated Event Search performance

my2ndhead — Fri, 22 Mar 2019 13:54:35 GMT

Note that you can add a | extract after | from datamodel: and you will get fields that are not in the datamodel!

Re: Schema Accelerated Event Search performance

awmorris — Fri, 22 Mar 2019 16:30:38 GMT

Can you provide an example? I tested and my experience differs. I thought extract simply broke apart key/value pairs.

Re: Schema Accelerated Event Search performance

woodcock — Fri, 22 Mar 2019 16:49:36 GMT

It depends if they are encoded in _raw. Sometimes they are not.

Re: Schema Accelerated Event Search performance

my2ndhead — Mon, 25 Mar 2019 20:16:28 GMT

Like this, for example:

| from datamodel:Authentication 
| extract

vs.

index=* source="XmlWineventlog:Security" tag=authentication  NOT (user=*$ action=success )

The number of fields will not be the same, as extract does not add field aliases. I compared the two with fieldsummary.
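A sketch of that comparison, assuming both searches run over the same time range: append fieldsummary to each and compare the resulting field lists. The trailing fields command just trims the output to the columns that matter here.

```spl
| from datamodel:Authentication
| extract
| fieldsummary
| fields field count distinct_count
```

```spl
index=* source="XmlWineventlog:Security" tag=authentication NOT (user=*$ action=success)
| fieldsummary
| fields field count distinct_count
```

Fields that show up only in the second result are the ones extract did not recover, such as aliased fields.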