Solved: Why is base search more expensive?

DG · ‎07-12-2022

Dear Community,

I would like some assistance and/or clarification regarding Splunk’s base-search/post-processing functionality. I have read and heard that using one base search with post-processing instead of several similar queries is cost-effective, and that we can save SVCs (Splunk Virtual Compute units) with it. In practice, unfortunately, I have experienced quite the opposite:

Let’s say, I have a dashboard (call it “A”) with these queries:

index="myIndex" "[OPS] [INFO] event=\"asd\"" | where user_id != "0" AND is_aaaaa_login="true" AND environment="prod" AND result="Successful" | stats dc(user_id) as "Unique users, who has logged ..."
index="myIndex" "[OPS] [INFO] event=\"asd\"" | where user_id != "0" AND is_aaaaa_login="true" AND environment="prod" AND result="Successful" | timechart count by result
index="myIndex" "[OPS] [INFO] event=\"asd\"" | where user_id != "0" AND is_aaaaa_login="true" AND environment="prod" AND result="Successful" | dedup user_id | timechart span=1h count as "per hour"| streamstats sum("per hour") as "total"
index="myIndex" "[OPS] [INFO] event=\"asd\"" | where user_id != "0" AND is_aaaaa_login="true" AND environment="prod" AND result="Successful" | timechart dc(user_id) as "Unique users"
index="myIndex" "[OPS] [INFO] event=\"asd\"" | where user_id != "0" AND is_aaaaa_login="true" AND environment="prod" AND result="Failed" AND reason != "bbb" | timechart count by reason

I cloned this “A” dashboard (let’s call the clone “B”).

I ran into some issues at first, such as getting no data, or numbers on “B” that differed from those on “A”. After some googling and reading the Splunk community, I managed to get the same results on “B” with:

A base search:

index="myIndex" "[OPS] [INFO] event=\"asd\"" | stats count by user_id is_aaaaa_login environment result reason _time

Post-processes:

search | where user_id != "0" AND is_aaaaa_login="true" AND environment="prod" AND result="Successful" | stats dc(user_id) as "Unique users, who has logged ..."
search | where user_id != "0" AND is_aaaaa_login="true" AND environment="prod" AND result="Successful" | timechart count by result
search | where user_id != "0" AND is_aaaaa_login="true" AND environment="prod" AND result="Successful" | dedup user_id | timechart span=1h count as "per hour"| streamstats sum("per hour") as "total"
search | where user_id != "0" AND is_aaaaa_login="true" AND environment="prod" AND result="Successful" | timechart dc(user_id) as "Unique users"
search | where user_id != "0" AND is_aaaaa_login="true" AND environment="prod" AND result="Failed" AND reason != "bbb" | timechart count by reason

I added refresh="180" to the top of both dashboards and left them open in my browser for about one hour (with the common date picker set to “last 24 hours”). Afterwards, I was surprised to see in “Splunk App for Chargeback” that dashboard “A” consumed around 5 SVCs while dashboard “B” used around 15 SVCs. So the dashboard with the base search was far more expensive than the “normal” one. I had expected it to be much cheaper.

Why is that? Did I construct my base/post-process queries badly? If yes, what should I change?

I searched a lot, I found only one comment on Splunk community here:

https://community.splunk.com/t5/Dashboards-Visualizations/Base-Search-for-dashboard-optimization/m-p...

“However, I do not recommend it when dealing with large data because base search is slow.” This implies that a base search is maybe not always the cheaper solution. So I executed only my base search in Splunk over a 24-hour interval, and it returned a table with around 3,000,000 rows. Does this count as a large data set? Should I forget about using base searches?

Thank you very much for your help!

bowesmana · ‎07-12-2022

Your base search is not really reducing the data set: it aggregates by all 6 fields, including _time, so the base search result set will likely be as large as the entire data set. It's quite possible that this is not a good use case for a base search.

Of your 5 post process searches, you have 1 stats, 3 'unspanned' timecharts and 1 spanned (1h) timechart.

2 of those timecharts are simple ones

count by result (nb: result is redundant as it's always Successful)
dc(user_id)

As ITWhisperer says, the additional filters (user_id, is_aaaaa_login, environment) should also be part of the base search. Is there a reason why they are not, and how many of the 3 million events are included unnecessarily?

You may be better off having more than one search, e.g. one for the timechart where you're counting events and unique users. Note that you can also include the user_id VALUES so that you can then do the subsequent stats dc(user_id). That means a single base search can feed 3 post-process searches without requiring a big data set:

<search id="base">
  <query>
index="myIndex" "[OPS] [INFO] event=\"asd\"" user_id != "0" is_aaaaa_login="true" environment="prod" result="Successful" 
| timechart count dc(user_id) as users values(user_id) as user_ids
  </query>
</search>

Note that you can include the filters from the 'where' clause as part of the original search

The two timecharts would then look like this

<search base="base">
  <query>
| fields _time count
  </query>
</search>

<search base="base">
  <query>
| fields _time users
  </query>
</search>

and the stats one would be

<search base="base">
  <query>
| stats dc(user_ids) as user_ids
  </query>
</search>

You are then left with the 1h timechart and the failed results timechart, which could be their own search.
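
As a rough sketch only, those two remaining searches might look like the following as standalone searches, with the 'where' filters moved into the initial search in the same way as above. This is just the original poster's queries rearranged and has not been run against real data.

```xml
<!-- Sketch: cumulative unique users per hour, as its own search -->
<search>
  <query>
index="myIndex" "[OPS] [INFO] event=\"asd\"" user_id!="0" is_aaaaa_login="true" environment="prod" result="Successful"
| dedup user_id
| timechart span=1h count as "per hour"
| streamstats sum("per hour") as "total"
  </query>
</search>

<!-- Sketch: failed logins by reason, as its own search -->
<search>
  <query>
index="myIndex" "[OPS] [INFO] event=\"asd\"" user_id!="0" is_aaaaa_login="true" environment="prod" result="Failed" reason!="bbb"
| timechart count by reason
  </query>
</search>
```

Filtering in the initial search rather than in a later 'where' lets the indexers discard non-matching events early, which is the same reason the filters were moved in the base search.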


DG · ‎07-13-2022

Thank you very much! Both solutions (loadjob from ITWhisperer and base search from bowesmana) worked and saved SVCs for us. We have to measure a few more times to get an accurate picture of exactly how much, but one time the saving was 75% and another time around 40%. I'm quite new here, can I accept both as solutions?

bowesmana · ‎07-13-2022

Great that you got some good savings - as for accepting two solutions, you can only accept one, so choose wisely 😀😀

DG · ‎07-13-2022

Your answer is more detailed and I got more explanations, so I accepted yours. 🙂

However, it seems that on average we gain more with the loadjob solution. I don't know why the SVC consumption figures are so different; I'm running the default dashboard and these two solutions with the same "refresh=180" attribute for one hour.

bowesmana · ‎07-13-2022

Interesting - I've not used the idea of a pseudo base search and then using loadjob to post process.

I guess that using a single base search, split the way you have it, and then using loadjob still amounts to running just a single search, hence the improved savings.

ITWhisperer · ‎07-13-2022

Without having done too much investigation, I have found that base searches sometimes work more like a shorthand for the first part of the search. By that I mean that when the post-processing search needs to execute, it executes the base search and then the post-processing.

If your dashboard uses the same start to the search in a number of places, rather than writing multiple copies of the search, you can write it once and use it multiple times; it is a bit like extending a class in the object-oriented paradigm.

Now, this may be down to the searches I have been using and whether they can be executed on indexers or have to come back to the search head(?) - as I said, I haven't investigated the detail of this, and don't have a detailed understanding of how this all works.

Once I had found out how to use loadjob to gain performance, I didn't bother investigating further.

There is a caveat to this approach. The results from the base query do not hang around forever, so the sid may become stale (and not return any results), in which case the base search is executed again (generating a new sid).

In some circumstances, the way to get around this is to use saved reports. This assumes that you have a saved report that covers the time period you need for your dashboard. For example, I have a number of dashboards which are based on the past week or month. These results don't change during the day, so I can run a report in the early hours and then load these results throughout the day without having to search the whole month every time the dashboard loads. This is a tremendous boost to dashboard performance, although not applicable in every case. (Perhaps I should consider finding or doing a BSides presentation on this, as I don't think this short piece has done the topic justice 😀)
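
A hedged sketch of what loading a saved report's results could look like in a dashboard search. The owner (some_owner), app (search), report name (Monthly login summary) and the count field are all hypothetical placeholders, and the sketch assumes the report is scheduled so that a recent result set exists to load.

```xml
<search>
  <query>
| loadjob savedsearch="some_owner:search:Monthly login summary"
| timechart span=1d sum(count) as logins
  </query>
</search>
```

The savedsearch argument takes the form owner:app:report name, so the dashboard reads the report's stored results instead of searching the whole period again.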

DG · ‎07-22-2022

I got it! Thank you very much!! 🙂

DG · ‎07-13-2022

"including _time, so it's likely the base search result set will be the entire data set" -> yes, now I think so, too.

I included _time because my post-processing searches displayed errors or invalid data, so I googled and read, for example here:

https://community.splunk.com/t5/Splunk-Search/Is-it-possible-to-create-Time-chart-with-search-with-b...

that I should try using "| fields *", or "stats count by _time", as suggested here:

https://community.splunk.com/t5/Splunk-Search/help-on-base-search-event-limit/m-p/574058#M200053

"As ITWhisperer says, the additional filters (user_id, is_aaaaa_login, environment) should also be part of the base search. Is there a reason why not and how many of the 3million events are included unnecessarily?"

-> Totally true, my mistake. I did not realize that they can be part of the base search.

ITWhisperer · ‎07-12-2022

The common part of your searches appears to be this

index="myIndex" "[OPS] [INFO] event=\"asd\"" | where user_id != "0" AND is_aaaaa_login="true" AND environment="prod"

You could try creating a "base" search with this and then, in the done handler, save the job sid in a token.

Then in subsequent searches, you use loadjob to load the result set of the base search and apply further filtering (result = x or y) and your stats calculations.
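
A minimal Simple XML sketch of this idea, for illustration only. The token name base_sid, the fixed 24-hour time range and the trailing table command are assumptions added here, not something taken from the thread. The done handler and $job.sid$ are standard Simple XML, and loadjob accepts a search id.

```xml
<dashboard>
  <label>loadjob sketch (hypothetical)</label>
  <!-- Runs once per dashboard load; stores its job sid in the base_sid token -->
  <search>
    <query>
index="myIndex" "[OPS] [INFO] event=\"asd\"" user_id!="0" is_aaaaa_login="true" environment="prod"
| table _time user_id result reason
    </query>
    <earliest>-24h</earliest>
    <latest>now</latest>
    <done>
      <set token="base_sid">$job.sid$</set>
    </done>
  </search>
  <row>
    <panel>
      <single>
        <!-- Waits until base_sid is set, then reuses the finished job's results -->
        <search>
          <query>
| loadjob $base_sid$
| where result="Successful"
| stats dc(user_id) as "Unique users"
          </query>
        </search>
      </single>
    </panel>
    <panel>
      <chart>
        <search>
          <query>
| loadjob $base_sid$
| where result="Failed" AND reason!="bbb"
| timechart count by reason
          </query>
        </search>
      </chart>
    </panel>
  </row>
</dashboard>
```

Because each panel search references $base_sid$, it does not start until the done handler has set the token. Whether the table step is needed, and how many rows loadjob can comfortably return, depends on the data volume and should be checked in your own environment.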
