<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Is it possible to use appendcols without multiple searches to improve efficiency? in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/Is-it-possible-to-use-appendcols-without-multiple-searches-to/m-p/182138#M52517</link>
    <description>&lt;P&gt;Thank you for this - my query is now a lot neater and faster.&lt;/P&gt;

&lt;P&gt;The purpose of the dedup was to remove duplicate entries for the same dir/corner/path/name (dcpn) set. It should have been included in every search, but I don't think my data contained duplicate (dcpn) entries yet. Using a single query as you've described, I should be able to apply the dedup once at the start to cover the entire query.&lt;/P&gt;

&lt;P&gt;Thanks.&lt;/P&gt;</description>
    <pubDate>Thu, 14 Aug 2014 18:03:29 GMT</pubDate>
    <dc:creator>asherman</dc:creator>
    <dc:date>2014-08-14T18:03:29Z</dc:date>
    <item>
      <title>Is it possible to use appendcols without multiple searches to improve efficiency?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Is-it-possible-to-use-appendcols-without-multiple-searches-to/m-p/182134#M52513</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;I am trying to build a table of results, and my query has the form:&lt;/P&gt;

&lt;P&gt;search X&lt;BR /&gt;
|dedup block&lt;BR /&gt;
|stats count by block&lt;BR /&gt;
|appendcols [search X | stats ... | eval ... | chart ...]&lt;BR /&gt;
|appendcols [search X | stats ... | eval ... | chart ...]&lt;BR /&gt;
|appendcols [search X | stats ... | eval ... | chart ...]&lt;/P&gt;

&lt;P&gt;Each subsearch runs the same base search; only the subsequent operations differ. I can get the table I want, but as my data grows the query's runtime is increasing substantially. Is it possible for me to save the search results (in effect, a variable holding the output of a given query) so that I can reuse them and avoid the duplicate searches? Alternatively, is there a better command for this purpose than appendcols?&lt;/P&gt;

&lt;P&gt;Thanks&lt;/P&gt;

&lt;P&gt;Edit:&lt;BR /&gt;
The full query is:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=test_index3 base="*" corner=*100c*
| dedup dir corner path name
| stats count as paths by block
| appendcols [search index=test_index3 base="*" corner=*100c* 
    | stats count(eval(isNum(base))) as num_vals, count as num_paths by block
    | eval percent=round(num_vals*100/num_paths,2)
    | chart max(percent) as "num_vals" by block]
| appendcols [search index=test_index3 base="*" NOT base="*nan*" corner=*100c* 
    | stats count(eval((base-min)&amp;gt;0 AND (base-max)&amp;lt;0)) AS bounded, count AS num_vals by block
    | eval percent=round(bounded*100/num_vals,2)
    | chart max(percent) as "bounded paths" by block]
| appendcols [search index=test_index3 base="*" NOT base="*nan*" corner=*100c* 
| stats count(eval((base-max)&amp;lt;0)) AS bounded, count AS num_vals by block
    | eval percent=round(bounded*100/num_vals,2)
    | chart max(percent) as "max bounded paths" by block]
| appendcols [search index=test_index3 base="*" NOT base="*nan*" corner=*100c* 
    | stats count(eval((base-min)&amp;gt;0)) AS bounded, count AS num_vals by block
    | eval percent=round(bounded*100/num_vals,2)
    | chart max(percent) as "min bounded paths" by block]
| fields block, paths, num_vals, "bounded paths", "max bounded paths", "min bounded paths" 
| rename num_vals AS "% data"
| rename "bounded paths" AS "% of data bounded"
| rename "max bounded paths" AS "% of data bounded by max"
| rename "min bounded paths" AS "% of data bounded by min"
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 14 Aug 2014 14:55:06 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Is-it-possible-to-use-appendcols-without-multiple-searches-to/m-p/182134#M52513</guid>
      <dc:creator>asherman</dc:creator>
      <dc:date>2014-08-14T14:55:06Z</dc:date>
    </item>
    <item>
      <title>Re: Is it possible to use appendcols without multiple searches to improve efficiency?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Is-it-possible-to-use-appendcols-without-multiple-searches-to/m-p/182135#M52514</link>
      <description>&lt;P&gt;Chances are you can do away with the subsearches &lt;EM&gt;and appendcols&lt;/EM&gt; entirely. Post the full search and we'll see.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Aug 2014 15:56:46 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Is-it-possible-to-use-appendcols-without-multiple-searches-to/m-p/182135#M52514</guid>
      <dc:creator>martin_mueller</dc:creator>
      <dc:date>2014-08-14T15:56:46Z</dc:date>
    </item>
    <item>
      <title>Re: Is it possible to use appendcols without multiple searches to improve efficiency?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Is-it-possible-to-use-appendcols-without-multiple-searches-to/m-p/182136#M52515</link>
      <description>&lt;P&gt;Sure, done.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Aug 2014 16:06:58 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Is-it-possible-to-use-appendcols-without-multiple-searches-to/m-p/182136#M52515</guid>
      <dc:creator>asherman</dc:creator>
      <dc:date>2014-08-14T16:06:58Z</dc:date>
    </item>
    <item>
      <title>Re: Is it possible to use appendcols without multiple searches to improve efficiency?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Is-it-possible-to-use-appendcols-without-multiple-searches-to/m-p/182137#M52516</link>
      <description>&lt;P&gt;First off, &lt;CODE&gt;corner=*100c*&lt;/CODE&gt; usually is quite inefficient because of the leading wildcard. If the number of events scanned vs the number of events matched is high then you may be able to speed everything up here. How depends on your data, look for a way to identify relevant events based on full tokens or at least prefixes of tokens, ie without leading wildcards.&lt;/P&gt;

&lt;P&gt;Looking at the search itself, here's how I'd start by combining the first two parts.&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;(1) index=test_index3 base="*" corner=*100c* | dedup dir corner path name | stats count as paths by block
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;The &lt;CODE&gt;dedup&lt;/CODE&gt; isn't present in the other searches, so let's avoid using it. This basically counts how many distinct quads of &lt;CODE&gt;(dir, corner, path, name)&lt;/CODE&gt; you have for each value of &lt;CODE&gt;block&lt;/CODE&gt;... so you can replace (1) with this:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;(2) index=test_index3 base="*" corner=*100c* | eval dcpn = dir ."###". corner ."###". path ."###". name | stats dc(dcpn) as paths by block
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Now let's look at the first subsearch:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;(3) index=test_index3 base="*" corner=*100c* | stats count(eval(isNum(base))) as num_vals, count as num_paths by block | eval percent=round(num_vals*100/num_paths,2) | chart max(percent) as "num_vals" by block
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;That final chart looks useless to me. There is only one row for each block, so it basically does the same as &lt;CODE&gt;fields - num_paths | rename percent as num_vals&lt;/CODE&gt;.&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;(4) index=test_index3 base="*" corner=*100c* | stats count(eval(isNum(base))) as num_vals, count as num_paths by block | eval num_vals=round(num_vals*100/num_paths,2) | fields - num_paths
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;When you compare (2) and (4), the searches basically do the same thing: load the same events and compute some statistics. That means you can combine them like this:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;(5) index=test_index3 base="*" corner=*100c* | eval dcpn = dir ."###". corner ."###". path ."###". name
| stats dc(dcpn) as paths, count(eval(isNum(base))) as num_vals, count as num_paths by block 
| eval num_vals=round(num_vals*100/num_paths,2) | fields - num_paths
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;One thing I can't judge without knowing your data: in (1) you dedup before counting the number of paths, but in (4) you don't. Is that on purpose? From here it looks inconsistent, but it might match your data and use case, so I left it as is.&lt;/P&gt;

&lt;P&gt;Using this strategy you should be able to combine the other subsearches into the main search as well.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Aug 2014 16:50:51 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Is-it-possible-to-use-appendcols-without-multiple-searches-to/m-p/182137#M52516</guid>
      <dc:creator>martin_mueller</dc:creator>
      <dc:date>2014-08-14T16:50:51Z</dc:date>
    </item>
    <item>
      <title>Re: Is it possible to use appendcols without multiple searches to improve efficiency?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Is-it-possible-to-use-appendcols-without-multiple-searches-to/m-p/182138#M52517</link>
      <description>&lt;P&gt;Thank you for this - my query is now a lot neater and faster.&lt;/P&gt;

&lt;P&gt;The purpose of the dedup was to remove duplicate entries for the same dir/corner/path/name (dcpn) set. It should have been included in every search, but I don't think my data contained duplicate (dcpn) entries yet. Using a single query as you've described, I should be able to apply the dedup once at the start to cover the entire query.&lt;/P&gt;

&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Aug 2014 18:03:29 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Is-it-possible-to-use-appendcols-without-multiple-searches-to/m-p/182138#M52517</guid>
      <dc:creator>asherman</dc:creator>
      <dc:date>2014-08-14T18:03:29Z</dc:date>
    </item>
    <item>
      <title>Re: Is it possible to use appendcols without multiple searches to improve efficiency?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Is-it-possible-to-use-appendcols-without-multiple-searches-to/m-p/182139#M52518</link>
      <description>&lt;P&gt;Ah, I see - then it does indeed makes sense to use &lt;CODE&gt;dedup&lt;/CODE&gt; for the entire search.&lt;/P&gt;

&lt;P&gt;Consider posting the finished search for others who may later find this question.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Aug 2014 18:18:22 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Is-it-possible-to-use-appendcols-without-multiple-searches-to/m-p/182139#M52518</guid>
      <dc:creator>martin_mueller</dc:creator>
      <dc:date>2014-08-14T18:18:22Z</dc:date>
    </item>
    <item>
      <title>Re: Is it possible to use appendcols without multiple searches to improve efficiency?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Is-it-possible-to-use-appendcols-without-multiple-searches-to/m-p/182140#M52519</link>
      <description>&lt;P&gt;Final result:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=test_index3 base="*" corner=100c*
| dedup dir corner path name hs_or_lp
| eval dcpn = dir ."###". corner ."###". path ."###". name
| stats dc(dcpn) as paths, 
     count(eval(isNum(base))) as num_vals, 
     count(eval((base-min)&amp;gt;0 AND (base-max)&amp;lt;0)) AS bounded, 
     count(eval((base-max)&amp;lt;0)) AS bounded_max, 
     count(eval((base-min)&amp;gt;0)) AS bounded_min by block
| eval percent=round(num_vals*100/paths,2)
| eval bounded_perc=round(bounded*100/num_vals,2)
| eval bounded_max_perc=round(bounded_max*100/num_vals,2)
| eval bounded_min_perc=round(bounded_min*100/num_vals,2)
| fields block, percent, bounded_perc, bounded_min_perc, bounded_max_perc 
| rename percent as "% of data"
| rename bounded_perc as "% of data bounded"
| rename bounded_max_perc as "% of data bounded by max"
| rename bounded_min_perc as "% of data bounded by min"
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 14 Aug 2014 21:54:02 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Is-it-possible-to-use-appendcols-without-multiple-searches-to/m-p/182140#M52519</guid>
      <dc:creator>asherman</dc:creator>
      <dc:date>2014-08-14T21:54:02Z</dc:date>
    </item>
    <item>
      <title>Re: Is it possible to use appendcols without multiple searches to improve efficiency?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Is-it-possible-to-use-appendcols-without-multiple-searches-to/m-p/182141#M52520</link>
      <description>&lt;P&gt;Sure.&lt;/P&gt;

&lt;P&gt;By the way, what exactly is the &lt;CODE&gt;."###".&lt;/CODE&gt; doing? I'm not entirely clear on why that works.&lt;/P&gt;
      <pubDate>Thu, 14 Aug 2014 21:54:39 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Is-it-possible-to-use-appendcols-without-multiple-searches-to/m-p/182141#M52520</guid>
      <dc:creator>asherman</dc:creator>
      <dc:date>2014-08-14T21:54:39Z</dc:date>
    </item>
    <item>
      <title>Re: Is it possible to use appendcols without multiple searches to improve efficiency?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Is-it-possible-to-use-appendcols-without-multiple-searches-to/m-p/182142#M52521</link>
      <description>&lt;P&gt;The dots concatenate strings, and without knowing your data I added a separator to make fairly sure that concatenating the four fields doesn't mess things up. Here's an example of how things would mess up with unlucky data:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;d   c   p   n
1   2   3   45
1   2   34  5
&lt;/CODE&gt;&lt;/PRE&gt;
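
&lt;P&gt;Concretely, here's what the concatenation of those two rows produces with and without the separator (toy values from the table above):&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;without separator:   12345            12345            (collision)
with "###":          1###2###3###45   1###2###34###5   (distinct)
&lt;/CODE&gt;&lt;/PRE&gt;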

&lt;P&gt;Concatenating those two quads and then &lt;CODE&gt;dedup&lt;/CODE&gt;ing or &lt;CODE&gt;dc()&lt;/CODE&gt;ing would treat them as duplicates even though they clearly aren't - both would end up as &lt;CODE&gt;12345&lt;/CODE&gt;. Adding the separator makes such a collision virtually impossible, even without knowing the data.&lt;/P&gt;
      <pubDate>Thu, 14 Aug 2014 22:01:36 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Is-it-possible-to-use-appendcols-without-multiple-searches-to/m-p/182142#M52521</guid>
      <dc:creator>martin_mueller</dc:creator>
      <dc:date>2014-08-14T22:01:36Z</dc:date>
    </item>
  </channel>
</rss>

