Solved: Pattern: loopable lookup table to bypass map subse...

MonkeyK · ‎11-25-2017

This is not so much a question. But I don't see a solutions.splunk.com, so I will post this solution here, improve it with any feedback, and post the improved pattern as an answer.

Anti-pattern: map is a subsearch, so mapped searches are limited in the results that they can return.

example: https://answers.splunk.com/answers/589791/can-i-do-kmeans-by-a-for-a-column-by-a-column.html

pattern: loopable lookup table, create a lookup table that will provide a single successive value with each run.

how-to
1) create a table of values that you would like to loop on, with columns for loopable value and status. Status needs to be something that can be compared against. For example, all destination ports visited in a time period

| tstats summariesonly=t first(All_Traffic.dest_port) as port 
from datamodel=Network_Traffic.All_Traffic 
where All_Traffic.session_id!=0 All_Traffic.dest_zone=external
by All_Traffic.dest_port 
| eval status="A" | fields + port status
| outputlookup test_loop_ports.csv

2) Do some work using the minimum value from the loopable table and update the minimum value's status.

 tstats summariesonly=t count 
from datamodel=Network_Traffic.All_Traffic 
where [|inputlookup test_loop_ports.csv where NOT status=done | stats min(port) as All_Traffic.dest_port] 
by All_Traffic.dest_port 
| outputlookup append=t test_port_count.csv | eval status="done", port='All_Traffic.dest_port' 
| append [|inputlookup test_loop_ports.csv] 
| stats max(status) as status by port 
| outputlookup test_loop_ports.csv

How will this look?
lets say that the first query returns ports 0,53,80.443, so test_loop_ports.csv looks like

port  status
0       A
53      A
80      A
443     A

On the first run of the second query, I get a new table and a modified test_loop_ports.csv
test_loop_ports:

port  status
0       done
53      A
80      A
443     A

test_port_count.csv

port  count
0     10532

on the second run of the second query, port 0 has a status of "done", so the new minimum port is 53. After this run, our tables look like:
test_loop_ports:

port  status
0       done
53      done
80      A
443     A

test_port_count.csv

port  count
0     10532
53    96103

and so on until all statuses are done
test_loop_ports:

port  status
0       done
53      done
80      done
443     done

test_port_count.csv

port  count
0     10532
53    96103
80    909
443   1201

The two problems that I have with this process are that

I need to schedule the queries in a way that guarantees all values are processed. To do this I need to know about how long each will run and then schedule based on that guess.
Since the loopable lookup table is updated at the end, I cannot start the next query until the first one is done (otherwise I get the same minimum value on the second query). This means that I loose the optimization of running queries in parallel.

woodcock · ‎11-25-2017

I like your style. You may not realize this but multisearch does not use subsearches and so is not limited by their limits. So I would try something like this instead, so that it can run everything at the same time and not depend on a lookup file. I tested it for 2 "loops" so it should run just fine for many more than that. In this contortion, map is used only for it's templating function which allows us to set any number of f# fields and values that are then passed as variables to each lower search. The only thing is that you really need to know how many passes in advance (or maybe you can take this approach and extend it generically).

| makeresults 
| eval looper=mvrange(1,3) 
| mvexpand looper 
| streamstats count AS foo 
| eval looper = "f" . looper
| xyseries _time looper foo
| map search="|multisearch 
        [ search index=_* $f1$ | eval run=\"1:$f1$\" ] 
        [ search index=_* $f2$ | eval run=\"2:$f2$\" ]"

Actually, now that I re-think this, I am sure that the multisearch being inside of the map will cause the entirety of the merged results to be limited by the limit on map overall. So this will not work. It will have to be 2 parts. The first part sets the search details like this:

| makeresults 
| eval port=mvrange(1,3) 
| mvexpand port
| fields - _time
| outputlookup eraseme.csv

Then the second parts runs them all at the same time like this:

| multisearch 
        [ search index=_* [| inputlookup eraseme.csv |  head 1 |  tail 1 |  return $port] | eval run="1" ] 
        [ search index=_* [| inputlookup eraseme.csv |  head 2 |  tail 1 |  return $port] | eval run="2" ]

But then again, if this is what we are after, then why is a vanilla subsearch like this insufficient?

index=_* [| inputlookup eraseme.csv |table port]

View solution in original post

woodcock · ‎11-25-2017

I like your style. You may not realize this but multisearch does not use subsearches and so is not limited by their limits. So I would try something like this instead, so that it can run everything at the same time and not depend on a lookup file. I tested it for 2 "loops" so it should run just fine for many more than that. In this contortion, map is used only for it's templating function which allows us to set any number of f# fields and values that are then passed as variables to each lower search. The only thing is that you really need to know how many passes in advance (or maybe you can take this approach and extend it generically).

| makeresults 
| eval looper=mvrange(1,3) 
| mvexpand looper 
| streamstats count AS foo 
| eval looper = "f" . looper
| xyseries _time looper foo
| map search="|multisearch 
        [ search index=_* $f1$ | eval run=\"1:$f1$\" ] 
        [ search index=_* $f2$ | eval run=\"2:$f2$\" ]"

Actually, now that I re-think this, I am sure that the multisearch being inside of the map will cause the entirety of the merged results to be limited by the limit on map overall. So this will not work. It will have to be 2 parts. The first part sets the search details like this:

| makeresults 
| eval port=mvrange(1,3) 
| mvexpand port
| fields - _time
| outputlookup eraseme.csv

Then the second parts runs them all at the same time like this:

| multisearch 
        [ search index=_* [| inputlookup eraseme.csv |  head 1 |  tail 1 |  return $port] | eval run="1" ] 
        [ search index=_* [| inputlookup eraseme.csv |  head 2 |  tail 1 |  return $port] | eval run="2" ]

But then again, if this is what we are after, then why is a vanilla subsearch like this insufficient?

index=_* [| inputlookup eraseme.csv |table port]

MonkeyK · ‎11-25-2017

Whoa. multisearch invalidates my entire premise. The templating of map seemed nice until I understood that it was a subsearch.

Then I could multisearch on the 20 values.

But your second approach with just a straight up multisearch is easier to understand, I could set up my searches as

|mulitsearch
  | bigger search [|inputlookup test_loop_ports.csv | streamstats count as row|eventstats count as totalcount | head 1| tail 1 | where row<=totalcount | eval port=coalesce(port,"NA") | fields + port]
  | bigger search [inputlookup test_loop_ports.csv | streamstats count as row|eventstats count as totalcount | head 2| tail 1 | where row<=totalcount | eval port=coalesce(port,"NA") | fields + port]

(or even make a get_csvrow macro to be a little cleaner about it)

And then include enough multisearch clauses to exceed the number of ports that I could imagine showing up.

As to the final question, what is driving me is the need to do kmeans clustering by port. If I just feed kmeans all of the ports, then I do not get separate kmeans clustering for each port. This may apply to other outlier/clustering type searches as well

woodcock · ‎11-26-2017

So you have what you really need, right?

MonkeyK · ‎11-26-2017

of course, you are correct! Thank you for the new command to use

and the head|tail pattern

MonkeyK · ‎11-26-2017

Turns out that multisearch doesn't work for me. Multisearch only accepts streaming commands, I think that my eventstats commands do not comply, so I get:

Error in 'multisearch' command:
Multisearch subsearches may only contain purely streaming operations
(subsearch 1 contains a non-streaming command.)

Fortunately, I can use the head|tail thing and just schedule 30 searches accepting that 10 or so may have no results when I search for port="NA"

woodcock · ‎11-26-2017

You should be able to put your non-streaming command at the very end, outside of your multisearches, since you are surely doing the sameish thing. If you need to do a differentish thing, then you can tag each multisearch dataset uniquely and differentatiate on the outside at the end.

MonkeyK · ‎11-26-2017

Unfortunately, the initial part of my search uses tstats on the Network_Traffic datamodel.

Splunk says that this is non-streaming.

woodcock · ‎11-27-2017

The tstats command supports append=t for prestats mode and that function is unlimited, too, so just use |tstats append=t prestats and go crazy.

woodcock · ‎11-26-2017

Beware the blind use of the head/tail. Unless you know FOR SURE exactly how many rows there are, you will need better mojo because , for example, using a 10-row lookup, |head 10|tail 1, which is the last valid row, looks no different than |head 20|tail 1, which would be a duplicated run.

MonkeyK · ‎11-26-2017

That is where the macro comes in handy:

Pattern: loopable lookup table to bypass map subsearch limits - looking for feedback

Webinar Recap | Revolutionizing IT Operations: The Transformative Power of AI and ML ...

.conf24 | Registration Open!

ICYMI - Check out the latest releases of Splunk Edge Processor