I am using Splunk Enterprise Version 8.0.5.1
Consider an index with half a million events being generated every day.
There are three fields in the index I am particularly interested in.
sourcetype has 20 different values, but I am interested in one sourcetype that accounts for 10,000 events per day.
objectid is populated on every event, and there are multiple events for the same objectid.
versionid can have two different values: 1 or 2. Objects move from 1 to 2, but never back to 1.
Ideally, by the end of the query period there should be no objectids still left on versionid 1.
For those objectids that have an event with versionid 1, I want to know when they changed to a 2.
The problem is that there are so many 2s in the index that querying all of them just to join them to the 1s takes forever and generates a search job of over 1 GB.
So what I'd really like to do is query the 1s first, and then feed that list of objectids into a second search that only returns the rows whose objectid appears in the results of that first search.
When I try it with a join, it just takes forever:
index=myindex sourcetype=mysourcetype versionid=1
| reverse | table _time objectid | dedup objectid
| join objectid [ index=myindex sourcetype=mysourcetype versionid=2 | reverse | eval fixed=_time | table fixed objectid | dedup objectid ]
| table _time objectid fixed
What I want is to reference the results of the first query in an IN statement inside the second, but I can't find a way to do that. If I could create a dashboard with a base search, and have a panel use the results of that base search in an IN statement, that might work, but at the moment I am stuck.
I know that Splunk is not SQL, but to make it a bit clearer what I am trying to achieve...
SELECT *
FROM MYINDEX
WHERE OBJECTID IN (SELECT OBJECTID FROM MYINDEX WHERE VERSIONID=1)
AND VERSIONID=2
i.e. it evaluates the versionid=1 objectids first, and then the outer query only returns rows that match those ids. When I do this manually, with a small number of ids in an explicit IN clause, it runs very quickly.
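For example, a hand-built version along these lines (the objectid values here are just made up) runs very quickly:
index=myindex sourcetype=mysourcetype versionid=2 objectid IN (1001, 1002, 1003)
| reverse | eval fixed=_time | table fixed objectid | dedup objectid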
Any suggestions would be much appreciated.
The best I can come up with so far is the following...
index=myindex sourcetype=mysourcetype versionid=1
| reverse | table _time objectid | dedup objectid
| join objectid
[ search index=myindex sourcetype=mysourcetype versionid=2
[ search index=myindex sourcetype=mysourcetype versionid=1 | dedup objectid | table objectid ]
| reverse | eval fixed=_time | table fixed objectid | dedup objectid ]
| table _time objectid fixed
This seems a bit expensive as the outer query is being executed twice.
Does anyone have any better ideas?
Thanks,
Shane
Hi @shanebough,
Please try the following:
index=myindex sourcetype=mysourcetype versionid=2
[ index=myindex sourcetype=mysourcetype versionid=1
| stats count by objectid
| fields objectid ]
Thank you for the suggestion. Unfortunately, this is the longest-running version so far. I also don't see which part of it restricts the results to only the rows matching the objectids from the initial versionid=1 query. How is it supposed to achieve that?
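The way that search is intended to work (with default subsearch behaviour) is that the bracketed search runs first, and its list of objectid values is expanded into an implicit OR filter on the outer search, so the outer search should only retrieve versionid=2 events for those objectids. You can see the filter a subsearch would generate by running it on its own and appending format, for example:
index=myindex sourcetype=mysourcetype versionid=1
| stats count by objectid
| fields objectid
| format
One thing to bear in mind is that subsearches are subject to result and runtime limits, so if the versionid=1 search returns a very large number of distinct objectids, the generated filter gets truncated (you will see an info message about maxout when that happens).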
Hi
I think you will get the idea from this:
| makeresults
| eval _raw = "time,objectid,versionid
1616661604,123,1
1616662604,124,1
1616663604,122,1
1616664604,123,2
1616665604,125,1
1616666604,124,2
1616667604,124,2"
| multikv forceheader=1
| eval _time=time
``` Generate test data ```
| stats values(versionid) as versions range(_time) as duration by objectid
| where mvcount(versions) > 1
| eval duration = tostring(duration, "duration")
r. Ismo
Thanks for the suggestion, but it takes much longer, unfortunately, because it still evaluates every event with versionid=2 instead of bringing back the versionid=1 objectids first and then filtering on those.
Have you tried these proposals in your environment, or have you just decided that they didn't work?
I have evaluated them with our data (of course not exactly the same as yours, but the volume and cardinality should be comparable).
With your example:
This search has completed and has returned 757 results by scanning 73,213 events in 7.786 seconds
The following messages were returned by the search subsystem:
info : [subsearch]: Subsearch produced 68838 results, truncating to maxout 50000.
===> not a correct result
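That truncation comes from the subsearch limits in limits.conf; on this search head maxout has already been raised from the default of 10000 to 50000, i.e. something like:
[subsearch]
maxout = 50000
maxtime = 60
So with data volumes like yours, the subsearch-based variant starts dropping objectids once the versionid=1 set goes past that limit, which is why the result count is wrong.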
With mine:
This search has completed and has returned 1,169 results by scanning 141,222 events in 3.862 seconds
And when I used smaller samples, where the subsearch actually works, the execution times were comparable.
Yours vs. mine: 1.335 s vs 0.324 s
Both queries return the same number of events.
r. Ismo
I said it takes longer, so I could only have known that from testing, right?
But I can also see from the query that there is nothing in it that achieves what I have specifically asked for. Although it may produce the same dataset, it still does not do it in an efficient way. The query uses basic Splunk functions that I am very familiar with and have tried in many different ways. The specific ask here is how to quickly retrieve JUST the rows that match the ids returned by the first query.
I know that Splunk is not SQL, but to make it a bit clearer what I am trying to achieve...
SELECT *
FROM MYINDEX
WHERE OBJECTID IN (SELECT OBJECTID FROM MYINDEX WHERE VERSIONID=1)
AND VERSIONID=2
I have updated the original post with this clarity. Hope it helps.
Thanks for the link to a very good description of joins. I am very familiar with all the join options and chose the inner join because I would expect it to bring back only the rows where there is a matching id on both sides. However, Splunk still returns all 5 million rows (or whatever the count is for the various periods I have tried), rather than just the 200 rows for the ids I am interested in, which I can get back quickly if I put the long list of ids in the query explicitly. But I can't specify the list up front, because I don't know which ids are still going to be on versionid=1.
How can I force the second query to bring back only the ids from the first query? I am worried that I am missing an obvious option on the join (I have checked the Splunk docs: https://splk.it/3smaaNf) that would allow this filtering to happen.
I appreciate the time you are spending on this. Many thanks.
Can you try this one:
index=myindex sourcetype=mysourcetype versionid IN (1, 2)
| fields _time objectid versionid <any other fields you need>
| stats range(_time) as duration values(*) as * by objectid
| where mvcount(versionid) > 1 AND isnotnull(mvfind(versionid, "^1$")) AND isnotnull(mvfind(versionid, "^2$"))
| eval duration = tostring(duration, "duration")
This works for me.
Thanks again! 🙂
This also takes a long time to run, basically because the very first statement pulls back everything: it is pulling back all the 2s when I only want the 2s that have a 1. Forget all the other processing after that statement to calculate the duration (although there is some interesting syntax in there that I haven't used before and am going to investigate separately, thank you). If you can trim the query down to just the first statement and get it to return just the 1s and the 2s that have a 1, without first pulling back all the 2s and then filtering afterwards, that will solve my problem. I am guessing there is no way to achieve that, as no one on any forum seems to have a solution.
Try something like:
index=myindex sourcetype=mysourcetype
| stats first(_time) as _time by objectid versionid
| eval fixed=if(versionid=2,_time,null())
| stats first(_time) as _time values(fixed) as fixed by objectid
| eval fixed=if(isnotnull(fixed), strftime(fixed,"%Y-%m-%d %H:%M:%S"),"Not fixed")
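If you only want the objectids that actually have a versionid=1 event in the period, a possible tweak (just a sketch, I have not run this against your data) is to flag the 1s and filter on that flag after the second stats:
index=myindex sourcetype=mysourcetype
| stats first(_time) as _time by objectid versionid
| eval fixed=if(versionid=2,_time,null())
| eval had_v1=if(versionid=1,1,0)
| stats first(_time) as _time values(fixed) as fixed max(had_v1) as had_v1 by objectid
| where had_v1=1
| eval fixed=if(isnotnull(fixed), strftime(fixed,"%Y-%m-%d %H:%M:%S"),"Not fixed")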
Thanks for the suggestion, but that takes much longer, unfortunately.
I am really looking for some way to quickly return ONLY the versionid=2 events for the subset of objectids that have any versionid=1 events.
Thanks