Splunk Search

best/fastest way to match results from two sets of data in the same index

loganramirez
Path Finder

I have an index that is receiving JSON data from a HEC, but with 2 different data sets and about 2M events per day:

DS1
{guid:"a1b2",resourceId="enum",sourcenumber:"55512345678"}

DS2
{guid:"a1b2",resourceId="enum",disposition:"TERMINATED"}

Now, counting terminated is easy and fast; this runs in 1s for all of yesterday's calls:

index="my_data" resourceId="enum*" disposition="TERMINATED"
| stats count

But counting the TOP 10 TERMINATED sourcenumbers is not so fast; this takes almost 10m on the same interval (yesterday):

index="my_data" resourceId="enum*"
| stats values(*) as * by guid
| search disposition="TERMINATED"
| stats count by sourcenumber

I found some help before using subsearches, including the | format trick to pass in more than 10k values, but this still takes ~8m to run:

index="my_data" resourceId="enum*" NOT disposition=*
[
search index="my_data" resourceId="enum*" disposition="TERMINATED" | fields guid | format
]
| stats count by sourcenumber
| sort -count

The issue is that I need data from DS1 when it matches a guid from DS2, but I've learned that join isn't a good fit for Splunk (it's not SQL!).

Thoughts on the most optimized way to get the Top 10 of data in DS1 where certain conditions hold in DS2?
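
For reference, the naive join version I'm trying to avoid would look something like this (untested, just to show the shape - and the subsearch side still runs into the usual join/subsearch limits):

index="my_data" resourceId="enum*" disposition="TERMINATED"
| fields guid
| join type=inner guid
    [ search index="my_data" resourceId="enum*" sourcenumber=* | fields guid sourcenumber ]
| stats count by sourcenumber
| sort -count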

NOTE - I asked a similar question here, but can't figure out how to get the same method to work since it's not excluding, it's more 'joining' the data:
https://community.splunk.com/t5/Splunk-Search/What-s-best-way-to-count-calls-from-main-search-exclud...

As always, thank you!!!

bowesmana
SplunkTrust

Are you doing indexed extractions on the JSON? If so, you may be able to use tstats to avoid looking at the raw data. It could be possible to use tstats (with prestats) to get the TERMINATED data set and then the full guid set using tstats append=t, but cardinality may still be an issue - see below.
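
Untested, and only if guid, disposition and sourcenumber were all indexed fields, but the shape would be roughly:

| tstats prestats=t count where index=my_data disposition=TERMINATED by guid
| tstats prestats=t append=t values(sourcenumber) where index=my_data by guid
| stats count as terminated values(sourcenumber) as sourcenumber by guid
| where terminated > 0
| stats count by sourcenumber
| sort -count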

Do all events have resourceId="enum" in them? If so, adding resourceId="enum*" is unnecessary.

Of the 2M events, how many guids would you typically expect and how many sourcenumbers would you expect to see per guid?

How many indexers do you have in your deployment? If the cardinality is high for guid, then you are effectively returning much of the data to the search head for your 2M events.

You can also be more specific than values(*) as *, because you don't need resourceId.


loganramirez
Path Finder

What are "indexed extractions"? Doc link?

Never heard of tstats.

We have 5 streams of data coming in; the others are sip*, so yes, we need that enum*.

Of the 2M there will be 2M guids. 2M in data set 1, 2M in data set 2, and then similar numbers on the sip side (so another 2M + 2M + 2M). So 10M total for all streams (more or less).

sourcenumbers per guid definitely isn't 1:1, so I'd have to count up, but I'm guessing the Pareto principle applies, so like 20% of numbers make 80% of the calls.

Just the one indexer.

What does 'cardinality is high for the guid' mean?

How does "values(*) as *" help?

Thank you! Learning so much!

bowesmana
SplunkTrust

OK, so it looks like the main issue is probably values(*) as *, which is taking all values of all fields for each guid. As it seems you have JSON data containing arrays, there would appear to be more data in each event than just guid/resourceId/sourcenumber.

As for cardinality: what you are doing with stats values(*) as *, compared with @dtburrows3's version (which returns ONLY 2 collected fields), is collecting every field in your data.

If there is a 1:1 ratio of events to guid, then cardinality is high and you will effectively be returning EVERY single piece of the 10M events to the search head before it can then do the stats count by sourcenumber.

If there are 20 events per guid, then you will get a reduced event count sent to the SH, i.e. a lower cardinality, but with potentially 20 values per multivalue field.

So, with this statement, you are returning 3 discrete bits of info:

| stats max(eval(if(disposition=="TERMINATED", 1, 0))) as guid_terminated,
        values(sourcenumber) as sourcenumber
  by guid

  • guid
  • guid_terminated = 0 or 1, depending on whether that guid was terminated
  • sourcenumber - the values of sourcenumber
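
To make the max(eval(...)) part concrete: for a guid with one DS1 event and one DS2 TERMINATED event, the eval emits 0 for the DS1 event (disposition is null there, so the if() falls through to 0) and 1 for the DS2 event, and max() collapses those to guid_terminated=1, i.e. "this guid had at least one TERMINATED event".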

Indexed extractions

https://docs.splunk.com/Documentation/SplunkCloud/9.1.2308/Data/Aboutindexedfieldextraction
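
As a sketch (the sourcetype name here is just a placeholder - yours will differ), enabling them for JSON is a props.conf change on the ingest side, something like:

[your_json_sourcetype]
INDEXED_EXTRACTIONS = json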

Tstats

https://docs.splunk.com/Documentation/Splunk/9.1.1/SearchReference/tstats

A good document on how tstats/TERM/PREFIX can massively improve searches, but for JSON data it will not generally help unless indexed extractions are being made.

https://conf.splunk.com/files/2020/slides/PLA1089C.pdf

loganramirez
Path Finder

Ok, been looking at this.

I did not realize values(*) as * meant 'look at everything' - I had just seen it in other examples!

I see now that you can do something like this to get just the rows you want:

index="my_data" resourceId="enum*"
| stats values(sourcenumber) as sourcenumber values(destinationnumber) as destinationnumber by guid

Never did I imagine this would be FASTER, because I just thought the first line meant it would get all the data anyway. This is a huge help and I'm going to play with it for a while to get a feel for it (it appears to also remove my need to use mvdedup).

In reading the Indexed Extractions doc, it notes that you can extract at search time (which we are doing above) or 'add custom fields at index time', so the latter is what we're talking about doing. At first glance I don't see the performance bump from just extracting, because I will still have two separate streams (what I called DS1 and DS2).

I think what would be a boon for performance is if I could consolidate the two streams of data into a new stream that had just the 12 fields I need, but that feels like a different thread!

This is GREAT INFORMATION and THANK YOU SO MUCH!!!!


bowesmana
SplunkTrust

Don't immediately jump down the indexed fields route - there are uses, but most of the time search time extraction is sufficient. Adding indexed extractions will increase storage requirements for the data, as the raw data AND the extractions are both stored.

With recent developments in Splunk, the use of TERM(XX) in searches can hugely improve search times, as it does not have to look at the raw data to find hits; instead it will look at the "tsidx" files.
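
For example, your fast count search could probably use something like this - though it's worth confirming TERMINATED actually appears as a standalone token in your raw events, since TERM() only matches whole tokens between breaker characters:

index="my_data" TERM(TERMINATED) resourceId="enum*"
| stats count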


dtburrows3
Builder

Just out of curiosity, how fast does a search like this run?

index="my_data" resourceId="enum*" guid=* (disposition="TERMINATED" OR sourcenumber=*)
    | stats
        max(eval(if(disposition=="TERMINATED", 1, 0))) as guid_terminated,
        values(sourcenumber) as sourcenumber
            by guid
    | where 'guid_terminated'==1,
    | stats
        count as count
            by sourcenumber

I'm going to try and simulate your scenario with a dataset I have locally and see what can be done to speed it up.


loganramirez
Path Finder

Ok, I had to make some minor changes (things I left out of my original JSON to simplify), but YES, this did run faster!

What is the reason for it, though?

Seems like it avoids the (costly?)

| stats values(*) as * by guid

or passing in the guids from a subsearch.

I'll have to stare at that 'max(eval)' statement for a while to get it!

Here is my updated SPL and the changes I made (some renaming, mvdedup for multivalue field stuff, and removing a trailing comma you had after guid_terminated=1):

index="my_data" resourceId="enum*" (disposition.disposition="TERMINATED" OR "connections{}.left.facets{}.number"=*)
| rename "connections{}.left.facets{}.number" as sourcenumber
| eval sourcenumber=mvdedup(sourcenumber)
| rename disposition.disposition as disposition
| stats max(eval(if(disposition=="TERMINATED", 1, 0))) as guid_terminated,
        values(sourcenumber) as sourcenumber
            by guid
    | where 'guid_terminated'==1
    | stats
        count as count
            by sourcenumber
            | sort -count


dtburrows3
Builder

Oh dang, good catch with the trailing comma!

So I just tried to limit the initially selected events as much as possible in the base search with the additional filter of
(disposition.disposition="TERMINATED" OR "connections{}.left.facets{}.number"=*)

and then limited the stats aggregations to just the fields that are required for your downstream analysis and display.

Glad it's running faster!
