Splunk Search

Is there a way to use the join command effectively in Hunk?

ddrillic
Ultra Champion

We are trying to see whether the out-of-the-box join command works well in Hunk. We tried the following:

index="claims" | join prov_tin [search index=provider | eval prov_tin = tin]

In claims we have two billion events and in provider we have a couple of million events. The above command runs for some time and returns some results. However, it doesn't even produce a MapR job and therefore it's very slow and it just can't traverse and entire index.

Any ideas?

Tags (2)
0 Karma
1 Solution

martin_mueller
SplunkTrust
SplunkTrust

Yeah, this spins and spins because the subsearch has to finish before the main search can start - retrieving all the things from index=provider.

Read the two links I provided. A possible solution would be index=claims OR index=provider | eval ID = if(index="claims", prov_tin, tin) | stats values(field1) as field1 values(field2) as field2 ... by ID
Written that way, retrieving data from both indexes and the prestats computation are map-able.

If you're just searching for one subset of claims you could do this:

(index="claims" clm_aud_nbr="585478461201*") OR (index=provider [search index="claims" clm_aud_nbr="585478461201*" | stats count by prov_tin | rename prov_tin as tin | fields tin]) |  | stats values(field1) as field1 values(field2) as field2 ... by ID

The subsearch will retrieve all tins, allowing the search for index=provider to be scoped narrowly for just the tins you need. If you throw billions of events at this it'll still be slow of course.

View solution in original post

0 Karma

martin_mueller
SplunkTrust
SplunkTrust

Yeah, this spins and spins because the subsearch has to finish before the main search can start - retrieving all the things from index=provider.

Read the two links I provided. A possible solution would be index=claims OR index=provider | eval ID = if(index="claims", prov_tin, tin) | stats values(field1) as field1 values(field2) as field2 ... by ID
Written that way, retrieving data from both indexes and the prestats computation are map-able.

If you're just searching for one subset of claims you could do this:

(index="claims" clm_aud_nbr="585478461201*") OR (index=provider [search index="claims" clm_aud_nbr="585478461201*" | stats count by prov_tin | rename prov_tin as tin | fields tin]) |  | stats values(field1) as field1 values(field2) as field2 ... by ID

The subsearch will retrieve all tins, allowing the search for index=provider to be scoped narrowly for just the tins you need. If you throw billions of events at this it'll still be slow of course.

0 Karma

martin_mueller
SplunkTrust
SplunkTrust

I don't see why it shouldn't. Supply three OR'd sources of data, compute a "join field", run stats, filter.

0 Karma

ddrillic
Ultra Champion

Thank you Martin!!!

Based on what you suggested we did -

(index=claims calc_pd_amt!=0.0) OR (index=provider cos_prov_spcl_cd = 13)
| eval prov_tin = coalesce(tin, prov_tin)
| stats sum(calc_pd_amt) as total, values(cos_prov_spcl_cd) as cos_prov_spcl_cd by prov_tin 
| search total != 0.0 cos_prov_spcl_cd = 13

It works amazingly well with the two billion claims.

The client is now asking whether the solution would work with three tables as well. Their use case, is based on three tables...

0 Karma

martin_mueller
SplunkTrust
SplunkTrust

This inherently has to be either slow or impossible, both in Hunk and Splunk. You're asking for two entire indexes to be returned to the search head to then be joined - that's literally looking for trouble.

Instead, consider more Splunky ways of joining, see https://answers.splunk.com/answers/129424/how-to-compare-fields-over-multiple-sourcetypes-without-jo... or http://wiki.splunk.com/Virtual_.conf (March 2016).

What are you actually trying to achieve?

ddrillic
Ultra Champion

I'm trying to see whether the join command works at all in Hunk at this scale.

The following works for one claim and returns results immediately -

 index="claims"  clm_aud_nbr="585478461201*"

However,

index="claims"  clm_aud_nbr="585478461201*" | join prov_tin [search index=provider | eval prov_tin = tin]

Spins and spins.... after 1/4 billion events the query processing just stops. Apparently, the first query part, until the first pipe tries to retrieve all results from the two billion claims. How can we avoid this lengthy process and return, let's say, just the first match?

0 Karma
Get Updates on the Splunk Community!

.conf24 | Registration Open!

Hello, hello! I come bearing good news: Registration for .conf24 is now open!   conf is Splunk’s rad annual ...

ICYMI - Check out the latest releases of Splunk Edge Processor

Splunk is pleased to announce the latest enhancements to Splunk Edge Processor.  HEC Receiver authorization ...

Introducing the 2024 SplunkTrust!

Hello, Splunk Community! We are beyond thrilled to announce our newest group of SplunkTrust members!  The ...