Splunk Search

How can we print all fields from a join by stats command?

ddrillic
Ultra Champion

We have the following working query -

(index= primary_claim     amt > 1000                ) OR 
(index=secondary_claim    admis_dt>="2015-10*"      )
| fields primary_key_claim index
| stats dc(index) as dc by primary_key_claim 
| search dc>1

It runs on a 1/3 billion events table and 60 million secondary table just fine.
We would like to print all the fields from both tables into a lookup for the joined set of 1.2 million events.

Any ideas how to do that?

Tags (2)
0 Karma

woodcock
Esteemed Legend

Try this for inner join on the 2 datasets:

(index= primary_claim amt > 1000) OR (index=secondary_claim admis_dt>="2015-10*")
| stats dc(index) AS dc values(*) AS * BY primary_key_claim 
| search dc>1
| fieds - dc
| outputlookup MyLookupDefinitionName
0 Karma

ddrillic
Ultra Champion

Trying the lookup way -

 (index= primary_claim     amt > 1000                ) OR 
 (index=secondary_claim    admis_dt>="2015-10*"      )
 | fields primary_key_claim index
 | stats dc(index) as dc by primary_key_claim 
 | search dc>1
 | outputlookup append=false claims_ids.csv.gz

It created a relatively small lookup of 3.6 MBs of 1.2 million ids.

What would be a good way to generate a lookup (or any other object) of the joined two tables?

0 Karma

ddrillic
Ultra Champion

The first join should probably look like -

index= primary_claim     amt > 1000                    
| lookup claims_ids.csv.gz <we need to filter based on the existence of the id in the lookup>

Is it possible?

The outcome of this command should be the claims from primary_claim that their ids exist in the claims_ids.csv.gz lookup.

0 Karma

jkat54
SplunkTrust
SplunkTrust

First of all, we have to agree on terminology.

Using an OR is not the same as Joining data. Joining data uses the join command usually. OR is just a boolean operator saying "I want this data OR this data". Such as "host=abc OR host=def". Which would be different from "host=abc AND host=def" because you're typically not going to find events with two host names that are different like that. The same search with joining would look more like this "host=abc | join commonFieldName [ search host=def]". In such an example the join command would use the commonFieldName as a primary key to join the two sets of data with.

So now that's out of the way...

If you're looking to find ALL of the fields that could be found in either data source this would work:

(index= primary_claim amt > 1000) OR  (index= secondary_claim admis_dt>="2015-10*" ) | fieldsummary | table field | transpose 1000 | outputlookup fieldsByColumn.csv

OR this works as well (just depends if you\'re looking for single column lookup file or not)

(index= primary_claim amt > 1000) OR  (index= secondary_claim admis_dt>="2015-10*" ) | fieldsummary | table field | outputlookup fieldsByRow.csv

Now, perhaps this is not what you want at all... but you'd have to rephrase your question if so because this is what I believe you wanted when you said "We would like to print all the fields from both tables into a lookup for the joined set of 1.2 million events.".

0 Karma

Genti
Splunk Employee
Splunk Employee

Actually, the OR above with the way the stats has been build will do the "joining" as desired.
if you have, name, age in one index, and name, gender in the other index, and if you do
index1 OR index2, you get two events:
Name, Age
Name, Gender

then the | stats dc(index) as dc values(*) as * by Name would return:
Name, Age, Gender, 2
the | search dc>1 then would return only those Names that are present in both indexes, and both Age and Gender show up, which is "All of the fields" from both indexes.

eventually, this is a join, but done via the stats command. it will return what question asked for.

ddrillic
Ultra Champion

@Genti

No luck with | stats dc(index) as dc values() by primary_key_claim -- I hope this is what you mean.

Same com.splunk.mr.JobStartException

0 Karma

Genti
Splunk Employee
Splunk Employee

im not familiar with: com.splunk.mr.*
however, most google searches in relation to this return something in relation to map reduce - i.e., hadoop, i.e., Hunk.

are you using Hunk? still, there should be no reason why the search would return things before the | stats but not after.

so, unto troubleshooting.
what happens if you run | stats dc(index) by
do you get results or same errors?

what if you add one ore two fields at a time i.e.,
| stats dc(index) as dc values(field1) as field1 values(field2) as field2 by

lastly, how many fields would it return? is it possible you have too many columns?/fields?

0 Karma

ddrillic
Ultra Champion

Right, I'm using Hunk.

You said -

-- if you run | stats dc(index) by .... cos_clm_head_sys_id

When running -

 (index= primary_claim     amt > 1000                ) OR 
 (index=secondary_claim    admis_dt>="2015-10*"      )
 | fields primary_key_claim index
 | stats dc(index) as dc by primary_key_claim 
 | search dc>1

I get 1.2 million events.

I'm trying now -

     (index= primary_claim     amt > 1000                ) OR 
     (index=secondary_claim    admis_dt>="2015-10*"      )
     | fields primary_key_claim index
     | stats dc(index) as dc values(diag_cd) as diag_cd by primary_key_claim 
     | search dc>1

It finishes just fine but the diag_cd column is empty.

0 Karma

jkat54
SplunkTrust
SplunkTrust

Wow, learn something new every day!

Thanks for the correction... i just dont understand how it works this way... but it does...

I thought events would stream through, each with dc(index) equaling to 1, but apparently thats not how it works at all 😉 Thanks again!

0 Karma

jkat54
SplunkTrust
SplunkTrust

It totally blows up my computer when I try to do it too. Seems pretty intense.

0 Karma

cmerriman
Super Champion

Try something like this:

(index= primary_claim   amt > 1000  ) OR    (index= secondary_claim  admis_dt>="2015-10*"  )
| stats dc(index) as dc values(*) as * by primary_key_claim 
| search dc>1
0 Karma

sundareshr
Legend

Do you just want the field name? Have tried the fieldsummary command? Or you can try

| stats dc(index) as dc latest(*) as * by primary_key_claim

Followed by | outputlookup ... command

0 Karma

ddrillic
Ultra Champion

You see, each table has hundreds of fields and I would like to have in the lookup the entire set of fields from the combined two tables.

0 Karma

sundareshr
Legend

Remove the fields command from your query and try the | stats dc(index) as dc latest(*) as * by primary_key_claim

0 Karma

ddrillic
Ultra Champion

Perfect - let me try...

0 Karma

Genti
Splunk Employee
Splunk Employee
(index= primary_claim      amt > 1000                ) OR 
  (index= secondary_claim    admis_dt>="2015-10*"      )
  | stats dc(index) as dc values(*) as * by primary_key_claim 
  | search dc>1
0 Karma

ddrillic
Ultra Champion

@sundareshr

Trying - | stats dc(index) as dc latest(*) as * by primary_key_claim

I get a com.splunk.mr.JobStartException.

@Genti

Trying | stats dc(index) as dc values(*) as * by primary_key_claim

Also here we get com.splunk.mr.JobStartException.

0 Karma

somesoni2
Revered Legend

You can add them in your fields and stats command.

0 Karma

ddrillic
Ultra Champion

Great.

 (index= primary_claim      amt > 1000                ) OR 
 (index= secondary_claim    admis_dt>="2015-10*"      )
 | fields primary_key_claim index fieldX
 | stats dc(index) as dc by primary_key_claim 
 | search dc>1

So, I added fieldX to the fields command. Where do I add it to the stats command?

0 Karma

somesoni2
Revered Legend

add to stats as values(FieldX) as fieldX. Other options could be first/last/latest/earliest instead of values.

0 Karma
Get Updates on the Splunk Community!

Stay Connected: Your Guide to May Tech Talks, Office Hours, and Webinars!

Take a look below to explore our upcoming Community Office Hours, Tech Talks, and Webinars this month. This ...

They're back! Join the SplunkTrust and MVP at .conf24

With our highly anticipated annual conference, .conf, comes the fez-wearers you can trust! The SplunkTrust, as ...

Enterprise Security Content Update (ESCU) | New Releases

Last month, the Splunk Threat Research Team had two releases of new security content via the Enterprise ...