How can we print all fields from a join by stats c...

ddrillic · ‎10-20-2016

We have the following working query -

(index= primary_claim     amt > 1000                ) OR 
(index=secondary_claim    admis_dt>="2015-10*"      )
| fields primary_key_claim index
| stats dc(index) as dc by primary_key_claim 
| search dc>1

It runs on a 1/3 billion events table and 60 million secondary table just fine.
We would like to print all the fields from both tables into a lookup for the joined set of 1.2 million events.

Any ideas how to do that?

woodcock · ‎11-02-2016

Try this for inner join on the 2 datasets:

(index= primary_claim amt > 1000) OR (index=secondary_claim admis_dt>="2015-10*")
| stats dc(index) AS dc values(*) AS * BY primary_key_claim 
| search dc>1
| fieds - dc
| outputlookup MyLookupDefinitionName

ddrillic · ‎10-26-2016

Trying the lookup way -

 (index= primary_claim     amt > 1000                ) OR 
 (index=secondary_claim    admis_dt>="2015-10*"      )
 | fields primary_key_claim index
 | stats dc(index) as dc by primary_key_claim 
 | search dc>1
 | outputlookup append=false claims_ids.csv.gz

It created a relatively small lookup of 3.6 MBs of 1.2 million ids.

What would be a good way to generate a lookup (or any other object) of the joined two tables?

ddrillic · ‎10-26-2016

The first join should probably look like -

index= primary_claim     amt > 1000                    
| lookup claims_ids.csv.gz <we need to filter based on the existence of the id in the lookup>

Is it possible?

The outcome of this command should be the claims from primary_claim that their ids exist in the claims_ids.csv.gz lookup.

jkat54 · ‎10-20-2016

First of all, we have to agree on terminology.

Using an OR is not the same as Joining data. Joining data uses the join command usually. OR is just a boolean operator saying "I want this data OR this data". Such as "host=abc OR host=def". Which would be different from "host=abc AND host=def" because you're typically not going to find events with two host names that are different like that. The same search with joining would look more like this "host=abc | join commonFieldName [ search host=def]". In such an example the join command would use the commonFieldName as a primary key to join the two sets of data with.

So now that's out of the way...

If you're looking to find ALL of the fields that could be found in either data source this would work:

(index= primary_claim amt > 1000) OR  (index= secondary_claim admis_dt>="2015-10*" ) | fieldsummary | table field | transpose 1000 | outputlookup fieldsByColumn.csv

OR this works as well (just depends if you\'re looking for single column lookup file or not)

(index= primary_claim amt > 1000) OR  (index= secondary_claim admis_dt>="2015-10*" ) | fieldsummary | table field | outputlookup fieldsByRow.csv

Now, perhaps this is not what you want at all... but you'd have to rephrase your question if so because this is what I believe you wanted when you said "We would like to print all the fields from both tables into a lookup for the joined set of 1.2 million events.".

Genti · ‎10-20-2016

Actually, the OR above with the way the stats has been build will do the "joining" as desired.
if you have, name, age in one index, and name, gender in the other index, and if you do
index1 OR index2, you get two events:
Name, Age
Name, Gender

then the | stats dc(index) as dc values(*) as * by Name would return:
Name, Age, Gender, 2
the | search dc>1 then would return only those Names that are present in both indexes, and both Age and Gender show up, which is "All of the fields" from both indexes.

eventually, this is a join, but done via the stats command. it will return what question asked for.

ddrillic · ‎10-22-2016

@Genti

No luck with | stats dc(index) as dc values() by primary_key_claim -- I hope this is what you mean.

Same com.splunk.mr.JobStartException

Genti · ‎10-22-2016

im not familiar with: com.splunk.mr.*
however, most google searches in relation to this return something in relation to map reduce - i.e., hadoop, i.e., Hunk.

are you using Hunk? still, there should be no reason why the search would return things before the | stats but not after.

so, unto troubleshooting.
what happens if you run | stats dc(index) by
do you get results or same errors?

what if you add one ore two fields at a time i.e.,
| stats dc(index) as dc values(field1) as field1 values(field2) as field2 by

lastly, how many fields would it return? is it possible you have too many columns?/fields?

ddrillic · ‎10-23-2016

Right, I'm using Hunk.

You said -

-- if you run | stats dc(index) by .... cos_clm_head_sys_id

When running -

 (index= primary_claim     amt > 1000                ) OR 
 (index=secondary_claim    admis_dt>="2015-10*"      )
 | fields primary_key_claim index
 | stats dc(index) as dc by primary_key_claim 
 | search dc>1

I get 1.2 million events.

I'm trying now -

     (index= primary_claim     amt > 1000                ) OR 
     (index=secondary_claim    admis_dt>="2015-10*"      )
     | fields primary_key_claim index
     | stats dc(index) as dc values(diag_cd) as diag_cd by primary_key_claim 
     | search dc>1

It finishes just fine but the diag_cd column is empty.

jkat54 · ‎10-21-2016

Wow, learn something new every day!

Thanks for the correction... i just dont understand how it works this way... but it does...

I thought events would stream through, each with dc(index) equaling to 1, but apparently thats not how it works at all 😉 Thanks again!

jkat54 · ‎10-21-2016

It totally blows up my computer when I try to do it too. Seems pretty intense.

cmerriman · ‎10-20-2016

Try something like this:

(index= primary_claim   amt > 1000  ) OR    (index= secondary_claim  admis_dt>="2015-10*"  )
| stats dc(index) as dc values(*) as * by primary_key_claim 
| search dc>1

sundareshr · ‎10-20-2016

Do you just want the field name? Have tried the fieldsummary command? Or you can try

| stats dc(index) as dc latest(*) as * by primary_key_claim

Followed by | outputlookup ... command

ddrillic · ‎10-20-2016

You see, each table has hundreds of fields and I would like to have in the lookup the entire set of fields from the combined two tables.

sundareshr · ‎10-20-2016

Remove the fields command from your query and try the | stats dc(index) as dc latest(*) as * by primary_key_claim

ddrillic · ‎10-20-2016

Perfect - let me try...

Genti · ‎10-20-2016

(index= primary_claim      amt > 1000                ) OR 
  (index= secondary_claim    admis_dt>="2015-10*"      )
  | stats dc(index) as dc values(*) as * by primary_key_claim 
  | search dc>1

ddrillic · ‎10-22-2016

@sundareshr

Trying - | stats dc(index) as dc latest(*) as * by primary_key_claim

I get a com.splunk.mr.JobStartException.

@Genti

Trying | stats dc(index) as dc values(*) as * by primary_key_claim

Also here we get com.splunk.mr.JobStartException.

somesoni2 · ‎10-20-2016

You can add them in your fields and stats command.

ddrillic · ‎10-20-2016

Great.

 (index= primary_claim      amt > 1000                ) OR 
 (index= secondary_claim    admis_dt>="2015-10*"      )
 | fields primary_key_claim index fieldX
 | stats dc(index) as dc by primary_key_claim 
 | search dc>1

So, I added fieldX to the fields command. Where do I add it to the stats command?

somesoni2 · ‎10-20-2016

add to stats as values(FieldX) as fieldX. Other options could be first/last/latest/earliest instead of values.

How can we print all fields from a join by stats command?

Stay Connected: Your Guide to May Tech Talks, Office Hours, and Webinars!

They're back! Join the SplunkTrust and MVP at .conf24

Enterprise Security Content Update (ESCU) | New Releases