We have the following working query -
(index= primary_claim amt > 1000 ) OR
(index=secondary_claim admis_dt>="2015-10*" )
| fields primary_key_claim index
| stats dc(index) as dc by primary_key_claim
| search dc>1
It runs on a 1/3 billion events table and 60 million secondary table just fine.
We would like to print all the fields from both tables into a lookup for the joined set of 1.2 million events.
Any ideas how to do that?
Try this for inner join
on the 2 datasets:
(index= primary_claim amt > 1000) OR (index=secondary_claim admis_dt>="2015-10*")
| stats dc(index) AS dc values(*) AS * BY primary_key_claim
| search dc>1
| fieds - dc
| outputlookup MyLookupDefinitionName
Trying the lookup way -
(index= primary_claim amt > 1000 ) OR
(index=secondary_claim admis_dt>="2015-10*" )
| fields primary_key_claim index
| stats dc(index) as dc by primary_key_claim
| search dc>1
| outputlookup append=false claims_ids.csv.gz
It created a relatively small lookup of 3.6 MBs of 1.2 million ids.
What would be a good way to generate a lookup (or any other object) of the joined two tables?
The first join should probably look like -
index= primary_claim amt > 1000
| lookup claims_ids.csv.gz <we need to filter based on the existence of the id in the lookup>
Is it possible?
The outcome of this command should be the claims from primary_claim
that their ids exist in the claims_ids.csv.gz
lookup.
First of all, we have to agree on terminology.
Using an OR is not the same as Joining data. Joining data uses the join command usually. OR is just a boolean operator saying "I want this data OR this data". Such as "host=abc OR host=def". Which would be different from "host=abc AND host=def" because you're typically not going to find events with two host names that are different like that. The same search with joining would look more like this "host=abc | join commonFieldName [ search host=def]". In such an example the join command would use the commonFieldName as a primary key to join the two sets of data with.
So now that's out of the way...
If you're looking to find ALL of the fields that could be found in either data source this would work:
(index= primary_claim amt > 1000) OR (index= secondary_claim admis_dt>="2015-10*" ) | fieldsummary | table field | transpose 1000 | outputlookup fieldsByColumn.csv
OR this works as well (just depends if you\'re looking for single column lookup file or not)
(index= primary_claim amt > 1000) OR (index= secondary_claim admis_dt>="2015-10*" ) | fieldsummary | table field | outputlookup fieldsByRow.csv
Now, perhaps this is not what you want at all... but you'd have to rephrase your question if so because this is what I believe you wanted when you said "We would like to print all the fields from both tables into a lookup for the joined set of 1.2 million events.".
Actually, the OR above with the way the stats has been build will do the "joining" as desired.
if you have, name, age in one index, and name, gender in the other index, and if you do
index1 OR index2, you get two events:
Name, Age
Name, Gender
then the | stats dc(index) as dc values(*) as * by Name would return:
Name, Age, Gender, 2
the | search dc>1 then would return only those Names that are present in both indexes, and both Age and Gender show up, which is "All of the fields" from both indexes.
eventually, this is a join, but done via the stats command. it will return what question asked for.
@Genti
No luck with | stats dc(index) as dc values() by primary_key_claim
-- I hope this is what you mean.
Same com.splunk.mr.JobStartException
im not familiar with: com.splunk.mr.*
however, most google searches in relation to this return something in relation to map reduce - i.e., hadoop, i.e., Hunk.
are you using Hunk? still, there should be no reason why the search would return things before the | stats but not after.
so, unto troubleshooting.
what happens if you run | stats dc(index) by
do you get results or same errors?
what if you add one ore two fields at a time i.e.,
| stats dc(index) as dc values(field1) as field1 values(field2) as field2 by
lastly, how many fields would it return? is it possible you have too many columns?/fields?
Right, I'm using Hunk.
You said -
-- if you run | stats dc(index) by .... cos_clm_head_sys_id
When running -
(index= primary_claim amt > 1000 ) OR
(index=secondary_claim admis_dt>="2015-10*" )
| fields primary_key_claim index
| stats dc(index) as dc by primary_key_claim
| search dc>1
I get 1.2 million events.
I'm trying now -
(index= primary_claim amt > 1000 ) OR
(index=secondary_claim admis_dt>="2015-10*" )
| fields primary_key_claim index
| stats dc(index) as dc values(diag_cd) as diag_cd by primary_key_claim
| search dc>1
It finishes just fine but the diag_cd column is empty.
Wow, learn something new every day!
Thanks for the correction... i just dont understand how it works this way... but it does...
I thought events would stream through, each with dc(index) equaling to 1, but apparently thats not how it works at all 😉 Thanks again!
It totally blows up my computer when I try to do it too. Seems pretty intense.
Try something like this:
(index= primary_claim amt > 1000 ) OR (index= secondary_claim admis_dt>="2015-10*" )
| stats dc(index) as dc values(*) as * by primary_key_claim
| search dc>1
Do you just want the field name? Have tried the fieldsummary
command? Or you can try
| stats dc(index) as dc latest(*) as * by primary_key_claim
Followed by | outputlookup ...
command
You see, each table has hundreds of fields and I would like to have in the lookup the entire set of fields from the combined two tables.
Remove the fields command from your query and try the | stats dc(index) as dc latest(*) as * by primary_key_claim
Perfect - let me try...
(index= primary_claim amt > 1000 ) OR
(index= secondary_claim admis_dt>="2015-10*" )
| stats dc(index) as dc values(*) as * by primary_key_claim
| search dc>1
@sundareshr
Trying - | stats dc(index) as dc latest(*) as * by primary_key_claim
I get a com.splunk.mr.JobStartException
.
@Genti
Trying | stats dc(index) as dc values(*) as * by primary_key_claim
Also here we get com.splunk.mr.JobStartException
.
You can add them in your fields and stats command.
Great.
(index= primary_claim amt > 1000 ) OR
(index= secondary_claim admis_dt>="2015-10*" )
| fields primary_key_claim index fieldX
| stats dc(index) as dc by primary_key_claim
| search dc>1
So, I added fieldX
to the fields
command. Where do I add it to the stats
command?
add to stats as values(FieldX) as fieldX
. Other options could be first/last/latest/earliest instead of values.