join on billions of records

splunkannm · ‎05-03-2018

My dataset that I need to join on is about half a billion. Since a subsearch is slow what is the alternative to using the join?

woodcock · ‎05-05-2018

Try this:

(index="abc" AND (sourcetype="a" OR sourcetype="b")  AND [|makeresults | eval search="_time>=" . relative_time(now(), "-36mon") | rex field=search "s/\"//g" | table search ])
OR (index="abc" AND sourcetype="c"))

| fields ACCT_NUM1 ACCT_NUM2 *
| eval right_acct_num=case(
   sourcetype=="a", ACCT_NUM1,
   sourcetype=="b", ACCT_NUM2,
   sourcetype=="c", acct_num,
   true(), "ERROR")
| stats values(*) as * BY right_acct_num
| rename COMMENT AS "Double-check me here; 'b' is from the first right join and the second left join."
| where sourcetype="b"

somesoni2 · ‎05-03-2018

Give this a try

((index="abc") (sourcetype="a" OR sourcetype="b" OR sourcetype="c" ))  earliest=-36months 
| fields - _raw date* punct time* 
| fields ACCT_NUM1 ACCT_NUM2 *
| eval "right_acct_num"=case(sourcetype=="a", ACCT_NUM1,sourcetype=="b",ACCT_NUM2,true(),acct_num) 
| stats values(sourcetype) as sourcetypes latest(*) as * by right_acct_num     
| where NOT (mvcount(sourcetypes)=1 AND sourcetypes='c")

woodcock · ‎05-03-2018

Show us your entire search and we can help. You have given us nothing to work with. See some of my posts about using appendpipe instead and maybe you can figure out something.

splunkannm · ‎05-03-2018

((index="abc") (sourcetype="a" OR sourcetype="b")) earliest=-36months
| fields ACCT_NUM1 ACCT_NUM2 *
| eval "right_acct_num"=if(sourcetype=="a", ACCT_NUM1,ACCT_NUM2)
| stats latest() as * by right_acct_num

| join type="left" right_acct_num
[ search index = abc sourcetype=c
|rename acct_num as right_acct_num
| fields - _raw date punct time*
| table * ]

The table that is being joined on has about 500 million records.

somesoni2 · ‎05-03-2018

That would depends upon the query that you're using currently. At high level, following gives the possible alternative commands:
http://docs.splunk.com/Documentation/Splunk/7.1.0/SearchReference/Join#Alternative_commands

splunkannm · ‎05-03-2018

Its a simple left join on one filed, just that the number of records are huge and the number of fields also are too many

somesoni2 · ‎05-03-2018

Would you mind sharing your current query (mask anything sensitive) and tell us what type of reporting you want to do after join those records? Sometimes two queries can be merged into one at base search level itself (the best option with least amount of limitations) or sometimes they need to be appended using subsearch (subsearch used for smallest dataset between two). So, it's case by case basis.

join on billions of records

Adoption of RUM and APM at Splunk

Routing logs with Splunk OTel Collector for Kubernetes

Welcome to the Splunk Community!