Solved: Re: How to join large tables with more than 50,000...

0range · ‎07-22-2014

How do you join large tables?

It is impossible to join tables with more than 50k rows in splunk, so I'm using some tricks, and these tricks are extremely annoying.

Is there any "normal way"?

So, to summarize:
there is no "beautiful" way to join in splunk. We need to use stats, eventstats and brainfuck.

wpreston · ‎08-28-2014

I would encourage you not to use the join command. Join is RDBMS thinking, but Splunk works with data differently than an RDBMS does and most of the time join is not needed, nor is it the best way to relate data. If you see this excellent post by MuS, he offers some much more efficient ways of searching across multiple tables (or sourctypes, or whatever it is that differentiates your data) without using join. To the problems that you mentioned:

first problem: more than 2 indexes/tables: This is no problem in Splunk. Searching across multiple indexes/sourctypes is very easy, with no need to join for this operation. I routinely search across multiple sourcetypes without needing to use join.

Second problem: different variables for different joins: We can address this once the details of the different variables for different joins are explained.

Third problem: different names for the same variable: Use eval's coalesce function to make it so that you only have to deal with a single variable name.

If (like @martin_mueller asked) you could share more details of your use case or could share your search, we can help you write a better search.

View solution in original post

srujan9292 · ‎01-02-2019

I have a similar issue.
What if my data is something like below:

index=a
left join filter (index=a join Q
(index=xyz|table Q)
table filter
)
table *

sideview · ‎03-30-2016

This is bringing back a long-quiet post, but reading this today I see nobody ever gave you a good stats search. try this one. I believe it addresses all the followup issues you identified in other answers.

sourcetype=a OR sourcetype=b OR sourcetype=c
| eval id=if(sourcetype="b",bid,id)
| eval a_time=if(sourcetype="a",_time,null())
| stats values(cid) as cid values(a_time) as a_time by id

somesoni2 · ‎08-28-2014

Try this

(sourcetype=a OR sourcetype=b OR sourcetype=c)
| table sourcetype id bid cid | eventstats first(bid) as bid by cid 
| eval id=coalesce(id,"").coalesce(bid,"") 
| stats values(id) as id values(bid) as bid values(cid) as cid values(sourcetype) as sourcetype by id

martin_mueller · ‎08-29-2014

The dot operator concatenates two strings.

Read "more" here: http://docs.splunk.com/Documentation/Splunk/6.1.3/SearchReference/Eval#Operators

0range · ‎08-29-2014

the description sounds like coalesce(id, bid, "")

where can I read about the "." operator?

somesoni2 · ‎08-29-2014

It should have been just |eval id=coalesce(id,bid) but my data for testing this was not having NULL but blank so used that. It basically takes first non-null value from id or bid and assign it to id.

0range · ‎08-29-2014

what does the point mean?
| eval id=coalesce(id,"").coalesce(bid,"")

wpreston · ‎08-28-2014

I would encourage you not to use the join command. Join is RDBMS thinking, but Splunk works with data differently than an RDBMS does and most of the time join is not needed, nor is it the best way to relate data. If you see this excellent post by MuS, he offers some much more efficient ways of searching across multiple tables (or sourctypes, or whatever it is that differentiates your data) without using join. To the problems that you mentioned:

first problem: more than 2 indexes/tables: This is no problem in Splunk. Searching across multiple indexes/sourctypes is very easy, with no need to join for this operation. I routinely search across multiple sourcetypes without needing to use join.

Second problem: different variables for different joins: We can address this once the details of the different variables for different joins are explained.

Third problem: different names for the same variable: Use eval's coalesce function to make it so that you only have to deal with a single variable name.

If (like @martin_mueller asked) you could share more details of your use case or could share your search, we can help you write a better search.

0range · ‎08-28-2014

// and eval b - unknown... have no time now 😞

It all looks like a tricky and time wasting thing 🙂 I think it's possible to improve

0range · ‎08-28-2014

Ok, I'll try to show what do I really need now.
I suppose not-inner join's become a little problem now.

A. What I've got
1) sourcetype = a | table _time, id
2) sourcetype = b | table bid, cid
3) sourcetype = c | table cid

B. What I need
earliest=-1d@d latest=@d a
left join
earliest=-1d@d b on a.id = b.bid
left join
earliest=-1d@d c on b.cid = c.cid

and gt a table: a._time, a.id, b.bid, c.cid from this join

...to be continued...

martin_mueller · ‎07-23-2014

Here are a few options: http://answers.splunk.com/answers/129424/how-to-compare-fields-over-multiple-sourcetypes-without-joi...

HiroshiSatoh · ‎07-23-2014

Is it possible to use STATS?

(Example)
index=index_A
ID,NAME
1,AAA
2,BBB
3,CCC
4,DDD

index=index_B
ID,COUNT
1,100
2,200
3,300
4,400

index=index_A OR index=index_B|stats first(NAME) as NAME,first(COUNT) as COUNT by ID

ID,NAME,COUNT
1,AAA,100
2,BBB,200
3,CCC,300
4,DDD,400

0range · ‎08-28-2014

first problem: more than 2 indexes/tables
second problem: different variables for different joins
third problem: different names for the same variable

when I haveto join three indexes A, B, C; and join A with B by id1 and B with C by id2 - it becomes MUCH more complicated. especially when the join-by-fields have different names in different indexes. 20 rows of and awful-formatting search request for such a simple thing... =(

0range · ‎08-28-2014

ye, it's possible, but it is a little bit strange way.

0range · ‎07-23-2014

And what to use instead of join?

My case is joining some data 🙂 from different applications and hence different sources/sourcetypes/indexes.
Several million of rows, at least.

martin_mueller · ‎07-22-2014

The Splunk way to join data is not to use join. What's your use case?

How to join large tables with more than 50,000 rows in Splunk?

Building Reliable Asset and Identity Frameworks in Splunk ES

Cloud Monitoring Console - Unlocking Greater Visibility in SVC Usage Reporting

Automatic Discovery Part 3: Practical Use Cases

Are you a member of the Splunk Community?

How to join large tables with more than 50,000 rows in Splunk?

Building Reliable Asset and Identity Frameworks in Splunk ES

Cloud Monitoring Console - Unlocking Greater Visibility in SVC Usage Reporting

Automatic Discovery Part 3: Practical Use Cases