Splunk Search

How to join large tables with more than 50,000 rows in Splunk?

Communicator

How do you join large tables?

It is impossible to join tables with more than 50k rows in splunk, so I'm using some tricks, and these tricks are extremely annoying.

Is there any "normal way"?


So, to summarize:
there is no "beautiful" way to join in splunk. We need to use stats, eventstats and brainfuck.

Tags (2)
1 Solution

Motivator

I would encourage you not to use the join command. Join is RDBMS thinking, but Splunk works with data differently than an RDBMS does and most of the time join is not needed, nor is it the best way to relate data. If you see this excellent post by MuS, he offers some much more efficient ways of searching across multiple tables (or sourctypes, or whatever it is that differentiates your data) without using join. To the problems that you mentioned:

first problem: more than 2 indexes/tables: This is no problem in Splunk. Searching across multiple indexes/sourctypes is very easy, with no need to join for this operation. I routinely search across multiple sourcetypes without needing to use join.

Second problem: different variables for different joins: We can address this once the details of the different variables for different joins are explained.

Third problem: different names for the same variable: Use eval's coalesce function to make it so that you only have to deal with a single variable name.

If (like @martin_mueller asked) you could share more details of your use case or could share your search, we can help you write a better search.

View solution in original post

Explorer

I have a similar issue.
What if my data is something like below:

index=a
left join filter (index=a join Q
(index=xyz|table Q)
table filter
)
table *

0 Karma

SplunkTrust
SplunkTrust

This is bringing back a long-quiet post, but reading this today I see nobody ever gave you a good stats search. try this one. I believe it addresses all the followup issues you identified in other answers.

sourcetype=a OR sourcetype=b OR sourcetype=c
| eval id=if(sourcetype="b",bid,id)
| eval a_time=if(sourcetype="a",_time,null())
| stats values(cid) as cid values(a_time) as a_time by id
0 Karma

Revered Legend

Try this

(sourcetype=a OR sourcetype=b OR sourcetype=c)
| table sourcetype id bid cid | eventstats first(bid) as bid by cid 
| eval id=coalesce(id,"").coalesce(bid,"") 
| stats values(id) as id values(bid) as bid values(cid) as cid values(sourcetype) as sourcetype by id 
0 Karma

SplunkTrust
SplunkTrust

The dot operator concatenates two strings.

Read "more" here: http://docs.splunk.com/Documentation/Splunk/6.1.3/SearchReference/Eval#Operators

0 Karma

Communicator

the description sounds like coalesce(id, bid, "")

where can I read about the "." operator?

0 Karma

Revered Legend

It should have been just |eval id=coalesce(id,bid) but my data for testing this was not having NULL but blank so used that. It basically takes first non-null value from id or bid and assign it to id.

0 Karma

Communicator

what does the point mean?
| eval id=coalesce(id,"").coalesce(bid,"")

0 Karma

Motivator

I would encourage you not to use the join command. Join is RDBMS thinking, but Splunk works with data differently than an RDBMS does and most of the time join is not needed, nor is it the best way to relate data. If you see this excellent post by MuS, he offers some much more efficient ways of searching across multiple tables (or sourctypes, or whatever it is that differentiates your data) without using join. To the problems that you mentioned:

first problem: more than 2 indexes/tables: This is no problem in Splunk. Searching across multiple indexes/sourctypes is very easy, with no need to join for this operation. I routinely search across multiple sourcetypes without needing to use join.

Second problem: different variables for different joins: We can address this once the details of the different variables for different joins are explained.

Third problem: different names for the same variable: Use eval's coalesce function to make it so that you only have to deal with a single variable name.

If (like @martin_mueller asked) you could share more details of your use case or could share your search, we can help you write a better search.

View solution in original post

Communicator

So.
C. What I do
(sourcetype=a OR sourcetype=b OR sourcetype=c)
| eval bid = coalesce(bid, id, cid)
| stats min(_time) as _time, max(cid) as cid by bid
| eval bid = if(cid = bid, null, bid)
| eval cid = coalesce(cid, bid)
| where not (isnull(bid))
| stats max(bid) as bid by cid
| eval a = bid | eval c = if(cid = bid, null, cid)

// and eval b - unknown... have no time now 😞

It all looks like a tricky and time wasting thing 🙂 I think it's possible to improve

0 Karma

Communicator

Ok, I'll try to show what do I really need now.
I suppose not-inner join's become a little problem now.

A. What I've got
1) sourcetype = a | table _time, id
2) sourcetype = b | table bid, cid
3) sourcetype = c | table cid

B. What I need
earliest=-1d@d latest=@d a
left join
earliest=-1d@d b on a.id = b.bid
left join
earliest=-1d@d c on b.cid = c.cid

and gt a table: a._time, a.id, b.bid, c.cid from this join

...to be continued...

0 Karma

SplunkTrust
SplunkTrust
0 Karma

Champion

Is it possible to use STATS?

(Example)
index=index_A
ID,NAME
1,AAA
2,BBB
3,CCC
4,DDD

index=index_B
ID,COUNT
1,100
2,200
3,300
4,400

index=index_A OR index=index_B|stats first(NAME) as NAME,first(COUNT) as COUNT by ID

ID,NAME,COUNT
1,AAA,100
2,BBB,200
3,CCC,300
4,DDD,400

0 Karma

Communicator

first problem: more than 2 indexes/tables
second problem: different variables for different joins
third problem: different names for the same variable

when I haveto join three indexes A, B, C; and join A with B by id1 and B with C by id2 - it becomes MUCH more complicated. especially when the join-by-fields have different names in different indexes. 20 rows of and awful-formatting search request for such a simple thing... =(

0 Karma

Communicator

ye, it's possible, but it is a little bit strange way.

0 Karma

Communicator

And what to use instead of join?

My case is joining some data 🙂 from different applications and hence different sources/sourcetypes/indexes.
Several million of rows, at least.

0 Karma

SplunkTrust
SplunkTrust

The Splunk way to join data is not to use join. What's your use case?

0 Karma
State of Splunk Careers

Access the Splunk Careers Report to see real data that shows how Splunk mastery increases your value and job satisfaction.

Find out what your skills are worth!