How do you join large tables?
It is impossible to join tables with more than 50k rows in splunk, so I'm using some tricks, and these tricks are extremely annoying.
Is there any "normal way"?
So, to summarize:
there is no "beautiful" way to join in splunk. We need to use stats, eventstats and brainfuck.
I would encourage you not to use the join command. Join is RDBMS thinking, but Splunk works with data differently than an RDBMS does and most of the time join is not needed, nor is it the best way to relate data. If you see this excellent post by MuS, he offers some much more efficient ways of searching across multiple tables (or sourctypes, or whatever it is that differentiates your data) without using join. To the problems that you mentioned:
first problem: more than 2 indexes/tables: This is no problem in Splunk. Searching across multiple indexes/sourctypes is very easy, with no need to join for this operation. I routinely search across multiple sourcetypes without needing to use join.
Second problem: different variables for different joins: We can address this once the details of the different variables for different joins are explained.
Third problem: different names for the same variable: Use eval's coalesce function to make it so that you only have to deal with a single variable name.
If (like @martin_mueller asked) you could share more details of your use case or could share your search, we can help you write a better search.
I have a similar issue.
What if my data is something like below:
index=a
left join filter (index=a join Q
(index=xyz|table Q)
table filter
)
table *
This is bringing back a long-quiet post, but reading this today I see nobody ever gave you a good stats search. try this one. I believe it addresses all the followup issues you identified in other answers.
sourcetype=a OR sourcetype=b OR sourcetype=c
| eval id=if(sourcetype="b",bid,id)
| eval a_time=if(sourcetype="a",_time,null())
| stats values(cid) as cid values(a_time) as a_time by id
Try this
(sourcetype=a OR sourcetype=b OR sourcetype=c)
| table sourcetype id bid cid | eventstats first(bid) as bid by cid
| eval id=coalesce(id,"").coalesce(bid,"")
| stats values(id) as id values(bid) as bid values(cid) as cid values(sourcetype) as sourcetype by id
The dot operator concatenates two strings.
Read "more" here: http://docs.splunk.com/Documentation/Splunk/6.1.3/SearchReference/Eval#Operators
the description sounds like coalesce(id, bid, "")
where can I read about the "." operator?
It should have been just |eval id=coalesce(id,bid) but my data for testing this was not having NULL but blank so used that. It basically takes first non-null value from id or bid and assign it to id.
what does the point mean?
| eval id=coalesce(id,"").coalesce(bid,"")
I would encourage you not to use the join command. Join is RDBMS thinking, but Splunk works with data differently than an RDBMS does and most of the time join is not needed, nor is it the best way to relate data. If you see this excellent post by MuS, he offers some much more efficient ways of searching across multiple tables (or sourctypes, or whatever it is that differentiates your data) without using join. To the problems that you mentioned:
first problem: more than 2 indexes/tables: This is no problem in Splunk. Searching across multiple indexes/sourctypes is very easy, with no need to join for this operation. I routinely search across multiple sourcetypes without needing to use join.
Second problem: different variables for different joins: We can address this once the details of the different variables for different joins are explained.
Third problem: different names for the same variable: Use eval's coalesce function to make it so that you only have to deal with a single variable name.
If (like @martin_mueller asked) you could share more details of your use case or could share your search, we can help you write a better search.
So.
C. What I do
(sourcetype=a OR sourcetype=b OR sourcetype=c)
| eval bid = coalesce(bid, id, cid)
| stats min(_time) as _time, max(cid) as cid by bid
| eval bid = if(cid = bid, null, bid)
| eval cid = coalesce(cid, bid)
| where not (isnull(bid))
| stats max(bid) as bid by cid
| eval a = bid | eval c = if(cid = bid, null, cid)
// and eval b - unknown... have no time now 😞
It all looks like a tricky and time wasting thing 🙂 I think it's possible to improve
Ok, I'll try to show what do I really need now.
I suppose not-inner join's become a little problem now.
A. What I've got
1) sourcetype = a | table _time, id
2) sourcetype = b | table bid, cid
3) sourcetype = c | table cid
B. What I need
earliest=-1d@d latest=@d a
left join
earliest=-1d@d b on a.id = b.bid
left join
earliest=-1d@d c on b.cid = c.cid
and gt a table: a._time, a.id, b.bid, c.cid from this join
...to be continued...
Is it possible to use STATS?
(Example)
index=index_A
ID,NAME
1,AAA
2,BBB
3,CCC
4,DDD
index=index_B
ID,COUNT
1,100
2,200
3,300
4,400
index=index_A OR index=index_B|stats first(NAME) as NAME,first(COUNT) as COUNT by ID
ID,NAME,COUNT
1,AAA,100
2,BBB,200
3,CCC,300
4,DDD,400
first problem: more than 2 indexes/tables
second problem: different variables for different joins
third problem: different names for the same variable
when I haveto join three indexes A, B, C; and join A with B by id1 and B with C by id2 - it becomes MUCH more complicated. especially when the join-by-fields have different names in different indexes. 20 rows of and awful-formatting search request for such a simple thing... =(
ye, it's possible, but it is a little bit strange way.
And what to use instead of join?
My case is joining some data 🙂 from different applications and hence different sources/sourcetypes/indexes.
Several million of rows, at least.
The Splunk way to join data is not to use join
. What's your use case?