Splunk Search

How to join large tables with more than 50,000 rows in Splunk?

0range
Communicator

How do you join large tables?

It is impossible to join tables with more than 50k rows in splunk, so I'm using some tricks, and these tricks are extremely annoying.

Is there any "normal way"?


So, to summarize:
there is no "beautiful" way to join in splunk. We need to use stats, eventstats and brainfuck.

Tags (2)
1 Solution

wpreston
Motivator

I would encourage you not to use the join command. Join is RDBMS thinking, but Splunk works with data differently than an RDBMS does and most of the time join is not needed, nor is it the best way to relate data. If you see this excellent post by MuS, he offers some much more efficient ways of searching across multiple tables (or sourctypes, or whatever it is that differentiates your data) without using join. To the problems that you mentioned:

first problem: more than 2 indexes/tables: This is no problem in Splunk. Searching across multiple indexes/sourctypes is very easy, with no need to join for this operation. I routinely search across multiple sourcetypes without needing to use join.

Second problem: different variables for different joins: We can address this once the details of the different variables for different joins are explained.

Third problem: different names for the same variable: Use eval's coalesce function to make it so that you only have to deal with a single variable name.

If (like @martin_mueller asked) you could share more details of your use case or could share your search, we can help you write a better search.

View solution in original post

srujan9292
Explorer

I have a similar issue.
What if my data is something like below:

index=a
left join filter (index=a join Q
(index=xyz|table Q)
table filter
)
table *

0 Karma

sideview
SplunkTrust
SplunkTrust

This is bringing back a long-quiet post, but reading this today I see nobody ever gave you a good stats search. try this one. I believe it addresses all the followup issues you identified in other answers.

sourcetype=a OR sourcetype=b OR sourcetype=c
| eval id=if(sourcetype="b",bid,id)
| eval a_time=if(sourcetype="a",_time,null())
| stats values(cid) as cid values(a_time) as a_time by id
0 Karma

somesoni2
Revered Legend

Try this

(sourcetype=a OR sourcetype=b OR sourcetype=c)
| table sourcetype id bid cid | eventstats first(bid) as bid by cid 
| eval id=coalesce(id,"").coalesce(bid,"") 
| stats values(id) as id values(bid) as bid values(cid) as cid values(sourcetype) as sourcetype by id 
0 Karma

martin_mueller
SplunkTrust
SplunkTrust

The dot operator concatenates two strings.

Read "more" here: http://docs.splunk.com/Documentation/Splunk/6.1.3/SearchReference/Eval#Operators

0 Karma

0range
Communicator

the description sounds like coalesce(id, bid, "")

where can I read about the "." operator?

0 Karma

somesoni2
Revered Legend

It should have been just |eval id=coalesce(id,bid) but my data for testing this was not having NULL but blank so used that. It basically takes first non-null value from id or bid and assign it to id.

0 Karma

0range
Communicator

what does the point mean?
| eval id=coalesce(id,"").coalesce(bid,"")

0 Karma

wpreston
Motivator

I would encourage you not to use the join command. Join is RDBMS thinking, but Splunk works with data differently than an RDBMS does and most of the time join is not needed, nor is it the best way to relate data. If you see this excellent post by MuS, he offers some much more efficient ways of searching across multiple tables (or sourctypes, or whatever it is that differentiates your data) without using join. To the problems that you mentioned:

first problem: more than 2 indexes/tables: This is no problem in Splunk. Searching across multiple indexes/sourctypes is very easy, with no need to join for this operation. I routinely search across multiple sourcetypes without needing to use join.

Second problem: different variables for different joins: We can address this once the details of the different variables for different joins are explained.

Third problem: different names for the same variable: Use eval's coalesce function to make it so that you only have to deal with a single variable name.

If (like @martin_mueller asked) you could share more details of your use case or could share your search, we can help you write a better search.

0range
Communicator

So.
C. What I do
(sourcetype=a OR sourcetype=b OR sourcetype=c)
| eval bid = coalesce(bid, id, cid)
| stats min(_time) as _time, max(cid) as cid by bid
| eval bid = if(cid = bid, null, bid)
| eval cid = coalesce(cid, bid)
| where not (isnull(bid))
| stats max(bid) as bid by cid
| eval a = bid | eval c = if(cid = bid, null, cid)

// and eval b - unknown... have no time now 😞

It all looks like a tricky and time wasting thing 🙂 I think it's possible to improve

0 Karma

0range
Communicator

Ok, I'll try to show what do I really need now.
I suppose not-inner join's become a little problem now.

A. What I've got
1) sourcetype = a | table _time, id
2) sourcetype = b | table bid, cid
3) sourcetype = c | table cid

B. What I need
earliest=-1d@d latest=@d a
left join
earliest=-1d@d b on a.id = b.bid
left join
earliest=-1d@d c on b.cid = c.cid

and gt a table: a._time, a.id, b.bid, c.cid from this join

...to be continued...

0 Karma

HiroshiSatoh
Champion

Is it possible to use STATS?

(Example)
index=index_A
ID,NAME
1,AAA
2,BBB
3,CCC
4,DDD

index=index_B
ID,COUNT
1,100
2,200
3,300
4,400

index=index_A OR index=index_B|stats first(NAME) as NAME,first(COUNT) as COUNT by ID

ID,NAME,COUNT
1,AAA,100
2,BBB,200
3,CCC,300
4,DDD,400

0 Karma

0range
Communicator

first problem: more than 2 indexes/tables
second problem: different variables for different joins
third problem: different names for the same variable

when I haveto join three indexes A, B, C; and join A with B by id1 and B with C by id2 - it becomes MUCH more complicated. especially when the join-by-fields have different names in different indexes. 20 rows of and awful-formatting search request for such a simple thing... =(

0 Karma

0range
Communicator

ye, it's possible, but it is a little bit strange way.

0 Karma

0range
Communicator

And what to use instead of join?

My case is joining some data 🙂 from different applications and hence different sources/sourcetypes/indexes.
Several million of rows, at least.

0 Karma

martin_mueller
SplunkTrust
SplunkTrust

The Splunk way to join data is not to use join. What's your use case?

0 Karma
Get Updates on the Splunk Community!

Enterprise Security Content Update (ESCU) | New Releases

In September, the Splunk Threat Research Team had two releases of new security content via the Enterprise ...

New in Observability - Improvements to Custom Metrics SLOs, Log Observer Connect & ...

The latest enhancements to the Splunk observability portfolio deliver improved SLO management accuracy, better ...

Improve Data Pipelines Using Splunk Data Management

  Register Now   This Tech Talk will explore the pipeline management offerings Edge Processor and Ingest ...