problem joining big data

dadi
Path Finder

I've 2 big searches that I need to join. Currently I use this paradigm for joining:

search1 OR search2 | stats by joinField | where filterCondition
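For example, with made-up index, sourcetype, and field names, that pattern might look like:

(index=web sourcetype=access_log) OR (index=app sourcetype=order_log) | stats dc(sourcetype) as sources, values(status) as status by session_id | where sources=2

where the dc(sourcetype) filter keeps only the session_id values that appear in both searches.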

It works fine for normal-sized searches. The problem begins when I try to join really big data (search1 returns 300K results while search2 returns 1500K). I'm starting to see that the results are incomplete, and some data that was supposed to get joined is not.

What am I doing wrong?
What could be done in splunk to achieve my requirement?
(Initially I used the join-with-subsearch method, but with its 50K-result limit it was quickly out of the question.)

Thanks.

1 Solution

dadi
Path Finder

What appears to have solved it for me was to transfer one of the searches to a lookup file (in a pre-processing saved search; see the lookup documentation). So now, instead of:

search1 OR search2 | stats by joinField | where filterCondition

it is now:

search1 | lookup mycsv joinField
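
For reference, the pre-processing saved search that builds the lookup could look something like this (otherField is a placeholder for whatever fields of search2 you need; outputlookup writes the results into the lookup file):

search2 | stats values(otherField) as otherField by joinField | outputlookup mycsv.csv

With the file saved as mycsv.csv you can reference it in the lookup command by file name, or set up a lookup definition named mycsv and reference that.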


lguinn2
Legend

Take a look at the limits.conf file - my guess is that you are exceeding at least one of Splunk's limits. For example, I believe that the default limit on the size of search results is 100MB of memory (that's not the count of events, but the total size). You may be exceeding that even if you aren't exceeding the maximum event count.
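
For example, settings along these lines come into play (a sketch only; the exact names and defaults vary by Splunk version, so check the limits.conf spec for yours):

[searchresults]
maxresultrows = 50000

[default]
max_mem_usage_mb = 200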

However, expanding the limits may not be the best solution, even if it is possible. The limits are set to optimize Splunk's behavior...

Consider using summary indexing; this will allow you to accumulate intermediate results over shorter time periods, and then compute the final results as needed. For example, run this search every 5 minutes, saving your results in an index named mysummary:

search1 OR search2 | sistats count by joinField

Assume that this search is named createSummary. Note that it uses the sistats command. After you save the search, you can edit it in the Splunk Manager to set the summary index and the schedule. Once createSummary has been run, you can do the final calculations with this search:

index=mysummary search_name=createSummary | stats count by joinField

Of course, my example is pseudo-code, just like yours, but the idea is there. There is also a backfill script that will initialize your summary index over past time periods.
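
A filled-in version of the two searches, reusing the made-up names from the question (session_id as the join field), might look roughly like:

(index=web sourcetype=access_log) OR (index=app sourcetype=order_log) | sistats dc(sourcetype) values(status) by session_id

and then, over the summary:

index=mysummary search_name=createSummary | stats dc(sourcetype) as sources values(status) as status by session_id | where sources=2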

You can look at the Splunk documentation for more info about summary indexing.

dadi
Path Finder

Thanks for the answer. I'm already using summary indexing. My problem is that the data to be joined is big. Theoretically, I could split it according to different values of the join field, but that doesn't seem like an ideal solution.
I'm now trying to work with lookup tables, migrating the results of search2 into a lookup CSV file, so that my query looks like:
search1 | lookup mycsv joinField

Does this kind of search have scale limits, assuming the CSV file is huge?
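From the limits.conf spec, the relevant setting appears to be max_memtable_bytes in the [lookup] stanza; my understanding (worth verifying for your version) is that once the file grows past this threshold, Splunk indexes the lookup file on disk rather than holding it in memory, so a huge CSV should still work, just more slowly:

[lookup]
max_memtable_bytes = 10000000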
Thanks again.
