Recommended way to join large datasets from multiple data sources?

pmittal
Engager

Hi, I am new to Splunk and have very little knowledge of it. I am looking for help with the following use case:

Query 1 returns process data, Query 2 returns container data, Query 3 returns container image data, and Query 4 returns container tag data. The data is joined as follows:

  1. Query 2 and Query 3 are joined on the image id field.
  2. The result is then joined with Query 4 on the container id field.
  3. That result is then joined with Query 1 on the pid field.

To make matters more complex, data from different hosting environments (DC, private cloud, public cloud) needs to be joined, but some field names differ between environments and need to be mapped first. For example, Query 1 returns process data for DC, private cloud, and public cloud, but not all fields have the same names. Hence, Query 1 can't be used directly to query all three hosting environments in one go. Right now, I run Query 1 for each hosting environment separately and then append the results.

I am hitting the 50k record limit with join. I went through several previous posts, and all of them suggest using stats and avoiding join and append. However, the examples use only two data sources, which is not sufficient in my case:

https://community.splunk.com/t5/Splunk-Search/How-to-join-large-tables-with-more-than-50-000-rows-in...

https://community.splunk.com/t5/Splunk-Search/Large-scale-join-between-two-sourcetypes/m-p/549019#M1...

https://community.splunk.com/t5/Splunk-Search/How-to-compare-fields-over-multiple-sourcetypes-withou... 

 

What would the stats search look like in this case?

I am looking for an answer like this one: https://community.splunk.com/t5/Splunk-Search/Large-scale-join-between-two-sourcetypes/m-p/549065/hi... 

 

 

 


ITWhisperer
SplunkTrust

It probably depends on the specifics of your case.

In general, the subsearch used by join or append is limited by default to 50,000 results, so the limit is the same whichever of the two commands you use. That leaves you two options: either break the subsearch up into chunks of fewer than 50k results each, or pull all the events into the initial search so that no subsearch is needed.

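If you combine the searches, you might be able to use eventstats, rather than just stats, to "join" your events from, say, Query 2 and Query 3 on image id, then use eventstats again to "join" these results with Query 4 on container id, and finally use stats to "join" everything with Query 1 on pid. Unlike stats, eventstats keeps every event and adds the aggregated fields to it, so the intermediate results are still available for the next step.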
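
As a rough sketch of what that could look like (not a tested search): every index, sourcetype, and field name below (example_proc, example_ctr, example_process, example_container, example_image, example_tags, process_id, containerId, image_name, container_tag) is a made-up placeholder, since I don't know your actual data. The eval lines with coalesce stand in for whatever field mapping you need between DC, private cloud, and public cloud.

```
(index=example_proc sourcetype=example_process)
    OR (index=example_ctr sourcetype=example_container)
    OR (index=example_ctr sourcetype=example_image)
    OR (index=example_ctr sourcetype=example_tags)
| eval pid=coalesce(pid, process_id)
| eval container_id=coalesce(container_id, containerId)
| eventstats values(image_name) as image_name by image_id
| eventstats values(container_tag) as container_tag by container_id
| stats values(*) as * by pid
```

The first eventstats copies the image fields onto the container events that share an image_id, the second copies the tag fields onto the container events that share a container_id, and the final stats merges the enriched container events with the process events that share a pid. Events without a pid (here, the image and tag events) drop out at the stats step, which is fine because their values have already been copied across. If pid values can repeat across hosts, you would probably need to add host (or an equivalent field) to the by clauses. Also bear in mind that eventstats has its own memory limit, so this may need tuning on very large result sets.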

If you still want to use the join command repeatedly with smaller chunks, you should probably use left joins so you don't lose any data that isn't matched by the current chunk.
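
A minimal sketch of the chunked approach, again with made-up index, sourcetype, and field names, and assuming pid is numeric and that splitting at 30000 happens to keep each chunk under the limit (you would need to pick a split that suits your own data):

```
index=example_proc sourcetype=example_process
| join type=left pid
    [ search index=example_ctr sourcetype=example_container
      | where tonumber(pid) < 30000
      | fields pid container_id image_id ]
| join type=left pid
    [ search index=example_ctr sourcetype=example_container
      | where tonumber(pid) >= 30000
      | fields pid container_id image_id ]
```

Because both joins are type=left, a process event that finds no match in one chunk passes through unchanged and can still pick up its fields from the other chunk. As long as the chunks don't overlap on pid, the second join should not disturb rows already matched by the first.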
