Splunk Search

Recommended way to join large dataset from multiple data sources?

pmittal
Engager

Hi, I am new to Splunk and have very little knowledge of it. I am looking for help with the following use case:

Query 1 returns process data, query 2 returns container data, query 3 returns container image data, and query 4 returns container tag data. The data is joined as follows:

  1. Query 2 and query 3 are joined on the image id param.
  2. The result is then joined with query 4 on the container id param.
  3. That result is then joined with query 1 on the pid param.

To make matters more complex, data from different hosting environments (DC, private cloud, public cloud) needs to be joined, but some field names differ between environments and need to be mapped first. For example, query 1 returns process data for DC, private cloud, and public cloud, but not all fields are the same, so query 1 can't be used directly to query all three hosting environments in one go. Right now, I run query 1 for each hosting environment separately and then append the results.

I am hitting the 50k record limit with join. I went through several previous posts, and all of them suggest using stats and avoiding join and append. However, the examples use only two data sources, which is not sufficient in my case:

https://community.splunk.com/t5/Splunk-Search/How-to-join-large-tables-with-more-than-50-000-rows-in...

https://community.splunk.com/t5/Splunk-Search/Large-scale-join-between-two-sourcetypes/m-p/549019#M1...

https://community.splunk.com/t5/Splunk-Search/How-to-compare-fields-over-multiple-sourcetypes-withou...

What would the stats search look like in this case?

I am looking for an answer like this: https://community.splunk.com/t5/Splunk-Search/Large-scale-join-between-two-sourcetypes/m-p/549065/hi...

ITWhisperer
SplunkTrust

It probably depends on the specifics of your case.

In general, subsearches are limited to 50k events by default, so the limit is the same whether the events come from a subsearch in a join or in an append. That leaves two options: either break the subsearch up into chunks of fewer than 50k events, or combine all the events into the initial search.
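
As a rough sketch of the second option, the separate searches can be pulled into one base search with OR, and the differing field names mapped onto a common name with coalesce. Every index, sourcetype, and field name below is a placeholder assumed for illustration, not taken from the original queries:

```spl
(index=example_dc OR index=example_private_cloud OR index=example_public_cloud)
    (sourcetype=process OR sourcetype=container OR sourcetype=image OR sourcetype=container_tags)
| eval pid=coalesce(pid, process_id, proc_id)
| eval container_id=coalesce(container_id, containerId)
| eval image_id=coalesce(image_id, imageId)
```

Because everything is retrieved by the main search rather than by a subsearch, the subsearch limit does not apply to these events.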

If you combine the searches, you might be able to use eventstats, rather than just stats, to "join" the events from, say, queries 2 and 3, then use eventstats again to "join" those results with query 4, and finally use stats to "join" everything with query 1.
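
A minimal sketch of that chain might look like the following. It assumes all four event types are returned by one base search, that container events carry image_id, container_id, and pid, and that the field names have already been normalized. The index, sourcetypes, and fields (process_name, container_name, image_name, tag) are hypothetical placeholders:

```spl
index=example_idx (sourcetype=process OR sourcetype=container OR sourcetype=image OR sourcetype=container_tags)
| fields pid process_name container_id container_name image_id image_name tag
| eventstats values(image_name) as image_name by image_id
| eventstats values(tag) as tag by container_id
| stats values(process_name) as process_name
        values(container_id) as container_id
        values(container_name) as container_name
        values(image_id) as image_id
        values(image_name) as image_name
        values(tag) as tag
        by pid
```

The first eventstats copies image fields onto the container events that share an image_id, the second copies tag fields onto the events that share a container_id, and the final stats collapses process and container events by pid. Image and tag events have no pid, so the final stats drops them. If pids can repeat across hosts, the last by clause would probably need host (or a similar field) as well.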

If you still want to use the join command repeatedly with smaller chunks, you should probably use left joins so you don't lose any data that is not matched by the current chunk.
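
For illustration only, chunked left joins could be sketched like this, assuming the tag data can be split into disjoint chunks of fewer than 50k events each. The hosting_env field used to split the chunks, and all other names, are hypothetical:

```spl
index=example_idx sourcetype=container
| join type=left container_id
    [ search index=example_idx sourcetype=container_tags hosting_env="DC"
      | fields container_id tag ]
| join type=left container_id
    [ search index=example_idx sourcetype=container_tags hosting_env="private cloud"
      | fields container_id tag ]
```

With type=left, a container that has no match in one chunk passes through unchanged, so a later chunk can still fill in its tag.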
