Splunk Search

Recommended way to join large datasets from multiple data sources?

pmittal
Engager

Hi, I am new to Splunk and have very little knowledge of it. I am seeking help with the following use case:

Query 1 gives process data, query 2 gives container data, query 3 gives container image data, and query 4 gives container tags data. The data is joined as follows:

  1. Query 2 and query 3 are joined on the image id param.
  2. The resulting data is then joined with query 4 on the container id param.
  3. That result is then joined with query 1 on the pid param.

To make matters more complex, data from different hosting environments (DC, private cloud, public cloud) needs to be joined, but some field names differ between them and need to be mapped first. E.g. query 1 returns process data for DC, private cloud, and public cloud, but not all fields are the same. Hence, query 1 can't be used directly to query all three hosting environments in one go. Right now, I run query 1 for each hosting environment separately and then append the results.

I am hitting the 50k record limit with join. I went through multiple previous posts, and all of them suggested using stats and avoiding join and append. However, the examples use only two data sources, which is not sufficient in my case:

https://community.splunk.com/t5/Splunk-Search/How-to-join-large-tables-with-more-than-50-000-rows-in...

https://community.splunk.com/t5/Splunk-Search/Large-scale-join-between-two-sourcetypes/m-p/549019#M1...

https://community.splunk.com/t5/Splunk-Search/How-to-compare-fields-over-multiple-sourcetypes-withou... 

 

What would the stats search look like in this case?

I am looking for an answer like this: https://community.splunk.com/t5/Splunk-Search/Large-scale-join-between-two-sourcetypes/m-p/549065/hi... 

 

 

 


ITWhisperer
SplunkTrust

It probably depends on the specifics of your case.

In general, subsearches are limited to 50k events, so the limit is the same whether the events come from a subsearch in a join or in an append. That leaves you two options: either break up the subsearch into chunks of fewer than 50k events, or combine all the events into the initial search.

If you combine the searches, you might be able to use eventstats, rather than just stats, to "join" your events from, say, queries 2 and 3, then use eventstats again to "join" these results with query 4, and finally use stats to "join" everything with query 1.
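
As a rough sketch of that combined search, something along these lines might work. Every index, sourcetype, and field name here (your_index, container, container_image, container_tags, process, process_id, image_name, tag, process_name) is a placeholder I made up, not something taken from your data, and it assumes the container events carry the image id, container id, and pid fields. The eval with coalesce shows one possible way to map differently named fields from the different hosting environments onto a single name before grouping.

```spl
(index=your_index sourcetype=container) OR (index=your_index sourcetype=container_image) OR (index=your_index sourcetype=container_tags) OR (index=your_index sourcetype=process)
| eval pid=coalesce(pid, process_id)
| fields sourcetype pid container_id image_id image_name tag process_name
| eventstats values(image_name) as image_name by image_id
| eventstats values(tag) as tag by container_id
| stats values(container_id) as container_id values(image_name) as image_name values(tag) as tag values(process_name) as process_name by pid
```

The first eventstats copies the image fields onto every event sharing an image_id, the second copies the tags onto every event sharing a container_id, and the final stats collapses the container and process events that share a pid into one row. Events without a pid (the image and tag events) drop out at the stats step, after their values have already been copied across. If pids can repeat across hosts or hosting environments, you would probably need to add another field such as host to the final by clause.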

If you still want to use the join command repeatedly with smaller chunks, you should probably use left joins so you don't lose any data not matched to the current chunk.
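
A minimal sketch of the chunked approach, again with made-up names: hosting_env and its values are hypothetical and only stand in for whatever filter keeps each subsearch under the limit. It also assumes that any given pid matches in at most one chunk.

```spl
index=your_index sourcetype=process
| join type=left pid
    [ search index=your_index sourcetype=container hosting_env="dc"
      | fields pid container_id image_id ]
| join type=left pid
    [ search index=your_index sourcetype=container hosting_env="private_cloud"
      | fields pid container_id image_id ]
```

With type=left, process events that find no match in the first chunk are kept and can still pick up their container fields from a later chunk.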
